Beginner to Intermediate · 3 min read

How to preprocess text for RAG

Quick answer
Preprocessing text for RAG involves cleaning the raw text, splitting it into manageable chunks, and converting those chunks into vector embeddings for retrieval. This ensures the retriever can efficiently find relevant context to augment generation by the LLM.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0
  • pip install langchain>=0.2 langchain-openai

Setup

Install necessary Python packages and set your environment variable for the OpenAI API key.

bash
pip install openai langchain langchain-openai

Step by step

This example shows how to preprocess text for RAG by cleaning, chunking, and embedding using langchain and OpenAI embeddings.

python
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Sample raw text
raw_text = """Retrieval-Augmented Generation (RAG) combines a retriever and a generator to improve LLM responses by grounding them in external documents."""

# Step 1: Clean text (basic example)
clean_text = raw_text.replace('\n', ' ').strip()

# Step 2: Chunk text into smaller pieces
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = text_splitter.split_text(clean_text)

# Step 3: Create embeddings for chunks
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
chunk_embeddings = embeddings.embed_documents(chunks)  # one batched call instead of one request per chunk

print("Chunks:", chunks)
print("Embedding vector length:", len(chunk_embeddings[0]))
output
Chunks: ['Retrieval-Augmented Generation (RAG) combines a retriever and a generator to improve LLM responses by grounding them in external documents.']
Embedding vector length: 1536
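
Once chunks are embedded, retrieval reduces to ranking chunks by similarity between a query embedding and each chunk embedding. Here is a minimal sketch of that ranking using cosine similarity over plain Python lists — the tiny three-dimensional vectors are made-up stand-ins for illustration; in practice you would use the 1536-dimensional embeddings produced above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy vectors standing in for real embeddings
chunks = ["about retrieval", "about generation", "about databases"]
vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.0, 0.4]]
query = [1.0, 0.0, 0.1]

print(top_k(query, vecs, chunks, k=2))  # ['about retrieval', 'about databases']
```

A vector database does the same ranking at scale with approximate nearest-neighbor indexes rather than a full sort.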

Common variations

You can preprocess text asynchronously, use different chunking strategies (such as sentence- or token-based splitting), or switch embedding models (for example, OpenAI's text-embedding-3-small or text-embedding-3-large). Adjust chunk size and overlap based on your retrieval needs.

python
import asyncio
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

async def async_embed_text(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
    chunks = text_splitter.split_text(text)
    embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
    # Embed all chunks concurrently via the async embedding API
    results = await asyncio.gather(*(embeddings.aembed_query(chunk) for chunk in chunks))
    return chunks, list(results)

async def main():
    text = "RAG improves LLMs by retrieving relevant documents dynamically."
    chunks, embeds = await async_embed_text(text)
    print("Chunks:", chunks)
    print("First embedding length:", len(embeds[0]))

asyncio.run(main())
output
Chunks: ['RAG improves LLMs by retrieving relevant documents dynamically.']
First embedding length: 1536
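
The sentence-based strategy mentioned above can be sketched without any library: split on sentence boundaries, then greedily pack whole sentences into chunks up to a size limit. This is a simplified illustration, not LangChain API — the boundary regex and the max_chars parameter are assumptions:

```python
import re

def sentence_chunks(text, max_chars=100):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence boundary: ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = ("RAG retrieves documents. It grounds the model. "
        "Chunking keeps each piece small enough to embed.")
for chunk in sentence_chunks(text, max_chars=60):
    print(chunk)
```

Because sentences are never split mid-way, each chunk stays a coherent unit of meaning, at the cost of less predictable chunk sizes.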

Troubleshooting

  • If embeddings return empty or errors, verify your OPENAI_API_KEY is set correctly in the environment.
  • If chunks are too large or too small, adjust chunk_size and chunk_overlap in your splitter.
  • For noisy text, add preprocessing steps like lowercasing, removing special characters, or normalizing whitespace.
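
The cleaning steps in the last bullet can be combined into one small helper. A minimal sketch using only the standard library — the exact normalizations worth applying depend on your corpus:

```python
import re
import unicodedata

def clean_text(text):
    """Normalize unicode, lowercase, drop special characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)   # e.g. non-breaking space -> space
    text = text.lower()
    text = re.sub(r"[^a-z0-9.,!?'\-\s]", " ", text)  # keep basic punctuation only
    text = re.sub(r"\s+", " ", text)                 # collapse runs of whitespace
    return text.strip()

print(clean_text("  RAG\u00a0Systems:\n\nRetrieval  +  Generation!  "))
# rag systems retrieval generation!
```

Apply this before chunking so stray formatting never ends up inside an embedding.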

Key Takeaways

  • Clean and normalize raw text before chunking to improve retrieval quality.
  • Use chunking with overlap to preserve context across document splits.
  • Generate vector embeddings for each chunk to enable efficient similarity search.
  • Adjust chunk size and embedding model based on your RAG system’s latency and accuracy needs.
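
The overlap takeaway is easy to see with a bare sliding-window chunker: each chunk repeats the tail of the previous one, so no context is cut off exactly at a boundary. A minimal sketch with illustrative parameters:

```python
def sliding_chunks(text, chunk_size=20, overlap=5):
    """Fixed-size character chunks; each chunk repeats the previous chunk's
    last `overlap` characters so context survives the split."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

for c in sliding_chunks("Retrieval-Augmented Generation grounds LLM answers.", 20, 5):
    print(repr(c))
```

Production splitters like RecursiveCharacterTextSplitter refine this idea by preferring paragraph, line, and word boundaries over raw character offsets.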
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022