Beginner to Intermediate · 3 min read

How to preprocess text for RAG

Quick answer
Preprocessing text for RAG involves cleaning the raw text, splitting it into manageable chunks, and converting those chunks into vector embeddings for retrieval. This ensures the retriever can efficiently find relevant context to augment generation by the LLM.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0
  • pip install langchain>=0.2 langchain-openai

Setup

Install necessary Python packages and set your environment variable for the OpenAI API key.

bash
pip install openai langchain langchain-openai

Step by step

This example shows how to preprocess text for RAG by cleaning, chunking, and embedding using langchain and OpenAI embeddings.

python
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Sample raw text
raw_text = """Retrieval-Augmented Generation (RAG) combines a retriever and a generator to improve LLM responses by grounding them in external documents."""

# Step 1: Clean text (basic example)
clean_text = raw_text.replace('\n', ' ').strip()

# Step 2: Chunk text into smaller pieces
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = text_splitter.split_text(clean_text)

# Step 3: Create embeddings for chunks
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
chunk_embeddings = embeddings.embed_documents(chunks)  # one batched call instead of one request per chunk

print("Chunks:", chunks)
print("Embedding vector length:", len(chunk_embeddings[0]))
output
Chunks: ['Retrieval-Augmented Generation (RAG) combines a retriever and a generator to improve LLM responses by grounding them in external documents.']
Embedding vector length: 1536
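
Once chunks are embedded, retrieval reduces to ranking chunks by similarity between a query embedding and each chunk embedding. Here is a minimal sketch of that ranking using cosine similarity over plain Python lists — the tiny three-dimensional vectors are made-up stand-ins for illustration; in practice you would use the 1536-dimensional embeddings produced above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy vectors standing in for real embeddings
chunks = ["about retrieval", "about generation", "about databases"]
vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.0, 0.4]]
query = [1.0, 0.0, 0.1]

print(top_k(query, vecs, chunks, k=2))  # ['about retrieval', 'about databases']
```

A vector database does the same ranking at scale with approximate nearest-neighbor indexes rather than a full sort.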

Common variations

You can preprocess text asynchronously, use different chunking strategies (such as sentence- or token-based splitting), or switch embedding models (for example, OpenAI's text-embedding-3-small or text-embedding-3-large). Adjust chunk size and overlap based on your retrieval needs.

python
import asyncio
import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

async def async_embed_text(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
    chunks = text_splitter.split_text(text)
    embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])
    # Embed all chunks concurrently via the async embedding API
    results = await asyncio.gather(*(embeddings.aembed_query(chunk) for chunk in chunks))
    return chunks, list(results)

async def main():
    text = "RAG improves LLMs by retrieving relevant documents dynamically."
    chunks, embeds = await async_embed_text(text)
    print("Chunks:", chunks)
    print("First embedding length:", len(embeds[0]))

asyncio.run(main())
output
Chunks: ['RAG improves LLMs by retrieving relevant documents dynamically.']
First embedding length: 1536
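
The sentence-based strategy mentioned above can be sketched without any library: split on sentence boundaries, then greedily pack whole sentences into chunks up to a size limit. This is a simplified illustration, not LangChain API — the boundary regex and the max_chars parameter are assumptions:

```python
import re

def sentence_chunks(text, max_chars=100):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence boundary: ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = ("RAG retrieves documents. It grounds the model. "
        "Chunking keeps each piece small enough to embed.")
for chunk in sentence_chunks(text, max_chars=60):
    print(chunk)
```

Because sentences are never split mid-way, each chunk stays a coherent unit of meaning, at the cost of less predictable chunk sizes.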

Troubleshooting

  • If embeddings return empty or errors, verify your OPENAI_API_KEY is set correctly in the environment.
  • If chunks are too large or too small, adjust chunk_size and chunk_overlap in your splitter.
  • For noisy text, add preprocessing steps like lowercasing, removing special characters, or normalizing whitespace.
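
The cleaning steps in the last bullet can be combined into one small helper. A minimal sketch using only the standard library — the exact normalizations worth applying depend on your corpus:

```python
import re
import unicodedata

def clean_text(text):
    """Normalize unicode, lowercase, drop special characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)   # e.g. non-breaking space -> space
    text = text.lower()
    text = re.sub(r"[^a-z0-9.,!?'\-\s]", " ", text)  # keep basic punctuation only
    text = re.sub(r"\s+", " ", text)                 # collapse runs of whitespace
    return text.strip()

print(clean_text("  RAG\u00a0Systems:\n\nRetrieval  +  Generation!  "))
# rag systems retrieval generation!
```

Apply this before chunking so stray formatting never ends up inside an embedding.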

Key Takeaways

  • Clean and normalize raw text before chunking to improve retrieval quality.
  • Use chunking with overlap to preserve context across document splits.
  • Generate vector embeddings for each chunk to enable efficient similarity search.
  • Adjust chunk size and embedding model based on your RAG system’s latency and accuracy needs.
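
The overlap takeaway is easy to see with a bare sliding-window chunker: each chunk repeats the tail of the previous one, so no context is cut off exactly at a boundary. A minimal sketch with illustrative parameters:

```python
def sliding_chunks(text, chunk_size=20, overlap=5):
    """Fixed-size character chunks; each chunk repeats the previous chunk's
    last `overlap` characters so context survives the split."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

for c in sliding_chunks("Retrieval-Augmented Generation grounds LLM answers.", 20, 5):
    print(repr(c))
```

Production splitters like RecursiveCharacterTextSplitter refine this idea by preferring paragraph, line, and word boundaries over raw character offsets.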
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022