How to chunk documents for RAG
Quick answer
To chunk documents for RAG, split large texts into smaller, semantically coherent pieces (e.g., paragraphs or fixed token lengths) that fit within model context limits. Use consistent chunk sizes (typically 500-1000 tokens) to balance embedding quality and retrieval relevance.
Prerequisites
- Python 3.8+
- pip install tiktoken (used for token counting; the chunking example below runs locally and needs no API key)
Setup
Install the tiktoken package, which provides the tokenizer used to count tokens per chunk:
pip install tiktoken
Step by step
This example shows how to chunk a document by splitting on paragraphs and limiting chunk size by tokens using the tiktoken tokenizer for OpenAI models.
import tiktoken

# Sample document text
text = """Retrieval-Augmented Generation (RAG) combines retrieval with LLMs to improve accuracy.\n\nChunking documents properly is key to effective retrieval. Chunks should be semantically coherent and fit within token limits.\n\nCommon chunk sizes range from 500 to 1000 tokens depending on the model context window.\n\nYou can split by paragraphs or use sliding windows for overlap to preserve context."""

# Initialize tokenizer for gpt-4o
enc = tiktoken.encoding_for_model("gpt-4o")

# Split text into paragraphs
paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

chunks = []
current_chunk = []
current_tokens = 0
max_tokens = 80  # max tokens per chunk (small here so the short sample text splits)

for para in paragraphs:
    para_tokens = len(enc.encode(para))
    if current_chunk and current_tokens + para_tokens > max_tokens:
        # Save the current chunk and start a new one with this paragraph
        chunks.append(" ".join(current_chunk))
        current_chunk = [para]
        current_tokens = para_tokens
    else:
        current_chunk.append(para)
        current_tokens += para_tokens

# Add the last chunk
if current_chunk:
    chunks.append(" ".join(current_chunk))

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (tokens: {len(enc.encode(chunk))}):\n{chunk}\n")

Output
Chunk 1 (tokens: 69):
Retrieval-Augmented Generation (RAG) combines retrieval with LLMs to improve accuracy. Chunking documents properly is key to effective retrieval. Chunks should be semantically coherent and fit within token limits.

Chunk 2 (tokens: 44):
Common chunk sizes range from 500 to 1000 tokens depending on the model context window. You can split by paragraphs or use sliding windows for overlap to preserve context.
Common variations
You can chunk documents using different strategies:
- Fixed token windows: Split text into fixed-size token chunks with optional overlap for context continuity.
- Semantic chunking: Use NLP tools to split by sentences or topics for better semantic coherence.
- Streaming or async: Chunk documents on the fly when processing large corpora asynchronously.
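The fixed-token-window strategy can be sketched as a small helper that operates on token IDs from any tokenizer (for example, the list returned by tiktoken's enc.encode). The function name sliding_windows and its defaults are illustrative, not a library API:

```python
def sliding_windows(tokens, chunk_size=500, overlap=50):
    """Split a token-ID sequence into fixed-size windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows
```

Decode each window back to text with the same tokenizer (e.g. enc.decode(window)) before embedding. The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary survives intact in at least one chunk.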
Troubleshooting
If chunks are too large, embeddings may truncate or lose context, reducing retrieval quality. Reduce max_tokens, or split oversized paragraphs at sentence boundaries.
If chunks are too small, retrieval may become inefficient and noisy. Balance chunk size for your use case.
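One lightweight way to check that balance is to flag chunks whose token counts fall outside a target band. flag_outliers is a hypothetical helper, and the 100-1000 band is just an example threshold:

```python
def flag_outliers(chunk_token_counts, min_tokens=100, max_tokens=1000):
    """Return indices of chunks whose token counts fall outside [min_tokens, max_tokens]."""
    return [i for i, n in enumerate(chunk_token_counts)
            if n < min_tokens or n > max_tokens]
```

Run it on len(enc.encode(chunk)) for each chunk, then merge the flagged small chunks with a neighbor or re-split the flagged large ones.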
Key takeaways
- Chunk documents into 500-1000 token pieces for optimal RAG embedding and retrieval.
- Use paragraph or semantic boundaries to keep chunks coherent and meaningful.
- Adjust chunk size and overlap based on your model's context window and retrieval needs.