What chunk size to use for RAG
Quick answer
Use chunk sizes between 500 and 1000 tokens for
RAG to balance retrieval relevance and LLM input limits. Smaller chunks improve retrieval precision but increase index size; larger chunks reduce index size but risk losing fine-grained context.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable.
pip install openai>=1.0

Step by step
This example splits a document into chunks of roughly 750 tokens (approximated here by word count) and sends the first chunk to OpenAI's gpt-4o model. In a full RAG pipeline, you would embed and index every chunk before querying.
import os
from openai import OpenAI
# Initialize client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example document text
text = """Your long document text goes here. It can be several thousand tokens long, and will be split into chunks for retrieval."""
# Approximate tokens by whitespace-separated words (an English word is
# roughly 1.3 tokens; use a real tokenizer such as tiktoken for accuracy)
words = text.split()
chunk_size = 750  # chunk size in words, a rough proxy for tokens
chunks = [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
print(f"Total chunks created: {len(chunks)}")
# Example: query the first chunk with GPT-4o
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": chunks[0]}]
)
print("Response from model:", response.choices[0].message.content)

Output

Total chunks created: 3 (the exact count depends on your document's length)
Response from model: <model's generated text based on first chunk>
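The index-size trade-off in the quick answer can be made concrete with a little arithmetic. This sketch assumes an illustrative 10,000-token document and compares the chunk sizes discussed in this article:

```python
import math

def chunk_count(doc_tokens: int, chunk_size: int) -> int:
    """Number of chunks needed to cover a document of doc_tokens tokens."""
    return math.ceil(doc_tokens / chunk_size)

# A hypothetical 10,000-token document at the sizes discussed above:
for size in (300, 500, 750, 1000):
    print(f"chunk_size={size}: {chunk_count(10_000, size)} chunks")
# chunk_size=300: 34 chunks
# chunk_size=500: 20 chunks
# chunk_size=750: 14 chunks
# chunk_size=1000: 10 chunks
```

Halving the chunk size roughly doubles the number of vectors you must store and search, which is why smaller chunks increase index size and cost.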
Common variations
You can adjust chunk size based on your use case:
- Smaller chunks (300-500 tokens): Better retrieval precision, more granular context, but larger index size.
- Larger chunks (800-1000 tokens): Fewer chunks, faster retrieval, but risk losing fine details.
- Use semantic chunking: Split on natural boundaries like paragraphs or sections instead of fixed token counts.
- Async or streaming: Use async SDK calls or streaming completions for large-scale RAG pipelines.
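The semantic-chunking variation above can be sketched with the standard library alone: split on blank lines (paragraph boundaries), then merge consecutive paragraphs until a word budget is reached. The 750-word budget mirrors the fixed-size example and, as before, is only a rough token proxy:

```python
from typing import List

def semantic_chunks(text: str, max_words: int = 750) -> List[str]:
    """Split on paragraph boundaries, then merge paragraphs into
    chunks that stay under a word budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks never split a paragraph in half, each one stays a coherent unit of meaning, which tends to improve retrieval relevance over fixed-size cuts.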
Troubleshooting
If you see poor retrieval relevance, try reducing chunk size to capture finer context.
If you hit token limits on your LLM calls, reduce chunk size or summarize chunks before indexing.
Use the same tokenization method for chunking and for embedding generation; a mismatch can make chunks silently exceed the embedding model's token limit or be split inconsistently.
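When a real tokenizer is not available at chunking time, a common rule of thumb is roughly 4 characters per token for English text. The following sanity check is a sketch built on that heuristic; the names and the 8192-token limit are illustrative placeholders, so substitute your embedding model's documented maximum:

```python
from typing import List

def estimated_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def oversized_chunks(chunks: List[str], limit: int = 8192) -> List[int]:
    """Return indices of chunks whose estimated token count exceeds limit.

    The default limit is a placeholder for illustration; check your
    embedding model's documentation for its real maximum.
    """
    return [i for i, c in enumerate(chunks) if estimated_tokens(c) > limit]
```

Running this over your chunks before indexing flags likely offenders early, instead of surfacing them as embedding API errors later.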
Key Takeaways
- Use chunk sizes between 500 and 1000 tokens for balanced RAG performance.
- Smaller chunks improve retrieval precision but increase index size and cost.
- Larger chunks reduce index size but may lose fine-grained context.
- Semantic chunking on natural boundaries often yields better results than fixed sizes.
- Adjust chunk size based on your LLM token limits and retrieval quality needs.