What chunk size to use for RAG
Quick answer
Use chunk sizes between 500 and 1000 tokens for
RAG to balance retrieval relevance and LLM input limits. Smaller chunks improve retrieval precision but increase index size; larger chunks reduce index size but risk losing fine-grained context.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable.
pip install openai>=1.0

Step by step
This example splits a document into chunks of roughly 750 tokens (approximated here by word count) and sends the first chunk to OpenAI's gpt-4o model. In a full RAG pipeline, you would embed and index every chunk before querying.
import os
from openai import OpenAI
# Initialize client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example document text
text = """Your long document text goes here. It can be several thousand tokens long, and will be split into chunks for retrieval."""
# Approximate tokens by whitespace-separated words (an English word is
# roughly 1.3 tokens; use a real tokenizer such as tiktoken for accuracy)
words = text.split()
chunk_size = 750  # chunk size in words, a rough proxy for tokens
chunks = [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]
print(f"Total chunks created: {len(chunks)}")
# Example: query the first chunk with GPT-4o
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": chunks[0]}]
)
print("Response from model:", response.choices[0].message.content)

Output

Total chunks created: 3 (the exact count depends on your document's length)
Response from model: <model's generated text based on first chunk>
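The index-size trade-off in the quick answer can be made concrete with a little arithmetic. This sketch assumes an illustrative 10,000-token document and compares the chunk sizes discussed in this article:

```python
import math

def chunk_count(doc_tokens: int, chunk_size: int) -> int:
    """Number of chunks needed to cover a document of doc_tokens tokens."""
    return math.ceil(doc_tokens / chunk_size)

# A hypothetical 10,000-token document at the sizes discussed above:
for size in (300, 500, 750, 1000):
    print(f"chunk_size={size}: {chunk_count(10_000, size)} chunks")
# chunk_size=300: 34 chunks
# chunk_size=500: 20 chunks
# chunk_size=750: 14 chunks
# chunk_size=1000: 10 chunks
```

Halving the chunk size roughly doubles the number of vectors you must store and search, which is why smaller chunks increase index size and cost.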
Common variations
You can adjust chunk size based on your use case:
- Smaller chunks (300-500 tokens): Better retrieval precision, more granular context, but larger index size.
- Larger chunks (800-1000 tokens): Fewer chunks, faster retrieval, but risk losing fine details.
- Use semantic chunking: Split on natural boundaries like paragraphs or sections instead of fixed token counts.
- Async or streaming: Use async SDK calls or streaming completions for large-scale RAG pipelines.
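The semantic-chunking variation above can be sketched with the standard library alone: split on blank lines (paragraph boundaries), then merge consecutive paragraphs until a word budget is reached. The 750-word budget mirrors the fixed-size example and, as before, is only a rough token proxy:

```python
from typing import List

def semantic_chunks(text: str, max_words: int = 750) -> List[str]:
    """Split on paragraph boundaries, then merge paragraphs into
    chunks that stay under a word budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks never split a paragraph in half, each one stays a coherent unit of meaning, which tends to improve retrieval relevance over fixed-size cuts.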
Troubleshooting
If you see poor retrieval relevance, try reducing chunk size to capture finer context.
If you hit token limits on your LLM calls, reduce chunk size or summarize chunks before indexing.
Use the same tokenization method for chunking and for embedding generation; a mismatch can make chunks silently exceed the embedding model's token limit or be split inconsistently.
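When a real tokenizer is not available at chunking time, a common rule of thumb is roughly 4 characters per token for English text. The following sanity check is a sketch built on that heuristic; the names and the 8192-token limit are illustrative placeholders, so substitute your embedding model's documented maximum:

```python
from typing import List

def estimated_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def oversized_chunks(chunks: List[str], limit: int = 8192) -> List[int]:
    """Return indices of chunks whose estimated token count exceeds limit.

    The default limit is a placeholder for illustration; check your
    embedding model's documentation for its real maximum.
    """
    return [i for i, c in enumerate(chunks) if estimated_tokens(c) > limit]
```

Running this over your chunks before indexing flags likely offenders early, instead of surfacing them as embedding API errors later.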
Key Takeaways
- Use chunk sizes between 500 and 1000 tokens for balanced RAG performance.
- Smaller chunks improve retrieval precision but increase index size and cost.
- Larger chunks reduce index size but may lose fine-grained context.
- Semantic chunking on natural boundaries often yields better results than fixed sizes.
- Adjust chunk size based on your LLM token limits and retrieval quality needs.