Best For Intermediate · 3 min read

Best chunk size for RAG

Quick answer
For RAG, the best chunk size typically ranges between 500 and 1,000 tokens, balancing retrieval relevance against context window limits. Chunks of around 800 tokens are a strong default with modern LLMs and vector databases.

RECOMMENDATION

Use chunk sizes of approximately 800 tokens for RAG to maximize retrieval accuracy and maintain efficient context usage without excessive overlap or fragmentation.
| Use case | Best choice | Why | Runner-up |
|---|---|---|---|
| General document RAG | 800 tokens | Balances context window and retrieval precision well | 500 tokens |
| Long technical manuals | 1000 tokens | Captures detailed sections without too many splits | 800 tokens |
| Short FAQs or chat logs | 300-500 tokens | Keeps chunks concise for precise retrieval | 800 tokens |
| Multimodal or image captioning RAG | 500-700 tokens | Fits well with multimodal context limits | 800 tokens |
| Low-latency applications | 400-600 tokens | Smaller chunks reduce retrieval and embedding latency | 800 tokens |

Top picks explained

For RAG workflows, chunk size is critical for balancing retrieval quality against model context limits. Around 800 tokens is the sweet spot for most use cases: it gives embeddings enough context to capture semantic meaning without fragmenting documents excessively. Larger chunks of ~1,000 tokens work well for dense technical content, while smaller chunks of 300-500 tokens suit short FAQs or chat logs where precision matters most.

Choosing a chunk size also depends on your LLM's context window and your vector store's capabilities. Overly large chunks reduce retrieval granularity; overly small chunks increase embedding overhead and retrieval noise.
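Before calling any API, you can sanity-check candidate chunk sizes with a rough word-based token estimate. The ~0.75 words-per-token ratio below is a common rule of thumb for English text, not an exact measure; use a real tokenizer (like tiktoken, shown later) for precise counts.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: English text averages ~0.75 words per token."""
    return round(len(text.split()) / 0.75)

def within_budget(text: str, max_tokens: int = 800) -> bool:
    """Check whether a candidate chunk stays inside the token budget."""
    return estimate_tokens(text) <= max_tokens

print(estimate_tokens("one two three"))  # 4
```

This is only a quick heuristic for planning; actual token counts vary with vocabulary and formatting.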

In practice

Here is a Python example using OpenAI embeddings and chunking text into ~800 tokens for RAG indexing:

```python
import os
from openai import OpenAI
import tiktoken

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Split text into chunks of at most max_tokens tokens
def chunk_text(text, max_tokens=800):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        # Decode each fixed-size token window back into text
        chunks.append(tokenizer.decode(tokens[i : i + max_tokens]))
    return chunks

# Example document
text = """Your long document text goes here..."""

chunks = chunk_text(text)

# Create embeddings for each chunk
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(model="text-embedding-3-small", input=chunk)
    embeddings.append(response.data[0].embedding)

print(f"Created {len(chunks)} chunks with embeddings.")
```

Output:

```
Created X chunks with embeddings.
```

Pricing and limits

| Option | Free | Cost | Limits | Context |
|---|---|---|---|---|
| text-embedding-3-small | Yes, limited free tier | $0.02 / 1M tokens | Max 8,191 tokens per input | Ideal for chunk embedding in RAG |
| Chunk size ~800 tokens | N/A | N/A | Fits well within typical 4K-8K token context windows | Balances retrieval granularity and context |
| Smaller chunks (~300 tokens) | N/A | N/A | More API calls, higher overhead | Better for short, precise queries |
| Larger chunks (~1000 tokens) | N/A | N/A | May reduce retrieval precision | Better for dense technical docs |
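To budget an indexing run, multiply total corpus tokens by the per-million-token price. The sketch below assumes the $0.02 / 1M tokens figure from the table; check current provider pricing, as rates change.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float = 0.02) -> float:
    """Estimated embedding spend at a given per-million-token price."""
    return total_tokens * price_per_million / 1_000_000

# Hypothetical corpus: 10,000 chunks of ~800 tokens each
total = 10_000 * 800
print(f"${embedding_cost_usd(total):.2f}")  # $0.16
```

Note that chunk size changes the call count but not the total token volume, so cost differences between sizes come mostly from per-request overhead.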

What to avoid

  • Avoid very large chunks (>1000 tokens) as they reduce retrieval granularity and can exceed model context limits.
  • Do not use very small chunks (<300 tokens) excessively, as this increases embedding calls and noise.
  • Avoid inconsistent chunking that leaves gaps or unintended duplication between chunks, which degrades retrieval quality.
  • Do not ignore your LLM's context window size when choosing chunk size.
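One common way to avoid gaps at chunk boundaries is a fixed-size sliding window with a consistent overlap. This is a minimal sketch operating on a pre-tokenized list (the token ids from the earlier `chunk_text` example would work); the 800/100 defaults are illustrative, not prescriptive.

```python
def chunk_with_overlap(tokens, size=800, overlap=100):
    """Fixed-size windows with a consistent overlap, so adjacent
    chunks share context and boundaries never leave gaps."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = [tokens[i : i + size] for i in range(0, len(tokens), step)]
    # Drop a trailing window that is fully contained in the previous one
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks

windows = chunk_with_overlap(list(range(2000)), size=800, overlap=100)
print([len(w) for w in windows])  # [800, 800, 600]
```

The overlap costs extra embedding tokens (here ~12.5% more), so keep it small relative to the chunk size.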

How to evaluate for your case

Benchmark chunk sizes by measuring retrieval accuracy and latency on a validation set. Use metrics like recall@k and embedding cost. Experiment with chunk sizes from 300 to 1000 tokens and select the size that maximizes retrieval relevance while minimizing cost and latency.
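The recall@k metric mentioned above can be computed with a few lines of Python. The validation pairs below are hypothetical placeholders; in practice you would collect (query, relevant-chunk-ids) pairs from your own data and run retrieval at each candidate chunk size.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical validation set: (top-k retrieved ids, gold relevant ids)
eval_set = [
    (["c3", "c1", "c7"], ["c1", "c9"]),
    (["c2", "c4", "c5"], ["c2"]),
]
scores = [recall_at_k(got, gold, k=3) for got, gold in eval_set]
print(sum(scores) / len(scores))  # 0.75
```

Averaging recall@k across the validation set for each chunk size gives a single number to compare against embedding cost and latency.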

Key Takeaways

  • Use ~800 token chunks for balanced RAG retrieval and context usage.
  • Adjust chunk size based on document type and LLM context window.
  • Avoid too large or too small chunks to prevent retrieval quality loss.
  • Benchmark chunk sizes with your data for optimal results.
Verified 2026-04 · text-embedding-3-small, gpt-4o-mini