Best For Intermediate · 3 min read

Best chunk size for RAG

Quick answer
For RAG, the best chunk size typically ranges between 500 and 1,000 tokens, balancing retrieval relevance against context window limits. Chunks of around 800 tokens are a strong default with modern LLMs and vector databases.

RECOMMENDATION

Use chunk sizes of approximately 800 tokens for RAG to maximize retrieval accuracy and maintain efficient context usage without excessive overlap or fragmentation.
| Use case | Best choice | Why | Runner-up |
|---|---|---|---|
| General document RAG | 800 tokens | Balances context window and retrieval precision well | 500 tokens |
| Long technical manuals | 1000 tokens | Captures detailed sections without too many splits | 800 tokens |
| Short FAQs or chat logs | 300-500 tokens | Keeps chunks concise for precise retrieval | 800 tokens |
| Multimodal or image captioning RAG | 500-700 tokens | Fits well with multimodal context limits | 800 tokens |
| Low-latency applications | 400-600 tokens | Smaller chunks reduce retrieval and embedding latency | 800 tokens |

Top picks explained

For RAG workflows, chunk size is critical for balancing retrieval quality against model context limits. Around 800 tokens is the sweet spot for most use cases: it gives embeddings enough context to capture semantic meaning without fragmenting documents excessively. Larger chunks of ~1,000 tokens work well for dense technical content, while smaller chunks of 300-500 tokens suit short FAQs or chat logs where precision matters most.

Choosing a chunk size also depends on your LLM's context window and your vector store's capabilities. Overly large chunks reduce retrieval granularity; overly small chunks increase embedding overhead and retrieval noise.
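Before calling any API, you can sanity-check candidate chunk sizes with a rough word-based token estimate. The ~0.75 words-per-token ratio below is a common rule of thumb for English text, not an exact measure; use a real tokenizer (like tiktoken, shown later) for precise counts.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: English text averages ~0.75 words per token."""
    return round(len(text.split()) / 0.75)

def within_budget(text: str, max_tokens: int = 800) -> bool:
    """Check whether a candidate chunk stays inside the token budget."""
    return estimate_tokens(text) <= max_tokens

print(estimate_tokens("one two three"))  # 4
```

This is only a quick heuristic for planning; actual token counts vary with vocabulary and formatting.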

In practice

Here is a Python example using OpenAI embeddings and chunking text into ~800 tokens for RAG indexing:

```python
import os
from openai import OpenAI
import tiktoken

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Split text into chunks of at most max_tokens tokens
def chunk_text(text, max_tokens=800):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        # Decode each fixed-size token window back into text
        chunks.append(tokenizer.decode(tokens[i : i + max_tokens]))
    return chunks

# Example document
text = """Your long document text goes here..."""

chunks = chunk_text(text)

# Create embeddings for each chunk
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(model="text-embedding-3-small", input=chunk)
    embeddings.append(response.data[0].embedding)

print(f"Created {len(chunks)} chunks with embeddings.")
```

Output:

```
Created X chunks with embeddings.
```

Pricing and limits

| Option | Free | Cost | Limits | Context |
|---|---|---|---|---|
| text-embedding-3-small | Yes, limited free tier | $0.02 / 1M tokens | Max 8,191 tokens per input | Ideal for chunk embedding in RAG |
| Chunk size ~800 tokens | N/A | N/A | Fits well within typical 4K-8K token context windows | Balances retrieval granularity and context |
| Smaller chunks (~300 tokens) | N/A | N/A | More API calls, higher overhead | Better for short, precise queries |
| Larger chunks (~1000 tokens) | N/A | N/A | May reduce retrieval precision | Better for dense technical docs |
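To budget an indexing run, multiply total corpus tokens by the per-million-token price. The sketch below assumes the $0.02 / 1M tokens figure from the table; check current provider pricing, as rates change.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float = 0.02) -> float:
    """Estimated embedding spend at a given per-million-token price."""
    return total_tokens * price_per_million / 1_000_000

# Hypothetical corpus: 10,000 chunks of ~800 tokens each
total = 10_000 * 800
print(f"${embedding_cost_usd(total):.2f}")  # $0.16
```

Note that chunk size changes the call count but not the total token volume, so cost differences between sizes come mostly from per-request overhead.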

What to avoid

  • Avoid very large chunks (>1000 tokens) as they reduce retrieval granularity and can exceed model context limits.
  • Do not use very small chunks (<300 tokens) excessively, as this increases embedding calls and noise.
  • Avoid inconsistent chunking that leaves gaps or unintended duplication between chunks, which degrades retrieval quality.
  • Do not ignore your LLM's context window size when choosing chunk size.
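One common way to avoid gaps at chunk boundaries is a fixed-size sliding window with a consistent overlap. This is a minimal sketch operating on a pre-tokenized list (the token ids from the earlier `chunk_text` example would work); the 800/100 defaults are illustrative, not prescriptive.

```python
def chunk_with_overlap(tokens, size=800, overlap=100):
    """Fixed-size windows with a consistent overlap, so adjacent
    chunks share context and boundaries never leave gaps."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = [tokens[i : i + size] for i in range(0, len(tokens), step)]
    # Drop a trailing window that is fully contained in the previous one
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks

windows = chunk_with_overlap(list(range(2000)), size=800, overlap=100)
print([len(w) for w in windows])  # [800, 800, 600]
```

The overlap costs extra embedding tokens (here ~12.5% more), so keep it small relative to the chunk size.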

How to evaluate for your case

Benchmark chunk sizes by measuring retrieval accuracy and latency on a validation set. Use metrics like recall@k and embedding cost. Experiment with chunk sizes from 300 to 1000 tokens and select the size that maximizes retrieval relevance while minimizing cost and latency.
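The recall@k metric mentioned above can be computed with a few lines of Python. The validation pairs below are hypothetical placeholders; in practice you would collect (query, relevant-chunk-ids) pairs from your own data and run retrieval at each candidate chunk size.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical validation set: (top-k retrieved ids, gold relevant ids)
eval_set = [
    (["c3", "c1", "c7"], ["c1", "c9"]),
    (["c2", "c4", "c5"], ["c2"]),
]
scores = [recall_at_k(got, gold, k=3) for got, gold in eval_set]
print(sum(scores) / len(scores))  # 0.75
```

Averaging recall@k across the validation set for each chunk size gives a single number to compare against embedding cost and latency.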

Key Takeaways

  • Use ~800 token chunks for balanced RAG retrieval and context usage.
  • Adjust chunk size based on document type and LLM context window.
  • Avoid too large or too small chunks to prevent retrieval quality loss.
  • Benchmark chunk sizes with your data for optimal results.
Verified 2026-04 · text-embedding-3-small, gpt-4o-mini