How to chunk documents for RAG
Quick answer
To chunk documents for RAG, split large texts into smaller, semantically coherent pieces (e.g., paragraphs or fixed token lengths) that fit within model context limits. Use consistent chunk sizes (typically 500-1000 tokens) to balance embedding quality and retrieval relevance.
Prerequisites
- Python 3.8+
- pip install tiktoken (used for token counting; the chunking example below runs locally and needs no API key)
Setup
Install the tiktoken package, which provides the tokenizer used to count tokens per chunk:
pip install tiktoken
Step by step
This example shows how to chunk a document by splitting on paragraphs and limiting chunk size by tokens using the tiktoken tokenizer for OpenAI models.
import tiktoken

# Sample document text
text = """Retrieval-Augmented Generation (RAG) combines retrieval with LLMs to improve accuracy.\n\nChunking documents properly is key to effective retrieval. Chunks should be semantically coherent and fit within token limits.\n\nCommon chunk sizes range from 500 to 1000 tokens depending on the model context window.\n\nYou can split by paragraphs or use sliding windows for overlap to preserve context."""

# Initialize tokenizer for gpt-4o
enc = tiktoken.encoding_for_model("gpt-4o")

# Split text into paragraphs
paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

chunks = []
current_chunk = []
current_tokens = 0
max_tokens = 80  # max tokens per chunk (small here so the short sample text splits)

for para in paragraphs:
    para_tokens = len(enc.encode(para))
    if current_chunk and current_tokens + para_tokens > max_tokens:
        # Save the current chunk and start a new one with this paragraph
        chunks.append(" ".join(current_chunk))
        current_chunk = [para]
        current_tokens = para_tokens
    else:
        current_chunk.append(para)
        current_tokens += para_tokens

# Add the last chunk
if current_chunk:
    chunks.append(" ".join(current_chunk))

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (tokens: {len(enc.encode(chunk))}):\n{chunk}\n")

Output
Chunk 1 (tokens: 69):
Retrieval-Augmented Generation (RAG) combines retrieval with LLMs to improve accuracy. Chunking documents properly is key to effective retrieval. Chunks should be semantically coherent and fit within token limits.

Chunk 2 (tokens: 44):
Common chunk sizes range from 500 to 1000 tokens depending on the model context window. You can split by paragraphs or use sliding windows for overlap to preserve context.
Common variations
You can chunk documents using different strategies:
- Fixed token windows: Split text into fixed-size token chunks with optional overlap for context continuity.
- Semantic chunking: Use NLP tools to split by sentences or topics for better semantic coherence.
- Streaming or async: Chunk documents on the fly when processing large corpora asynchronously.
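The fixed-token-window strategy can be sketched as a small helper that operates on token IDs from any tokenizer (for example, the list returned by tiktoken's enc.encode). The function name sliding_windows and its defaults are illustrative, not a library API:

```python
def sliding_windows(tokens, chunk_size=500, overlap=50):
    """Split a token-ID sequence into fixed-size windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows
```

Decode each window back to text with the same tokenizer (e.g. enc.decode(window)) before embedding. The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary survives intact in at least one chunk.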
Troubleshooting
If chunks are too large, embeddings may truncate or lose context, reducing retrieval quality. Reduce max_tokens, or split oversized paragraphs at sentence boundaries.
If chunks are too small, retrieval may become inefficient and noisy. Balance chunk size for your use case.
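One lightweight way to check that balance is to flag chunks whose token counts fall outside a target band. flag_outliers is a hypothetical helper, and the 100-1000 band is just an example threshold:

```python
def flag_outliers(chunk_token_counts, min_tokens=100, max_tokens=1000):
    """Return indices of chunks whose token counts fall outside [min_tokens, max_tokens]."""
    return [i for i, n in enumerate(chunk_token_counts)
            if n < min_tokens or n > max_tokens]
```

Run it on len(enc.encode(chunk)) for each chunk, then merge the flagged small chunks with a neighbor or re-split the flagged large ones.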
Key takeaways
- Chunk documents into 500-1000 token pieces for optimal RAG embedding and retrieval.
- Use paragraph or semantic boundaries to keep chunks coherent and meaningful.
- Adjust chunk size and overlap based on your model's context window and retrieval needs.