Concept Intermediate · 3 min read

What is late chunking in RAG?

Quick answer
Late chunking in Retrieval-Augmented Generation (RAG) is a method where documents are retrieved as whole units first and split into smaller chunks only after retrieval. In contrast to early chunking, this improves retrieval relevance and avoids unnecessary processing, since only the most relevant documents are ever chunked.

How it works

In RAG, the system combines a retrieval component with a language model to generate answers grounded in external documents. Late chunking means the retrieval system first fetches entire documents based on the query, then the documents are split into smaller chunks only after retrieval. This is like first selecting the most relevant books from a library, then opening those books to find the exact pages needed, rather than pre-splitting every book into pages before searching.

This approach reduces the number of chunks the system processes, improving efficiency and relevance because chunking happens only on documents already deemed relevant.
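The retrieve-then-chunk flow described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the keyword-overlap scorer and fixed-size word chunker are toy stand-ins for a real embedding-based retriever and text splitter.

```python
def score(query: str, doc: str) -> int:
    """Toy relevance score: count query words that appear in the document."""
    q_words = set(query.lower().split())
    return sum(1 for w in doc.lower().split() if w.strip('.,') in q_words)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 1: rank and return WHOLE documents - no chunking yet."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def chunk(doc: str, size: int = 5) -> list[str]:
    """Step 2: split a retrieved document into fixed-size word chunks."""
    words = doc.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

docs = [
    "History of AI and its evolution.",
    "Applications of AI in healthcare.",
    "Future trends in AI technology.",
]

# Late chunking: only the top-k retrieved documents are chunked.
relevant = retrieve("AI healthcare applications", docs)
chunks = [c for d in relevant for c in chunk(d)]
print(relevant[0])
```

Note that the corpus itself is never chunked; chunking work scales with the number of retrieved documents, not the collection size.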

Concrete example

Suppose you have a knowledge base of 3 documents:

  • Doc 1: "History of AI and its evolution."
  • Doc 2: "Applications of AI in healthcare."
  • Doc 3: "Future trends in AI technology."

With late chunking, the retrieval system first selects Doc 2 and Doc 3 as relevant to the query "AI healthcare applications." Then, only these two documents are chunked into smaller passages for the language model to process.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Retrieve whole documents (simulated retrieval)
retrieved_docs = [
    "Applications of AI in healthcare.",
    "Future trends in AI technology."
]

# Step 2: Late chunking - split retrieved docs into chunks
chunks = []
for doc in retrieved_docs:
    # Naive sentence split for the demo; a real system would use a text splitter
    chunks.extend(doc.split('. '))

# Step 3: Use chunks as context for generation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain AI applications in healthcare using the following info: " + ' | '.join(chunks)}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

print(response.choices[0].message.content)
output
AI applications in healthcare include diagnostics, personalized treatment plans, and patient monitoring using machine learning models to analyze medical data effectively.

When to use it

Use late chunking in RAG when your document collection contains large documents and you want to optimize retrieval efficiency by avoiding chunking the entire corpus upfront. It is ideal when retrieval quality benefits from whole-document context before chunking.

Do not use late chunking if your retrieval system requires chunk-level indexing or if you need very fine-grained retrieval from the start, as early chunking supports that better.
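For contrast, here is a minimal sketch of the early-chunking setup mentioned above, using the same toy word-overlap scoring as an illustration: the whole corpus is chunked and indexed before any query arrives, which is what enables chunk-level indexing and fine-grained retrieval from the start.

```python
def chunk(doc: str, size: int = 5) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = doc.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query: str, text: str) -> int:
    """Toy relevance score: count query words that appear in the text."""
    q_words = set(query.lower().split())
    return sum(1 for w in text.lower().split() if w.strip('.,') in q_words)

docs = [
    "History of AI and its evolution.",
    "Applications of AI in healthcare.",
    "Future trends in AI technology.",
]

# Early chunking: a chunk-level index is built upfront, before any query.
index = [(doc_id, c) for doc_id, d in enumerate(docs) for c in chunk(d)]

# Retrieval then scores individual chunks directly - fine-grained from the start.
best_doc_id, best_chunk = max(index, key=lambda t: score("AI healthcare applications", t[1]))
print(best_chunk)
```

The trade-off is visible here: early chunking pays the chunking cost for every document in the corpus, while late chunking defers that cost but cannot rank individual passages until after whole-document retrieval.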

Key terms

  • Retrieval-Augmented Generation (RAG): An AI architecture combining retrieval of documents with language model generation.
  • Chunking: Splitting documents into smaller pieces or passages for processing.
  • Late chunking: Delaying chunking until after document retrieval to improve efficiency and relevance.
  • Early chunking: Chunking documents before retrieval, indexing chunks individually.

Key Takeaways

  • Late chunking improves retrieval efficiency by chunking only retrieved documents, not the entire corpus.
  • It preserves whole-document context during retrieval, enhancing relevance.
  • Use late chunking when documents are large and retrieval quality depends on full-document context.
  • Avoid late chunking if your system requires chunk-level indexing or very fine-grained retrieval upfront.
Verified 2026-04 · gpt-4o