What is chunking in RAG?
In Retrieval-Augmented Generation (RAG), chunking is the process of splitting large documents into smaller, manageable pieces called chunks. This enables efficient retrieval of relevant information via vector search and improves the quality of generated answers by providing focused context to the language model.
How it works
Chunking breaks down large texts into smaller, coherent pieces, typically paragraphs or fixed-length text segments. These chunks are then embedded into vectors for similarity search. When a query is made, the system retrieves the most relevant chunks instead of the entire document, providing focused context to the language model. This is like using index cards for a book: instead of flipping through the whole book, you quickly find the relevant cards with key information.
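The splitting step itself can be sketched as a simple fixed-length chunker with overlap, so adjacent chunks share context across their boundary. This is a minimal illustration; the chunk size of 50 characters and overlap of 10 are arbitrary values chosen for demonstration, not recommendations:

```python
# Minimal fixed-length chunker with overlap (illustrative values only).
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of chunk_size characters, each overlapping
    the previous chunk by `overlap` characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "Chunking splits documents into smaller pieces for efficient retrieval."
for c in chunk_text(sample):
    print(repr(c))
```

In practice, chunkers often split on paragraph or sentence boundaries rather than raw character counts, but the overlap idea carries over: it reduces the chance that a relevant fact is cut in half at a chunk boundary.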
Concrete example
Here is a Python example using the OpenAI SDK to chunk a document and retrieve the most relevant chunk for a query in a RAG pipeline:
import os
import numpy as np
from openai import OpenAI

# Sample document
document = (
    "Retrieval-Augmented Generation (RAG) combines retrieval systems with language models. "
    "Chunking splits documents into smaller pieces for efficient retrieval. "
    "Each chunk is embedded and indexed for similarity search."
)

# Simple chunking by splitting on sentence boundaries
chunks = [c.strip() for c in document.split('. ') if c.strip()]

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Embed each chunk
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(model="text-embedding-3-small", input=chunk)
    embeddings.append(response.data[0].embedding)

# Embed the query the same way
query = "How does chunking help retrieval?"
query_embedding = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding

# Simple similarity search (cosine similarity)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_similarity(query_embedding, e) for e in embeddings]

# Retrieve the top-scoring chunk
top_chunk = chunks[np.argmax(scores)]
print("Top relevant chunk:", top_chunk)
# Expected output:
# Top relevant chunk: Chunking splits documents into smaller pieces for efficient retrieval
When to use it
Use chunking in RAG when working with large documents or corpora that exceed the token limits of language models. It improves retrieval speed and relevance by narrowing context to meaningful segments. Avoid chunking when documents are already short or when end-to-end context is critical and fits within model limits.
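The "exceeds the token limit" decision above can be sketched with a rough estimate of roughly 4 characters per token, a common rule of thumb for English text. The 8192-token limit here is an assumed example value, and a real tokenizer (e.g. tiktoken) would give exact counts:

```python
# Rough heuristic for deciding whether a document needs chunking.
# Assumes ~4 characters per token, an approximation, not an exact count.
def needs_chunking(text: str, context_limit_tokens: int = 8192) -> bool:
    estimated_tokens = len(text) // 4
    return estimated_tokens > context_limit_tokens

short_doc = "A brief note that easily fits in context."
long_doc = "word " * 20000  # ~100,000 characters

print(needs_chunking(short_doc))  # False
print(needs_chunking(long_doc))   # True
```

A document that passes this check whole can be sent to the model directly; otherwise it goes through the chunking-and-retrieval pipeline shown above.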
Key terms
| Term | Definition |
|---|---|
| Chunking | Splitting large documents into smaller segments for retrieval. |
| Retrieval-Augmented Generation (RAG) | An AI architecture combining retrieval systems with language models. |
| Embedding | A vector representation of text used for similarity search. |
| Vector Search | Finding relevant chunks by comparing vector embeddings. |
| Language Model | An AI model that generates text based on input context. |
Key takeaways
- Chunking breaks large documents into smaller pieces to optimize retrieval in RAG.
- Embedding chunks enables efficient similarity search for relevant context.
- Use chunking when documents exceed model token limits or for faster retrieval.
- Focused chunks improve language model answer quality by providing precise context.