How to set chunk overlap in LangChain
Quick answer
In LangChain, set chunk overlap by passing the chunk_overlap parameter to a text splitter class such as RecursiveCharacterTextSplitter. With the default length function, it sets the maximum number of characters shared between consecutive chunks, which preserves context across splits.
Prerequisites
- Python 3.8+
- pip install "langchain>=0.2"
- Basic familiarity with LangChain text splitting
Setup
Install LangChain if you haven't already. Quote the version specifier so the shell does not treat > as output redirection:
pip install "langchain>=0.2"
Step by step
Use the RecursiveCharacterTextSplitter from langchain.text_splitter and set the chunk_overlap parameter to control how many characters overlap between chunks. This helps maintain context when processing long documents.
# The same class is also exported by the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample text: adjacent string literals are concatenated into one string
text = (
    "LangChain is a powerful framework for building applications with language models. "
    "When splitting text, chunk overlap helps preserve context across chunks. "
    "This example shows how to set chunk overlap in LangChain."
)

# Initialize the text splitter with chunk size and overlap (both in characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
)

# Split the text into a list of strings
chunks = text_splitter.split_text(text)

# Print the chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}\n")
output
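Chunk 1: LangChain is a powerful framework for building

Chunk 2: building applications with language models. When

Chunk 3: When splitting text, chunk overlap helps preserve

Chunk 4: preserve context across chunks. This example

Chunk 5: example shows how to set chunk overlap in

Chunk 6: in LangChain.

Each chunk stays within 50 characters, and consecutive chunks share a trailing word of at most 10 characters. Exact boundaries may vary slightly between LangChain versions.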
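To make the effect of the two parameters concrete, here is a minimal pure-Python sketch of overlapping windows. It is an illustration only, not LangChain's algorithm: RecursiveCharacterTextSplitter prefers to break on separators such as spaces, while this hypothetical split_with_overlap helper cuts at fixed character positions.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each window starts chunk_size - chunk_overlap characters after the previous one, so the last three characters of one chunk reappear at the start of the next.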
Common variations
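You can use other text splitters such as CharacterTextSplitter or MarkdownTextSplitter, which also accept chunk_overlap. Note that CharacterTextSplitter splits on a single separator (by default a blank line, "\n\n"), so text that does not contain the separator is not split. Adjust chunk_size and chunk_overlap to suit your use case and your model's context window.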
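from langchain.text_splitter import CharacterTextSplitter

# The default separator is "\n\n"; the sample text from the previous example
# has no blank lines, so split on spaces instead
text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=100,
    chunk_overlap=20,
)
chunks = text_splitter.split_text(text)
print(f"Number of chunks: {len(chunks)}")
output
Number of chunks: 3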
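When tuning these parameters, it can help to check how much text consecutive chunks actually share, because a splitter that breaks on separators may produce less overlap than chunk_overlap allows. The sketch below uses only the standard library; shared_overlap and report_overlaps are hypothetical helper names, and the example chunks are hand-written rather than produced by LangChain.

```python
from typing import List


def shared_overlap(previous: str, following: str) -> str:
    """Return the longest suffix of previous that is also a prefix of following."""
    max_len = min(len(previous), len(following))
    for size in range(max_len, 0, -1):
        if previous.endswith(following[:size]):
            return following[:size]
    return ""


def report_overlaps(chunks: List[str]) -> List[int]:
    """Return the overlap length, in characters, for each consecutive pair of chunks."""
    return [len(shared_overlap(a, b)) for a, b in zip(chunks, chunks[1:])]


# Hand-written chunks for illustration
example_chunks = ["the quick brown fox", "brown fox jumps over", "over the lazy dog"]
print(report_overlaps(example_chunks))  # [9, 4]
```

You could pass the list returned by split_text to report_overlaps in the same way. A result of 0 for a pair means those two chunks share no text.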
Troubleshooting
- If chunks are too small or too numerous, increase chunk_size or decrease chunk_overlap.
- If context is lost between chunks, increase chunk_overlap to preserve more text.
- Ensure chunk_overlap is smaller than chunk_size; LangChain raises a ValueError when the overlap is larger than the chunk size.
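If the values come from configuration or user input, one option is to validate them before building a splitter. The following is a minimal sketch with a hypothetical check_chunk_params helper. It is deliberately stricter than LangChain's own check, since it also rejects an overlap equal to the chunk size.

```python
def check_chunk_params(chunk_size: int, chunk_overlap: int) -> None:
    """Raise ValueError if the pair cannot produce sensible overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")
    if chunk_overlap < 0:
        raise ValueError(f"chunk_overlap must not be negative, got {chunk_overlap}")
    if chunk_overlap >= chunk_size:
        raise ValueError(
            f"chunk_overlap ({chunk_overlap}) must be smaller than chunk_size ({chunk_size})"
        )


# Try one valid and one invalid combination
for size, overlap in [(50, 10), (50, 60)]:
    try:
        check_chunk_params(size, overlap)
        print(f"chunk_size={size}, chunk_overlap={overlap}: ok")
    except ValueError as err:
        print(f"rejected: {err}")
```

Calling the check first gives a clear error message at the point where the settings are read, rather than later when the splitter is constructed.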
Key Takeaways
- Set chunk_overlap in LangChain text splitters to preserve context between chunks.
- Use RecursiveCharacterTextSplitter or CharacterTextSplitter with overlap for flexible chunking.
- Adjust chunk_size and chunk_overlap based on your model's context window and task needs.