How to split text in LangChain
Quick answer
Use LangChain's built-in text splitter classes, such as RecursiveCharacterTextSplitter or CharacterTextSplitter, to divide large text into manageable chunks. Both let you set a chunk size and an overlap between consecutive chunks, so each chunk stays small enough for downstream processing and embedding generation.
Prerequisites
- Python 3.8+
- pip install "langchain>=0.2"
- Basic Python knowledge
Setup
Install LangChain and set up your Python environment to use text splitting utilities.
pip install "langchain>=0.2"
Step by step
This example demonstrates how to split a long text into chunks using RecursiveCharacterTextSplitter with a chunk size of 1000 characters and an overlap of 200 characters.
# The splitters live in the langchain-text-splitters package, a dependency of langchain 0.2.x.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sample text: one 82-character sentence repeated 50 times (4,100 characters)
text = """LangChain is a powerful framework for building applications with language models. """ * 50

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} (length {len(chunk)}):")
    print(chunk[:200] + '...')  # Print first 200 chars
    print('---')
output
Chunk 1 (length 998):
LangChain is a powerful framework for building applications with language models. LangChain is a powerful framework for building applications with language models. LangChain is a powerful framework fo...
---
Chunk 2 (length 992):
language models. LangChain is a powerful framework for building applications with language models. LangChain is a powerful framework for building applications with language models. LangChain is a powe...
---
(remaining chunks omitted)
Because the splitter breaks on spaces, chunks come out slightly shorter than chunk_size, and later chunks can start mid-sentence where the overlap begins.
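To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of fixed-width chunking with overlap. It is a simplified model, not LangChain's algorithm: RecursiveCharacterTextSplitter breaks on separators such as spaces, so its real boundaries differ. The function name sliding_chunks is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "LangChain is a powerful framework for building applications with language models. " * 50
demo_chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=200)

print(len(demo_chunks))                                # 5
print(demo_chunks[0][-200:] == demo_chunks[1][:200])   # True: the shared overlap
```

Each window advances by chunk_size minus chunk_overlap characters, so a larger overlap means more chunks and more repeated text between neighbors.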
Common variations
- Use CharacterTextSplitter for simpler splitting on a single separator, without recursion.
- Adjust chunk_size and chunk_overlap to balance chunk length and context overlap.
- Use the split_documents method to split a list of Document objects instead of raw text.
from langchain_text_splitters import CharacterTextSplitter

text = "This is a sample text to demonstrate splitting. " * 20

# CharacterTextSplitter splits on one separator (default "\n\n"). This text has no
# blank lines, so split on spaces; otherwise it comes back as a single oversized chunk.
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

print(f"Total chunks: {len(chunks)}")
print(chunks[0])
output
Total chunks: 21
This is a sample text to demonstrate splitting.
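The split_documents variation mentioned above is not shown in code, so here is a rough sketch of the idea: split each document's text and carry its metadata over to every resulting chunk. To keep it runnable without LangChain installed, it uses a hypothetical stand-in class Doc and a toy splitter by_paragraph; neither is part of the library.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: List[Doc], split_text: Callable[[str], List[str]]) -> List[Doc]:
    """Split each document's text and copy its metadata onto every chunk."""
    out = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            out.append(Doc(page_content=piece, metadata=dict(doc.metadata)))
    return out


def by_paragraph(text: str) -> List[str]:
    """Toy splitter (an assumption for this sketch): one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


docs = [
    Doc("Intro paragraph.\n\nSecond paragraph.", {"source": "guide.txt"}),
    Doc("Only one paragraph.", {"source": "notes.txt"}),
]
pieces = split_docs(docs, by_paragraph)
for p in pieces:
    print(p.metadata["source"], "->", p.page_content)
```

With LangChain installed, the equivalent step is to call splitter.split_documents(docs) on a list of Document objects, which returns chunked Document objects that keep each source document's metadata.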
Troubleshooting
- If chunks are too small or too large, adjust the chunk_size and chunk_overlap parameters.
- Ensure your input text is a string; otherwise, split_text will raise an error.
- For very large documents, consider splitting before embedding to avoid token limits.
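The first two checks above can be automated with a couple of small helpers. This is a sketch only: as_text and chunk_report are hypothetical names, not LangChain functions, and the report measures length in characters, which matches the splitters' default length function.

```python
from typing import List, Union


def as_text(value: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Return value as str, decoding bytes and rejecting anything else."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


def chunk_report(chunks: List[str], chunk_size: int) -> dict:
    """Summarize chunk lengths so size problems are easy to spot."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "oversized": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }


text = as_text(b"raw bytes read from a file")
print(text)
print(chunk_report(["short", "a" * 60, "b" * 40], chunk_size=50))
```

Run as_text on the input before passing it to split_text, then run chunk_report on the result. A nonzero oversized count usually means the chosen separator does not occur often enough in the text.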
Key Takeaways
- Use LangChain's text splitters, such as RecursiveCharacterTextSplitter, to chunk text efficiently.
- Adjust chunk_size and chunk_overlap to optimize chunk length and context retention.
- Split raw text or Document objects depending on your workflow needs.