How to use RecursiveCharacterTextSplitter in LangChain
Quick answer
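Use RecursiveCharacterTextSplitter in LangChain to split large documents into smaller chunks. It tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) until each chunk fits within chunk_size, which keeps related text together where possible. Instantiate it with parameters such as chunk_size and chunk_overlap, then call split_text() on a string or split_documents() on a list of Document objects.
Prerequisites
- Python 3.8+
- pip install "langchain>=0.2"
- Basic familiarity with LangChain document loaders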
Setup
Install LangChain if you haven't already. Ensure you have Python 3.8 or higher.
pip install "langchain>=0.2"
Step by step
Instantiate RecursiveCharacterTextSplitter with your desired chunk_size and chunk_overlap. Then use split_text() to split a string or split_documents() to split a list of LangChain Document objects.
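# These import paths match langchain 0.2; the same classes are also exposed as
# langchain_text_splitters.RecursiveCharacterTextSplitter and
# langchain_core.documents.Document.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document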
# Example text
text = """LangChain helps you build applications with LLMs.\n""" * 10 # repeated text
# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,    # maximum characters per chunk
    chunk_overlap=20   # maximum characters shared by consecutive chunks
)
# Split plain text
chunks = splitter.split_text(text)
print(f"Number of chunks: {len(chunks)}")
print(chunks[0])
# Split LangChain Document objects
docs = [Document(page_content=text)]
split_docs = splitter.split_documents(docs)
print(f"Number of split documents: {len(split_docs)}")
print(split_docs[0].page_content)
Output
Number of chunks: 5
LangChain helps you build applications with LLMs.
LangChain helps you build applications with LLMs.
Number of split documents: 5
LangChain helps you build applications with LLMs.
LangChain helps you build applications with LLMs.
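The recursive strategy behind the splitter can be pictured with a small pure-Python sketch. This is a simplified illustration, not LangChain's implementation: recursive_split is a hypothetical helper, it ignores chunk_overlap, and it drops separators at chunk boundaries. It only shows the core idea of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive splitting (no overlap handling)."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]

    # No usable separator left: cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "LangChain helps you build applications with LLMs.\n" * 10
chunks = recursive_split(sample, chunk_size=100)
print(len(chunks))
print(chunks[0])
```

On the sample text the sketch groups sentences in pairs: two 49-character sentences joined by a newline take 99 characters, while a third would push the chunk past the 100-character limit.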
Common variations
- Adjust chunk_size and chunk_overlap to balance chunk length and context overlap.
- Use split_documents() when working with LangChain Document objects to preserve metadata.
- Combine with document loaders like PyPDFLoader for PDFs before splitting.
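To get a feel for how chunk_size and chunk_overlap trade off, a back-of-the-envelope estimate can help. The sketch below assumes fixed-width character windows that advance by chunk_size minus chunk_overlap. estimate_chunk_count is a hypothetical helper, not part of LangChain, and the real splitter will usually produce a different count because it breaks on separators rather than at exact character offsets.

```python
import math


def estimate_chunk_count(n_chars, chunk_size, chunk_overlap):
    """Rough chunk count for fixed-width sliding windows (approximation only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    # Each window after the first contributes this many new characters.
    stride = chunk_size - chunk_overlap
    return math.ceil((n_chars - chunk_overlap) / stride)


# Compare a few overlap settings for a 10,000-character document.
for overlap in (0, 50, 250):
    print(overlap, estimate_chunk_count(10_000, 500, overlap))
```

The estimate makes the cost of overlap visible: the closer chunk_overlap gets to chunk_size, the more chunks (and downstream embedding or LLM calls) the same document produces.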
Troubleshooting
- If chunks are too small or too large, adjust chunk_size and chunk_overlap.
- If you get errors on split_documents(), ensure the input is a list of Document objects.
- For unexpected splits, check whether your text contains unusual characters or formatting that affects splitting.
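When chunks look wrong, it helps to measure them before changing parameters. The helpers below are a minimal stdlib-only sketch. chunk_report and find_unusual_characters are hypothetical names, not LangChain functions. They assume chunks is the list of strings returned by split_text() and that length is counted in characters, which is the splitter's default.

```python
import unicodedata
from statistics import mean


def chunk_report(chunks, chunk_size):
    """Summarize chunk lengths and list the indexes of oversized chunks."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "oversized": []}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        "oversized": [i for i, n in enumerate(lengths) if n > chunk_size],
    }


def find_unusual_characters(text):
    """Flag control, format, and non-ASCII space characters.

    Such characters are easy to miss by eye; a no-break space, for example,
    does not match a plain " " separator.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch in "\n\r\t ":
            continue  # ordinary whitespace is expected
        if unicodedata.category(ch) in {"Cc", "Cf", "Zl", "Zp", "Zs"}:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits


sample_chunks = ["short", "a" * 120, "medium sized chunk"]
report = chunk_report(sample_chunks, chunk_size=100)
odd = find_unusual_characters("a\u200bb\u00a0c\n")
print(report)
print(odd)
```

Running chunk_report on the result of split_text() before and after a parameter change gives a quick check on whether the adjustment moved chunk sizes in the intended direction.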
Key Takeaways
- Use RecursiveCharacterTextSplitter to recursively split text into manageable chunks preserving context.
- Set chunk_size and chunk_overlap to control chunk length and overlap for better downstream processing.
- Use split_documents for LangChain Document objects to keep metadata intact during splitting.