How to use RecursiveCharacterTextSplitter in LangChain
Quick answer
Use RecursiveCharacterTextSplitter from langchain.text_splitter to split large documents into smaller chunks recursively by characters, preserving semantic boundaries where possible. Instantiate it with parameters like chunk_size and chunk_overlap, then call split_text() on your input string to get the chunks.
Prerequisites
- Python 3.8+
- pip install "langchain>=0.2"
- Basic knowledge of Python
Setup
Install LangChain if you haven't already. Ensure you have Python 3.8 or newer.
pip install "langchain>=0.2"
Step by step
Import RecursiveCharacterTextSplitter, create an instance with your desired chunk_size and chunk_overlap, then split your text.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text = """LangChain is a powerful framework for building applications with language models. """
text += """It provides utilities for text splitting, prompt management, and chaining calls to LLMs."""
# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
# Split the text into chunks
chunks = splitter.split_text(text)
# Print the chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:", chunk)
Output
Chunk 1: LangChain is a powerful framework for building
Chunk 2: applications with language models. It provides
Chunk 3: utilities for text splitting, prompt management,
Chunk 4: and chaining calls to LLMs.
Common variations
- Adjust chunk_size and chunk_overlap to control chunk length and overlap.
- Use split_documents() to split a list of Document objects.
- Combine with LangChain's document loaders and embeddings for retrieval-augmented generation.
from langchain.schema import Document
# Example splitting multiple documents
docs = [Document(page_content=text)]
chunks_docs = splitter.split_documents(docs)
for i, doc in enumerate(chunks_docs, 1):
    print(f"Doc chunk {i}:", doc.page_content)
Output
Doc chunk 1: LangChain is a powerful framework for building
Doc chunk 2: applications with language models. It provides
Doc chunk 3: utilities for text splitting, prompt management,
Doc chunk 4: and chaining calls to LLMs.
Troubleshooting
- If chunks are too small or too large, adjust chunk_size and chunk_overlap.
- Ensure the input text is a string; otherwise, split_text() will raise an error.
- For very large texts, consider increasing chunk_size to reduce the number of chunks.
Key Takeaways
- Use RecursiveCharacterTextSplitter to split text recursively by characters with overlap.
- Adjust chunk_size and chunk_overlap to optimize chunk length for your use case.
- You can split both raw text strings and LangChain Document objects.
- Proper chunking improves downstream tasks like embeddings and retrieval.
- Always validate input types and tune parameters for best results.