How to use SemanticChunker in LangChain
Quick answer
Use SemanticChunker in LangChain to split documents into semantically meaningful chunks by leveraging embeddings and similarity. Initialize it with an embedding model and call split_text() on your text to get context-aware chunks.
Prerequisites
- Python 3.8+
- pip install langchain>=0.2.0 langchain-experimental (SemanticChunker lives in langchain_experimental)
- OpenAI API key (free tier works)
- pip install langchain-openai openai>=1.0
Setup
Install the langchain, langchain-experimental, and langchain-openai packages, and set your OpenAI API key as an environment variable.
- Install packages:
pip install langchain langchain-experimental langchain-openai openai
- Set environment variable:
export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
Step by step
This example shows how to use SemanticChunker with OpenAIEmbeddings to chunk a long text semantically.
import os
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Initialize embeddings with OpenAI
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
# Create SemanticChunker with embeddings
chunker = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
# Example long text
text = (
"LangChain is a framework for developing applications powered by language models. "
"SemanticChunker splits text into chunks based on semantic similarity rather than fixed length, "
"improving retrieval and context relevance in downstream tasks."
)
# Chunk the text
chunks = chunker.split_text(text)
# Print chunks
for i, chunk in enumerate(chunks, 1):
print(f"Chunk {i}: {chunk}\n")
Output
Chunk 1: LangChain is a framework for developing applications powered by language models.
Chunk 2: SemanticChunker splits text into chunks based on semantic similarity rather than fixed length, improving retrieval and context relevance in downstream tasks.
(Exact chunk boundaries depend on the embedding model's similarity scores, so your output may vary.)
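Under the hood, SemanticChunker's default "percentile" strategy embeds each sentence, measures the distance between neighbouring embeddings, and splits wherever that distance exceeds a chosen percentile. A minimal dependency-free sketch of that idea (the toy 2-D vectors and helper names here are illustrative, not LangChain API):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, pct):
    # Simple nearest-rank percentile
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[k]

def semantic_split(sentences, vectors, pct=50):
    # Split after any sentence whose distance to the next exceeds the threshold
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks

# Toy example: the first two "sentences" point the same way, the third is orthogonal
sentences = ["Cats purr.", "Kittens meow.", "Stocks fell today."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(semantic_split(sentences, vectors, pct=50))
# → ['Cats purr. Kittens meow.', 'Stocks fell today.']
```

The real implementation works on embedding vectors from your chosen model rather than hand-written toy vectors, but the breakpoint logic is the same shape: related sentences stay together, and a large semantic jump starts a new chunk.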
Common variations
- Adjust breakpoint_threshold_type ("percentile", "standard_deviation", "interquartile", or "gradient") and breakpoint_threshold_amount to control where splits occur.
- Use different embedding models by passing other Embeddings implementations.
- Integrate SemanticChunker with LangChain document loaders and retrievers for enhanced pipelines; create_documents() returns Document objects ready for indexing.
Troubleshooting
- If chunks are too small or too large, tune breakpoint_threshold_amount (and buffer_size, which controls how many neighbouring sentences are grouped before comparison).
- Ensure your OpenAI API key is set correctly in os.environ["OPENAI_API_KEY"].
- Check network connectivity if embedding calls fail.
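The tuning advice above can be made concrete: for the same set of neighbour distances, a higher percentile threshold (breakpoint_threshold_amount with the default "percentile" strategy) yields fewer breakpoints and therefore larger chunks. A hypothetical standalone sketch, not LangChain code:

```python
def count_breakpoints(distances, pct):
    # Nearest-rank percentile threshold; distances strictly above it become split points
    ordered = sorted(distances)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    threshold = ordered[k]
    return sum(1 for d in distances if d > threshold)

# Toy neighbour distances between consecutive sentence embeddings
distances = [0.05, 0.10, 0.80, 0.12, 0.65, 0.08]

for pct in (50, 75, 95):
    # Higher percentile -> fewer splits -> larger chunks
    print(pct, count_breakpoints(distances, pct))
# → 50 3
#   75 1
#   95 0
```

So if your chunks come out too small, raise the threshold amount; if they are too large, lower it.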
Key Takeaways
- Use SemanticChunker with embeddings to split text semantically, improving context relevance.
- Tune breakpoint_threshold_type and breakpoint_threshold_amount for optimal chunk granularity.
- Integrate SemanticChunker with LangChain pipelines for better document retrieval and QA.