How to split by tokens in LangChain
Quick answer
Use LangChain's TokenTextSplitter to split text by tokens. It leverages tokenizers such as tiktoken to chunk text by token count, keeping each chunk within prompt size limits.

Prerequisites
- Python 3.8+
- pip install langchain tiktoken
- An OpenAI API key (free tier works) if you also call OpenAI models; the tiktoken tokenizer itself runs locally and needs no key
Setup
Install LangChain and tiktoken for token-based splitting. Note that tiktoken tokenizes locally, so an OpenAI API key is only needed if you also call OpenAI models.
pip install langchain tiktoken

Step by step
Use TokenTextSplitter from LangChain to split text by tokens. Specify the tokenizer name and chunk size to control token limits.
from langchain.text_splitter import TokenTextSplitter

# Example text
text = """LangChain helps you build applications with LLMs by managing prompts and token limits effectively."""

# Initialize TokenTextSplitter with an OpenAI tokenizer
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # tiktoken encoding for GPT-4/GPT-3.5-turbo (GPT-4o uses o200k_base)
    chunk_size=10,                # max tokens per chunk
    chunk_overlap=0,              # no overlap between chunks
)

# Split the text
chunks = splitter.split_text(text)

# Print chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")

Output
Chunk 1: LangChain helps you build applications
Chunk 2: with LLMs by managing prompts and
Chunk 3: token limits effectively.
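Conceptually, token splitting is just a sliding window over the token sequence. The sketch below illustrates that mechanic with a hypothetical split_by_tokens helper; a plain whitespace split stands in for a real BPE tokenizer like tiktoken, so the boundaries are illustrative, not what TokenTextSplitter would produce.

```python
# Minimal sketch of token-window splitting. The whitespace "tokenizer" is a
# stand-in for a real BPE tokenizer such as tiktoken.
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    tokens = text.split()              # stand-in tokenizer
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))  # stand-in detokenizer
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = split_by_tokens("one two three four five six seven", chunk_size=3)
print(chunks)  # → ['one two three', 'four five six', 'seven']
```

The real splitter works the same way but on token IDs, decoding each window back to text, which is why chunk boundaries can fall mid-word for some tokenizers.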
Common variations
- Use different tokenizer encodings like gpt2 or r50k_base depending on your model.
- Adjust chunk_size and chunk_overlap for your use case.
- For async or streaming pipelines, integrate with LangChain's async utilities; token splitting itself remains synchronous.
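The effect of chunk_overlap is easiest to see with small numbers: each window starts chunk_size - chunk_overlap tokens after the previous one, so consecutive chunks repeat the last chunk_overlap tokens. A hedged sketch of that arithmetic, using plain integers as stand-in token IDs (exact boundaries from TokenTextSplitter depend on the real tokenizer):

```python
# Worked example: chunk_size=4, chunk_overlap=1 over 10 stand-in token IDs.
chunk_size = 4
chunk_overlap = 1
step = chunk_size - chunk_overlap   # each window advances by 3 tokens

tokens = list(range(10))            # stand-in token IDs 0..9
windows = []
start = 0
while start < len(tokens):
    windows.append(tokens[start:start + chunk_size])
    if start + chunk_size >= len(tokens):
        break
    start += step

print(windows)  # → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each consecutive pair of windows shares exactly one token (3, then 6), which is the overlap that helps preserve context across chunk boundaries.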
Troubleshooting
- If you get errors about an unknown encoding, verify that encoding_name matches a valid tiktoken encoding.
- Ensure tiktoken is installed and up to date.
- For very large texts, increase chunk_size or pre-split the text before token splitting.
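To check an encoding name before constructing the splitter, tiktoken exposes list_encoding_names(). A guarded sketch that also degrades gracefully when tiktoken isn't installed:

```python
# Verify an encoding name against tiktoken's registry before using it.
name = "cl100k_base"
try:
    import tiktoken
    valid = tiktoken.list_encoding_names()
    if name in valid:
        print(f"{name} is a valid tiktoken encoding")
    else:
        print(f"Unknown encoding {name!r}; valid names include: {valid}")
except ImportError:
    print("tiktoken not installed; run: pip install tiktoken")
```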
Key Takeaways
- Use LangChain's TokenTextSplitter with a proper tokenizer to split text by tokens.
- Adjust chunk_size and chunk_overlap to control token chunking for prompt limits.
- Ensure the tokenizer encoding matches your target model's tokenization scheme.