How to chunk documents for embeddings
Quick answer
Chunk documents into smaller, semantically coherent pieces (e.g., paragraphs or fixed token lengths) before generating embeddings. Use overlapping chunks to preserve context and avoid cutting off important information, which improves vector search and retrieval quality.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai>=1.0

Step by step
Chunk documents by splitting text into overlapping segments of a fixed token or character length, then generate embeddings for each chunk. This preserves semantic context and improves retrieval accuracy.
import os
from openai import OpenAI

# Initialize the OpenAI client (reads the key set during Setup)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks; overlap must be smaller than chunk_size."""
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        # Step forward, keeping `overlap` characters of shared context
        start += chunk_size - overlap
    return chunks
# Example document
document = (
    "OpenAI provides powerful AI models for natural language processing. "
    "Chunking documents helps create better embeddings by preserving context. "
    "Embeddings convert text into vectors for semantic search and retrieval."
)

# Chunk the document
chunks = chunk_text(document, chunk_size=50, overlap=10)

# Generate embeddings for each chunk
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk
    )
    vector = response.data[0].embedding
    embeddings.append(vector)

print(f"Number of chunks: {len(chunks)}")
print(f"First chunk: {chunks[0]}")

Output
Number of chunks: 6
First chunk: OpenAI provides powerful AI models for natural lan
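Once each chunk has an embedding, retrieval is a nearest-neighbor search over the stored vectors. The sketch below uses plain Python and tiny made-up 3-dimensional vectors in place of real 1,536-dimensional API embeddings; `cosine_similarity` and `top_k_chunks` are illustrative helpers invented here, not part of the openai library:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy vectors stand in for real embeddings
chunks = ["about cats", "about dogs", "about cars"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]
query = [1.0, 0.0, 0.0]  # pretend this embeds "feline pets"
print(top_k_chunks(query, vectors, chunks))  # → ['about cats', 'about dogs']
```

In a real pipeline the query embedding comes from the same `client.embeddings.create` call as the chunk embeddings, and a vector database replaces the sorted list once chunk counts grow.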
Common variations
You can chunk by paragraphs or sentences using NLP libraries like nltk or spacy for more semantic coherence. Adjust chunk_size and overlap to stay within the model's input limit (text-embedding-3-small accepts up to 8,191 tokens per input). For large documents, process chunks in batches or asynchronously to improve throughput.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # newer NLTK releases may also need: nltk.download('punkt_tab')

def chunk_by_sentences(text, max_tokens=100):
    """Group sentences into chunks of roughly max_tokens words (a rough proxy for tokens)."""
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_length = len(sentence.split())
        # Guard on current_chunk so a single long first sentence never emits an empty chunk
        if current_chunk and current_length + sentence_length > max_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
# Example usage
text = "Your long document text here."
sentence_chunks = chunk_by_sentences(text)
print(f"Chunks created: {len(sentence_chunks)}")

Output
Chunks created: 1
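Batching the embedding requests is straightforward because the embeddings endpoint accepts a list of inputs, not just a single string. A sketch (`batched` and `embed_chunks` are helper names invented here, and the batch size of 100 is an illustrative choice, not an API requirement):

```python
def batched(items, size):
    """Yield successive slices of `items` of length at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(client, chunks, model="text-embedding-3-small", batch_size=100):
    """Embed many chunks with one API request per batch."""
    embeddings = []
    for batch in batched(chunks, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Results come back in the same order as the inputs
        embeddings.extend(item.embedding for item in response.data)
    return embeddings

print(list(batched(["a", "b", "c", "d", "e"], 2)))  # → [['a', 'b'], ['c', 'd'], ['e']]
```

Fewer, larger requests reduce per-request overhead, but each batch must still fit within the model's per-input and per-request token limits.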
Troubleshooting
- If embeddings are poor, increase chunk overlap to preserve context.
- For very large documents, watch for API token limits and split accordingly.
- Ensure text encoding is consistent (UTF-8) to avoid errors.
- If you get rate limit errors, implement retries or batch requests.
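For the rate-limit bullet above, a generic retry wrapper with exponential backoff is one option. This is a sketch: `with_retries` is a name invented here, in production you would catch `openai.RateLimitError` rather than bare `Exception`, and note that the openai client already retries some failures itself via its `max_retries` setting:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

# Usage with the embeddings call from earlier (client and chunk assumed defined):
# response = with_retries(
#     lambda: client.embeddings.create(model="text-embedding-3-small", input=chunk)
# )
```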
Key Takeaways
- Chunk documents into overlapping segments to maintain semantic context for embeddings.
- Adjust chunk size and overlap based on model token limits and document structure.
- Use sentence or paragraph boundaries for more natural chunking with NLP tools.
- Batch or async embedding calls improve performance on large datasets.
- Monitor API limits and encoding to avoid common errors during embedding generation.