How to chunk documents for embeddings
Quick answer
Chunk documents into smaller, semantically coherent pieces (e.g., paragraphs or fixed token lengths) before generating embeddings. Use overlapping chunks to preserve context and avoid cutting off important information, which improves vector search and retrieval quality.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai>=1.0

Step by step
Chunk documents by splitting text into overlapping segments of a fixed token or character length, then generate embeddings for each chunk. This preserves semantic context and improves retrieval accuracy.
import os
from openai import OpenAI

# Initialize the OpenAI client (reads the key set during Setup)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks; overlap must be smaller than chunk_size."""
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        # Step forward, keeping `overlap` characters of shared context
        start += chunk_size - overlap
    return chunks
# Example document
document = (
    "OpenAI provides powerful AI models for natural language processing. "
    "Chunking documents helps create better embeddings by preserving context. "
    "Embeddings convert text into vectors for semantic search and retrieval."
)

# Chunk the document
chunks = chunk_text(document, chunk_size=50, overlap=10)

# Generate embeddings for each chunk
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk
    )
    vector = response.data[0].embedding
    embeddings.append(vector)

print(f"Number of chunks: {len(chunks)}")
print(f"First chunk: {chunks[0]}")

Output
Number of chunks: 6
First chunk: OpenAI provides powerful AI models for natural lan
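Once each chunk has an embedding, retrieval is a nearest-neighbor search over the stored vectors. The sketch below uses plain Python and tiny made-up 3-dimensional vectors in place of real 1,536-dimensional API embeddings; `cosine_similarity` and `top_k_chunks` are illustrative helpers invented here, not part of the openai library:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy vectors stand in for real embeddings
chunks = ["about cats", "about dogs", "about cars"]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]
query = [1.0, 0.0, 0.0]  # pretend this embeds "feline pets"
print(top_k_chunks(query, vectors, chunks))  # → ['about cats', 'about dogs']
```

In a real pipeline the query embedding comes from the same `client.embeddings.create` call as the chunk embeddings, and a vector database replaces the sorted list once chunk counts grow.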
Common variations
You can chunk by paragraphs or sentences using NLP libraries like nltk or spacy for more semantic coherence. Adjust chunk_size and overlap to stay within the model's input limit (text-embedding-3-small accepts up to 8,191 tokens per input). For large documents, process chunks in batches or asynchronously to improve throughput.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # newer NLTK releases may also need: nltk.download('punkt_tab')

def chunk_by_sentences(text, max_tokens=100):
    """Group sentences into chunks of roughly max_tokens words (a rough proxy for tokens)."""
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_length = len(sentence.split())
        # Guard on current_chunk so a single long first sentence never emits an empty chunk
        if current_chunk and current_length + sentence_length > max_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
# Example usage
text = "Your long document text here."
sentence_chunks = chunk_by_sentences(text)
print(f"Chunks created: {len(sentence_chunks)}")

Output
Chunks created: 1
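Batching the embedding requests is straightforward because the embeddings endpoint accepts a list of inputs, not just a single string. A sketch (`batched` and `embed_chunks` are helper names invented here, and the batch size of 100 is an illustrative choice, not an API requirement):

```python
def batched(items, size):
    """Yield successive slices of `items` of length at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_chunks(client, chunks, model="text-embedding-3-small", batch_size=100):
    """Embed many chunks with one API request per batch."""
    embeddings = []
    for batch in batched(chunks, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Results come back in the same order as the inputs
        embeddings.extend(item.embedding for item in response.data)
    return embeddings

print(list(batched(["a", "b", "c", "d", "e"], 2)))  # → [['a', 'b'], ['c', 'd'], ['e']]
```

Fewer, larger requests reduce per-request overhead, but each batch must still fit within the model's per-input and per-request token limits.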
Troubleshooting
- If embeddings are poor, increase chunk overlap to preserve context.
- For very large documents, watch for API token limits and split accordingly.
- Ensure text encoding is consistent (UTF-8) to avoid errors.
- If you get rate limit errors, implement retries or batch requests.
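For the rate-limit bullet above, a generic retry wrapper with exponential backoff is one option. This is a sketch: `with_retries` is a name invented here, in production you would catch `openai.RateLimitError` rather than bare `Exception`, and note that the openai client already retries some failures itself via its `max_retries` setting:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

# Usage with the embeddings call from earlier (client and chunk assumed defined):
# response = with_retries(
#     lambda: client.embeddings.create(model="text-embedding-3-small", input=chunk)
# )
```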
Key Takeaways
- Chunk documents into overlapping segments to maintain semantic context for embeddings.
- Adjust chunk size and overlap based on model token limits and document structure.
- Use sentence or paragraph boundaries for more natural chunking with NLP tools.
- Batch or async embedding calls improve performance on large datasets.
- Monitor API limits and encoding to avoid common errors during embedding generation.