How to do semantic chunking in Python
Quick answer
Semantic chunking splits text into meaningful segments based on context rather than fixed sizes. Use an embedding model such as text-embedding-3-small from OpenAI to vectorize text chunks, then cluster or split them based on semantic similarity.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install numpy scikit-learn
Setup
Install the required Python packages and set your OpenAI API key as an environment variable.
- Install OpenAI SDK and dependencies:
```shell
pip install openai numpy scikit-learn
```

Step by step
This example shows how to semantically chunk a long text by embedding overlapping text chunks and grouping them by cosine similarity.
```python
import os

import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample long text
text = (
    "Semantic chunking splits text into meaningful segments based on context. "
    "This improves downstream tasks like search, summarization, and question answering. "
    "We use embeddings to capture semantic similarity between chunks. "
    "Chunks are created with overlap to preserve context across boundaries. "
    "Finally, clustering or thresholding groups similar chunks together."
)

# Parameters
chunk_size = 50     # characters
chunk_overlap = 10  # characters

# Function to create overlapping chunks
def create_chunks(text, size, overlap):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        start += size - overlap
    return chunks

chunks = create_chunks(text, chunk_size, chunk_overlap)

# Get embeddings for each chunk
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks,
)
embeddings = [data.embedding for data in response.data]

# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(embeddings)

# Simple semantic chunk grouping by threshold
threshold = 0.85
groups = []
visited = set()
for i in range(len(chunks)):
    if i in visited:
        continue
    group = [chunks[i]]
    visited.add(i)
    for j in range(i + 1, len(chunks)):
        if similarity_matrix[i][j] > threshold and j not in visited:
            group.append(chunks[j])
            visited.add(j)
    groups.append(group)

# Output grouped semantic chunks
for idx, group in enumerate(groups, 1):
    print(f"Group {idx}:")
    for chunk in group:
        print(f"- {chunk}")
    print()
```

Output
```
Group 1:
- Semantic chunking splits text into meaningful segments based on context. This improves downstream tasks like search, summarization, and question answering.

Group 2:
- We use embeddings to capture semantic similarity between chunks. Chunks are created with overlap to preserve context across boundaries.

Group 3:
- Finally, clustering or thresholding groups similar chunks together.
```

Note that this output is illustrative; the exact groups depend on your chunk size, overlap, and similarity threshold.
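The grouping above clusters any pair of sufficiently similar chunks, wherever they appear. An alternative is to split at semantic boundaries instead: compare each chunk's embedding only to the previous chunk's, and start a new segment whenever similarity drops below the threshold. A minimal sketch, where `split_at_boundaries` is a hypothetical helper and the toy 2-D vectors stand in for real API embeddings:

```python
import numpy as np

def split_at_boundaries(embeddings, threshold=0.85):
    """Start a new segment whenever consecutive-chunk similarity drops below threshold."""
    segments = [[0]]  # segments hold chunk indices
    for i in range(1, len(embeddings)):
        a, b = embeddings[i - 1], embeddings[i]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:
            segments.append([i])   # similarity dropped: new segment
        else:
            segments[-1].append(i) # still on-topic: extend current segment
    return segments

# Toy embeddings: first two vectors point the same way, the third is orthogonal
toy = [np.array([1.0, 0.0]), np.array([0.98, 0.2]), np.array([0.0, 1.0])]
print(split_at_boundaries(toy, threshold=0.85))
```

This keeps segments contiguous in the original text, which is usually what you want for retrieval.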
Common variations
You can run the embedding requests asynchronously with the OpenAI SDK's AsyncOpenAI client. Alternatively, use a different embedding model like text-embedding-3-large for higher accuracy. For large documents, consider chunking by sentences or paragraphs before embedding.
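For sentence-level pre-chunking, a naive regex splitter is often enough to start with. A rough sketch, where `split_sentences` is an illustrative helper; a library such as nltk handles edge cases like abbreviations more robustly:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = (
    "Semantic chunking splits text into meaningful segments. "
    "Each sentence becomes a candidate chunk! Short ones can be merged later."
)
print(split_sentences(sample))
```

Each sentence then becomes one input chunk for the embedding call, replacing the fixed-size `create_chunks` step.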
The async variant looks like this (note the v1 SDK uses AsyncOpenAI with an awaited embeddings.create, not acreate):

```python
import asyncio
import os

from openai import AsyncOpenAI

async def async_embedding(chunks):
    # The async client exposes the same embeddings.create call, awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    return [data.embedding for data in response.data]

# Usage example
# embeddings = asyncio.run(async_embedding(chunks))
```

Troubleshooting
- If you get rate limit errors, reduce chunk size or add delays between requests.
- Ensure your API key is set correctly in os.environ["OPENAI_API_KEY"].
- Embedding input size limits vary by model; split text accordingly.
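For rate-limit errors, a retry-with-exponential-backoff wrapper is a common remedy. A generic sketch, where `with_retries` is a hypothetical helper; the exception class to catch (e.g. openai.RateLimitError in openai>=1.0) depends on your SDK version:

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff on the given exception types."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage (hypothetical): retry the embedding call on rate-limit errors
# embeddings = with_retries(
#     lambda: client.embeddings.create(model="text-embedding-3-small", input=chunks),
#     retry_on=(openai.RateLimitError,),
# )
```

Keeping the backoff exponential spaces out retries so you recover from bursts without hammering the API.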
Key Takeaways
- Use overlapping text chunks to preserve semantic context during chunking.
- Embed chunks with text-embedding-3-small and cluster by cosine similarity.
- Adjust chunk size and similarity threshold based on your text and use case.