How-to · Intermediate · 3 min read

How to do semantic chunking in Python

Quick answer
Semantic chunking in Python splits text into meaningful segments based on context rather than fixed sizes. Use an embedding model such as OpenAI's text-embedding-3-small to vectorize text chunks, then group or split them based on semantic similarity.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell does not treat > as a redirect)
  • pip install numpy scikit-learn

Setup

Install the required Python packages and set your OpenAI API key as an environment variable.

  • Install OpenAI SDK and dependencies:
bash
pip install openai numpy scikit-learn

Step by step

This example shows how to semantically chunk a long text by embedding overlapping text chunks and grouping them by similarity using cosine distance.

python
import os
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample long text
text = (
    "Semantic chunking splits text into meaningful segments based on context. "
    "This improves downstream tasks like search, summarization, and question answering. "
    "We use embeddings to capture semantic similarity between chunks. "
    "Chunks are created with overlap to preserve context across boundaries. "
    "Finally, clustering or thresholding groups similar chunks together."
)

# Parameters
chunk_size = 50  # characters
chunk_overlap = 10  # characters

# Create overlapping character chunks
def create_chunks(text, size, overlap):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        start += size - overlap
    return chunks

chunks = create_chunks(text, chunk_size, chunk_overlap)

# Get embeddings for each chunk
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks
)
embeddings = [data.embedding for data in response.data]

# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(embeddings)

# Simple semantic chunk grouping by threshold
threshold = 0.85
groups = []
visited = set()

for i in range(len(chunks)):
    if i in visited:
        continue
    group = [chunks[i]]
    visited.add(i)
    for j in range(i + 1, len(chunks)):
        if similarity_matrix[i][j] > threshold and j not in visited:
            group.append(chunks[j])
            visited.add(j)
    groups.append(group)

# Output grouped semantic chunks
for idx, group in enumerate(groups, 1):
    print(f"Group {idx}:")
    for chunk in group:
        print(f"- {chunk}")
    print()
output (illustrative — the actual 50-character chunks print as shorter fragments than the full sentences shown)
Group 1:
- Semantic chunking splits text into meaningful segments based on context. This improves downstream tasks like search, summarization, and question answering.

Group 2:
- We use embeddings to capture semantic similarity between chunks. Chunks are created with overlap to preserve context across boundaries.

Group 3:
- Finally, clustering or thresholding groups similar chunks together.
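
A common refinement of the grouping above splits only where the similarity between consecutive chunks drops below the threshold, which keeps the chunks in document order. A minimal sketch with synthetic 2-D embeddings (no API call needed; the vectors are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def split_on_similarity_drop(embeddings, threshold=0.85):
    """Start a new group wherever consecutive chunks diverge semantically."""
    groups = [[0]]
    for i in range(1, len(embeddings)):
        sim = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if sim >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# Synthetic embeddings: the first two vectors point one way, the last two another
vectors = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
print(split_on_similarity_drop(vectors))  # [[0, 1], [2, 3]]
```

Because splits only happen at boundaries, this variant never merges distant chunks that happen to be similar.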

Common variations

You can perform semantic chunking asynchronously using async Python with the OpenAI SDK. Alternatively, use different embedding models like text-embedding-3-large for higher accuracy. For large documents, consider chunking by sentences or paragraphs before embedding.

python
import asyncio
import os
from openai import AsyncOpenAI

async def async_embedding(chunks):
    # The async client exposes the same embeddings.create call, awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    return [data.embedding for data in response.data]

# Usage example
# embeddings = asyncio.run(async_embedding(chunks))
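
For the sentence-level pre-chunking mentioned above, a simple regex splitter is often enough to try the idea (a rough sketch; libraries such as nltk handle abbreviations and other edge cases better):

```python
import re

def split_sentences(text):
    """Naively split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "Semantic chunking splits text. It helps search! Does it help QA? Yes."
print(split_sentences(text))
```

Feed the resulting sentences to the embedding call instead of fixed-size character slices.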

Troubleshooting

  • If you get rate-limit errors, send smaller batches per request or add delays and retries between requests.
  • Ensure your API key is set correctly in os.environ["OPENAI_API_KEY"].
  • Embedding input size limits vary by model; split text accordingly.
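
For rate limits, a generic retry-with-exponential-backoff wrapper can be placed around the embeddings call. This is a sketch: the attempt count and delays are assumptions to tune, and in practice you would catch openai.RateLimitError rather than a broad exception.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch (RateLimitError comes from the openai package):
# embeddings = with_retries(
#     lambda: client.embeddings.create(model="text-embedding-3-small", input=chunks),
#     retry_on=(RateLimitError,),
# )
```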

Key Takeaways

  • Use overlapping text chunks to preserve semantic context during chunking.
  • Embed chunks with text-embedding-3-small and cluster by cosine similarity.
  • Adjust chunk size and similarity threshold based on your text and use case.
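
One way to pick a threshold is to sweep candidate values and watch how the number of groups changes. A toy sketch using the consecutive-pair rule on made-up similarity scores (the numbers are illustrative, not from a real model):

```python
# Toy similarities between each pair of adjacent chunks
pair_sims = [0.95, 0.60, 0.90, 0.40]

for threshold in (0.5, 0.7, 0.92):
    # A split happens at each pair whose similarity falls below the threshold
    n_groups = 1 + sum(s < threshold for s in pair_sims)
    print(threshold, "->", n_groups, "groups")  # 0.5 -> 2, 0.7 -> 3, 0.92 -> 4
```

Pick the smallest threshold that still separates topics you know should be distinct.
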
Verified 2026-04 · text-embedding-3-small