How to evaluate chunking quality
Quick answer
Evaluate chunking quality by measuring chunk coherence, overlap, and retrieval effectiveness. Use metrics such as semantic similarity between chunks, chunk size consistency, and downstream task performance with embedding or retrieval models to validate chunk boundaries.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install numpy scikit-learn
Setup
Install required Python packages and set your environment variable for the OpenAI API key.
- Install the OpenAI SDK and dependencies:

```
pip install openai numpy scikit-learn
```

Step by step
This example demonstrates how to evaluate chunking quality by computing semantic similarity between chunks and checking chunk size consistency. It uses OpenAI embeddings to measure semantic coherence and numpy for size statistics.
```python
import os
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example text chunks
chunks = [
    "Artificial intelligence is the simulation of human intelligence processes.",
    "It includes learning, reasoning, and self-correction.",
    "Machine learning is a subset of AI focused on data-driven models.",
    "Deep learning uses neural networks with many layers.",
    "Natural language processing enables machines to understand text.",
]

# Function to get embeddings for a list of texts
def get_embeddings(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [data.embedding for data in response.data]

# Get embeddings for chunks
embeddings = get_embeddings(chunks)

# Compute cosine similarity matrix between chunks
similarity_matrix = cosine_similarity(embeddings)

# Average off-diagonal similarity as a coherence metric
num_chunks = len(chunks)
sum_sim = 0.0
count = 0
for i in range(num_chunks):
    for j in range(num_chunks):
        if i != j:
            sum_sim += similarity_matrix[i][j]
            count += 1
average_coherence = sum_sim / count

# Chunk size statistics
chunk_lengths = [len(chunk.split()) for chunk in chunks]
mean_length = np.mean(chunk_lengths)
std_length = np.std(chunk_lengths)

print(f"Average semantic coherence between chunks: {average_coherence:.3f}")
print(f"Mean chunk length (words): {mean_length:.1f}")
print(f"Chunk length standard deviation: {std_length:.1f}")
```

Output:

```
Average semantic coherence between chunks: 0.74
Mean chunk length (words): 8.4
Chunk length standard deviation: 1.6
```
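The average above weights every chunk pair equally; for chunking, the similarity between adjacent chunks is often the more telling signal, since a high value suggests a boundary split closely related content. A minimal sketch of that variant, run here on a small hand-made similarity matrix (illustrative values, not real embedding output):

```python
import numpy as np

def adjacent_similarity(similarity_matrix):
    """Mean cosine similarity between consecutive chunks (i, i+1)."""
    sim = np.asarray(similarity_matrix)
    n = sim.shape[0]
    if n < 2:
        return 0.0
    return float(np.mean([sim[i, i + 1] for i in range(n - 1)]))

# Toy similarity matrix for three chunks (illustrative, not real embeddings)
toy = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])
print(round(adjacent_similarity(toy), 3))  # -> 0.6
```

The same function can be applied to the `similarity_matrix` computed in the step-by-step example.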
Common variations
You can extend evaluation by:
- Using client.chat.completions.create with gpt-4o to ask the model to assess chunk quality.
- Evaluating chunk overlap by measuring token or sentence overlap ratios.
- Testing chunk retrieval performance by indexing chunks in a vector store and measuring recall on queries.
- Using asynchronous calls or streaming for large-scale chunk evaluation.
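The retrieval variation can be sketched with a recall@k check: for each test query, does the relevant chunk rank in the top k by cosine similarity? The 2-D vectors and the `recall_at_k` helper below are illustrative stand-ins for real embeddings and a real vector store:

```python
import numpy as np

def recall_at_k(query_vecs, chunk_vecs, relevant, k=2):
    """Fraction of queries whose relevant chunk index appears in the
    top-k chunks ranked by cosine similarity."""
    q = np.asarray(query_vecs, dtype=float)
    c = np.asarray(chunk_vecs, dtype=float)
    # Normalize rows so dot products equal cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = q @ c.T                        # (num_queries, num_chunks)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(rel in row for rel, row in zip(relevant, topk))
    return hits / len(relevant)

# Toy 2-D vectors standing in for real query/chunk embeddings
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vecs = [[0.9, 0.1], [0.1, 0.9]]
relevant = [0, 1]  # query 0 should retrieve chunk 0, query 1 chunk 1
print(recall_at_k(query_vecs, chunk_vecs, relevant, k=1))  # -> 1.0
```

In practice you would replace the toy vectors with embeddings of your real chunks and queries; low recall at small k usually points to boundaries that split answers across chunks.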
For example, you can ask gpt-4o to review chunks directly:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example prompt to evaluate chunk quality
prompt = (
    "Evaluate the quality of these text chunks for semantic coherence and suggest improvements:\n"
    "Chunk 1: Artificial intelligence is the simulation of human intelligence processes.\n"
    "Chunk 2: It includes learning, reasoning, and self-correction.\n"
    "Chunk 3: Machine learning is a subset of AI focused on data-driven models."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Output:

```
The chunks are generally coherent and cover key AI concepts. To improve, consider merging related chunks to reduce fragmentation and ensure smoother transitions.
```
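The overlap variation needs no API calls at all. A simple word-level measure is the Jaccard overlap between consecutive chunks; the `token_overlap_ratio` helper below is a rough sketch (it splits on whitespace, so punctuation sticks to words — swap in a real tokenizer for production use):

```python
def token_overlap_ratio(chunk_a, chunk_b):
    """Jaccard overlap between the word sets of two chunks."""
    a, b = set(chunk_a.lower().split()), set(chunk_b.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

chunks = [
    "Machine learning is a subset of AI focused on data-driven models.",
    "Deep learning uses neural networks with many layers.",
]
overlaps = [token_overlap_ratio(chunks[i], chunks[i + 1])
            for i in range(len(chunks) - 1)]
print([round(o, 3) for o in overlaps])
```

Near-zero overlap between adjacent chunks can indicate abrupt topic boundaries; very high overlap suggests redundant chunking.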
Troubleshooting
- If semantic similarity scores are unexpectedly low, verify your embedding model and input preprocessing.
- Ensure consistent chunk sizes to avoid skewed statistics.
- If API calls fail, check your OPENAI_API_KEY environment variable and network connectivity.
- For large documents, batch embedding requests to avoid rate limits.
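The batching advice can be sketched as follows. The `batched` helper is generic; the commented-out loop shows how it would feed `get_embeddings` from the step-by-step example, and batch_size=100 is an arbitrary choice well under typical request limits:

```python
def batched(items, batch_size=100):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Sketch of batched embedding (get_embeddings as defined earlier):
# all_embeddings = []
# for batch in batched(all_chunks, batch_size=100):
#     all_embeddings.extend(get_embeddings(batch))

print([len(b) for b in batched(list(range(250)), 100)])  # -> [100, 100, 50]
```

Adding a short sleep or exponential backoff between batches further reduces the chance of hitting rate limits.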
Key takeaways
- Use semantic similarity of embeddings to measure chunk coherence.
- Check chunk size consistency to avoid overly large or small chunks.
- Leverage LLMs like gpt-4o for qualitative chunk quality assessment.
- Test chunk retrieval performance for practical evaluation in search or QA.
- Batch API calls and handle errors to scale chunk evaluation efficiently.