Fix overlapping chunks causing duplicate results
Quick answer
To fix overlapping chunks causing duplicate results, reduce or eliminate the chunk_overlap parameter and set chunk_size appropriately. Use non-overlapping chunks, or add a deduplication step after chunking, to avoid repeated content in your AI processing pipeline.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable.
pip install "openai>=1.0"
Step by step
This example shows how to chunk text without overlap to avoid duplicate results. Adjust chunk_size and set chunk_overlap to zero or a small value.
import os
from openai import OpenAI

# The OpenAI import is for the downstream processing pipeline;
# the chunking itself is plain Python and makes no API calls.

# Sample text to chunk
text = """This is a long document that needs to be split into chunks for processing. """ * 10

# Parameters for chunking
chunk_size = 50    # characters per chunk
chunk_overlap = 0  # no overlap to avoid duplicates

# Function to chunk text without overlap
def chunk_text(text, chunk_size, chunk_overlap):
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks

chunks = chunk_text(text, chunk_size, chunk_overlap)

# Print chunks to verify no duplicates
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output
Chunk 1: This is a long document that needs to be split int
Chunk 2: o chunks for processing. This is a long document t
Chunk 3: hat needs to be split into chunks for processing.
...
Chunk 15: hat needs to be split into chunks for processing.
The sample text is ten copies of one 75-character sentence split into 15 chunks of 50 characters, so the same three chunk contents recur, but each character of the text appears in exactly one chunk: the repetition reflects the repetitive input, not overlap.
Common variations
To allow some overlap but reduce duplicates, set chunk_overlap to a small value (e.g., 10-20 characters). Alternatively, implement a deduplication step after chunking by hashing or comparing chunk content.
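The hashing approach can be sketched as follows, using Python's standard hashlib (the helper name dedupe_chunks is illustrative, not from any library):

```python
import hashlib

def dedupe_chunks(chunks):
    # Keep the first occurrence of each chunk, keyed by a content hash.
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

chunks = ["alpha", "beta", "alpha", "gamma"]
print(dedupe_chunks(chunks))  # ['alpha', 'beta', 'gamma']
```

Hashing is useful when chunks are long: comparing fixed-size digests is cheaper than keeping full chunk strings in a set, and the digests can double as stable IDs in a vector store.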
For async chunking or streaming, adapt the chunking logic accordingly but keep overlap minimal.
import asyncio

async def async_chunk_text(text, chunk_size, chunk_overlap):
    # Same logic as the synchronous version; in a real pipeline you
    # would await I/O (e.g., fetching the text) inside this function.
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks

async def main():
    # Usage example with a small overlap
    chunks = await async_chunk_text(text, 50, 10)

    # Deduplicate chunks, preserving order
    unique_chunks = list(dict.fromkeys(chunks))
    print(f"Total chunks: {len(chunks)}")
    print(f"Unique chunks after deduplication: {len(unique_chunks)}")

asyncio.run(main())

Output
Total chunks: 19
Unique chunks after deduplication: 16
Because the sample text is periodic and the 10-character overlap makes the chunk starts cycle, three chunks recur verbatim; deduplication drops them while keeping the first occurrence of each.
Troubleshooting
- If you see duplicate results in your AI output, verify your chunking overlap is not too large.
- Check if your chunking logic increments the start index correctly to avoid reusing the same text.
- Use deduplication by hashing chunk content if overlap is necessary for context.
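One quick way to check the start-index logic is a sketch like the following: with chunk_overlap set to 0, concatenating the chunks must reproduce the original text exactly, so no character can appear in two chunks.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    # Same chunking logic as in the walkthrough above.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - chunk_overlap
    return chunks

text = "abcdefghij" * 10  # 100 characters
chunks = chunk_text(text, 30, 0)

# With zero overlap, the chunks partition the text exactly.
assert "".join(chunks) == text
assert sum(len(c) for c in chunks) == len(text)
print(f"{len(chunks)} chunks, no characters repeated")  # prints: 4 chunks, no characters repeated
```

If either assertion fails after you modify the loop, the start index is being advanced by the wrong amount.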
Key Takeaways
- Set chunk_overlap to zero or a small value to prevent duplicate chunks.
- Adjust chunk_size to balance chunk granularity and context retention.
- Implement deduplication after chunking if some overlap is required for context.
- Verify chunking logic increments start index correctly to avoid repeated text.
- Use async chunking variations with the same overlap principles to avoid duplicates.