
Fix overlapping chunks causing duplicate results

Quick answer
To fix overlapping chunks causing duplicate results, reduce chunk_overlap or set it to zero, and keep it well below chunk_size so that each chunk starts at a new position in the text. If you need some overlap for context, deduplicate after chunking so that repeated content does not reach the rest of your AI processing pipeline.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable.

bash
pip install "openai>=1.0"

Step by step

This example shows how to chunk text without overlap to avoid duplicate results. Adjust chunk_size and set chunk_overlap to zero or a small value.

python
import os
from openai import OpenAI

# Sample text to chunk
text = """This is a long document that needs to be split into chunks for processing. """ * 10

# Parameters for chunking
chunk_size = 50  # characters per chunk
chunk_overlap = 0  # no overlap to avoid duplicates

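# Split text into fixed-size character chunks.
# Consecutive chunks share chunk_overlap characters (0 means no shared text).
def chunk_text(text, chunk_size, chunk_overlap):
    # The step must be positive, otherwise the loop below never advances
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        # The last chunk may be shorter than chunk_size
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks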

chunks = chunk_text(text, chunk_size, chunk_overlap)

# Print each chunk to inspect where the boundaries fall
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")
output
Chunk 1: This is a long document that needs to be split int

Chunk 2: o chunks for processing. This is a long document t

Chunk 3: hat needs to be split into chunks for processing. 

Chunk 4: This is a long document that needs to be split int

Chunk 5: o chunks for processing. This is a long document t

Chunk 6: hat needs to be split into chunks for processing. 

Chunk 7: This is a long document that needs to be split int

Chunk 8: o chunks for processing. This is a long document t

Chunk 9: hat needs to be split into chunks for processing. 

Chunk 10: This is a long document that needs to be split int

Chunk 11: o chunks for processing. This is a long document t

Chunk 12: hat needs to be split into chunks for processing. 

Chunk 13: This is a long document that needs to be split int

Chunk 14: o chunks for processing. This is a long document t

Chunk 15: hat needs to be split into chunks for processing. 

The sample text repeats one 75-character sentence ten times (750 characters), so the same three 50-character chunks recur. That repetition comes from the input itself, not from overlap: every character of the text appears in exactly one chunk.

Common variations

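To keep some context across chunk boundaries, set chunk_overlap to a small value (for example, 10-20 characters with a chunk_size of 50). You can also deduplicate after chunking by hashing or comparing chunk content. Exact-match deduplication only removes chunks that are identical; two overlapping chunks share text at their edges but are not identical, so both survive it.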
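As a minimal sketch of the hashing approach, the helper below hashes a normalized copy of each chunk with SHA-256 and keeps only the first occurrence. The name dedupe_chunks and the normalization rule (collapse whitespace, ignore case) are assumptions made for illustration, not part of any library.

```python
import hashlib


def dedupe_chunks(chunks):
    """Drop chunks whose normalized text was already seen, keeping first occurrences in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Assumed normalization: collapse whitespace runs and ignore case
        normalized = " ".join(chunk.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


chunks = ["Hello world. ", "hello  world.", "Another chunk"]
print(dedupe_chunks(chunks))  # ['Hello world. ', 'Another chunk']
```

Storing digests instead of full chunk text keeps the seen set small when chunks are large.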

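If your pipeline is asynchronous, the same logic can be wrapped in an async function so that it can be awaited alongside other steps. The overlap rules are unchanged: keep chunk_overlap small and always smaller than chunk_size.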

python
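import asyncio

async def async_chunk_text(text, chunk_size, chunk_overlap):
    # Same logic as chunk_text, declared async so it can be awaited in an async pipeline
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks

async def main():
    # Same sample text as in the step-by-step example
    text = """This is a long document that needs to be split into chunks for processing. """ * 10

    # Usage example with a small overlap
    chunks = await async_chunk_text(text, 50, 10)

    # Exact-match deduplication; dict.fromkeys preserves the original order
    unique_chunks = list(dict.fromkeys(chunks))

    print(f"Total chunks: {len(chunks)}")
    print(f"Unique chunks after deduplication: {len(unique_chunks)}")

# await is only valid inside a coroutine, so run the example through asyncio.run
asyncio.run(main())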
output
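Total chunks: 19
Unique chunks after deduplication: 16

The three chunks removed here are exact repeats caused by the repeating sample text. Chunks that only overlap at their edges are not identical, so exact-match deduplication keeps them.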
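The total of 19 follows from the step size: a new chunk starts every chunk_size - chunk_overlap characters, so a text of n characters yields ceil(n / step) chunks. The sketch below works through that arithmetic for the 750-character sample; expected_chunk_count is a hypothetical helper name introduced only for illustration.

```python
import math


def expected_chunk_count(text_length, chunk_size, chunk_overlap):
    """Number of chunks from a sliding window that advances by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return math.ceil(text_length / step)


# The sample sentence is 75 characters long and is repeated 10 times
text_length = 75 * 10

print(expected_chunk_count(text_length, 50, 0))   # 15
print(expected_chunk_count(text_length, 50, 10))  # 19

# Share of each full-size chunk that repeats text from the previous chunk
print(10 / 50)  # 0.2
```

With an overlap of 10 on a chunk_size of 50, one fifth of every full-size chunk after the first repeats text that was already in the previous chunk, which is why larger overlaps produce more visibly repeated results.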

Troubleshooting

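  • If you see duplicate results in your AI output, check that chunk_overlap is small relative to chunk_size. A fraction chunk_overlap / chunk_size of each chunk repeats the previous one.
  • Check that your chunking logic advances the start index by chunk_size - chunk_overlap on every iteration, and that this step is greater than zero. A step of zero or less re-reads the same text indefinitely.
  • If overlap is necessary for context, hash chunk content to remove exact repeats. Overlapping chunks are not identical, so hashing alone will not collapse them.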
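When overlap has to stay, one way to avoid showing the shared text twice is to track each chunk's character span and merge spans that overlap before using the results. This is a sketch under assumptions: chunk_with_offsets and merge_overlapping_spans are hypothetical helpers, and the list of hits stands in for whatever your retrieval step returns.

```python
def chunk_with_offsets(text, chunk_size, chunk_overlap):
    """Return (start, end) character spans for a sliding window over text."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [(s, min(s + chunk_size, len(text))) for s in range(0, len(text), step)]


def merge_overlapping_spans(spans):
    """Merge spans that overlap or touch so shared text is emitted only once."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of adding a second, overlapping one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "abcdefghijklmnopqrstuvwxyz"
spans = chunk_with_offsets(text, 10, 4)

# Assumption: a retrieval step returned the first, second, and last chunks
hits = [spans[0], spans[1], spans[-1]]

for start, end in merge_overlapping_spans(hits):
    print(text[start:end])
# abcdefghijklmnop
# yz
```

The first two hits overlap by four characters, so they are returned as one passage; the last hit is unrelated and stays separate.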

Key Takeaways

  • Set chunk_overlap to zero or a small value to prevent duplicate chunks.
  • Adjust chunk_size to balance chunk granularity and context retention.
  • If some overlap is required for context, deduplicate after chunking to remove exact repeats; overlapping chunks themselves are not identical and will remain.
  • Verify that the start index advances by chunk_size - chunk_overlap and that this step is positive, so no text is re-read.
  • Async variants of the chunking function follow the same overlap rules.
Verified 2026-04