Fix overlapping chunks causing duplicate results
Quick answer
To fix overlapping chunks causing duplicate results, reduce or eliminate the chunk_overlap parameter and set chunk_size appropriately. Use non-overlapping chunks, or add a deduplication step after chunking, to avoid repeated content in your AI processing pipeline.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable.
pip install "openai>=1.0"
Step by step
This example shows how to chunk text without overlap to avoid duplicate results. Adjust chunk_size and set chunk_overlap to zero or a small value.
import os
from openai import OpenAI

# The OpenAI import is for the downstream processing pipeline;
# the chunking itself is plain Python and makes no API calls.

# Sample text to chunk
text = """This is a long document that needs to be split into chunks for processing. """ * 10

# Parameters for chunking
chunk_size = 50    # characters per chunk
chunk_overlap = 0  # no overlap to avoid duplicates

# Function to chunk text without overlap
def chunk_text(text, chunk_size, chunk_overlap):
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks

chunks = chunk_text(text, chunk_size, chunk_overlap)

# Print chunks to verify no duplicates
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output
Chunk 1: This is a long document that needs to be split int
Chunk 2: o chunks for processing. This is a long document t
Chunk 3: hat needs to be split into chunks for processing.
...
Chunk 15: hat needs to be split into chunks for processing.
The sample text is ten copies of one 75-character sentence split into 15 chunks of 50 characters, so the same three chunk contents recur, but each character of the text appears in exactly one chunk: the repetition reflects the repetitive input, not overlap.
Common variations
To allow some overlap but reduce duplicates, set chunk_overlap to a small value (e.g., 10-20 characters). Alternatively, implement a deduplication step after chunking by hashing or comparing chunk content.
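The hashing approach can be sketched as follows, using Python's standard hashlib (the helper name dedupe_chunks is illustrative, not from any library):

```python
import hashlib

def dedupe_chunks(chunks):
    # Keep the first occurrence of each chunk, keyed by a content hash.
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

chunks = ["alpha", "beta", "alpha", "gamma"]
print(dedupe_chunks(chunks))  # ['alpha', 'beta', 'gamma']
```

Hashing is useful when chunks are long: comparing fixed-size digests is cheaper than keeping full chunk strings in a set, and the digests can double as stable IDs in a vector store.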
For async chunking or streaming, adapt the chunking logic accordingly but keep overlap minimal.
import asyncio

async def async_chunk_text(text, chunk_size, chunk_overlap):
    # Same logic as the synchronous version; in a real pipeline you
    # would await I/O (e.g., fetching the text) inside this function.
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap
    return chunks

async def main():
    # Usage example with a small overlap
    chunks = await async_chunk_text(text, 50, 10)

    # Deduplicate chunks, preserving order
    unique_chunks = list(dict.fromkeys(chunks))
    print(f"Total chunks: {len(chunks)}")
    print(f"Unique chunks after deduplication: {len(unique_chunks)}")

asyncio.run(main())

Output
Total chunks: 19
Unique chunks after deduplication: 16
Because the sample text is periodic and the 10-character overlap makes the chunk starts cycle, three chunks recur verbatim; deduplication drops them while keeping the first occurrence of each.
Troubleshooting
- If you see duplicate results in your AI output, verify your chunking overlap is not too large.
- Check if your chunking logic increments the start index correctly to avoid reusing the same text.
- Use deduplication by hashing chunk content if overlap is necessary for context.
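One quick way to check the start-index logic is a sketch like the following: with chunk_overlap set to 0, concatenating the chunks must reproduce the original text exactly, so no character can appear in two chunks.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    # Same chunking logic as in the walkthrough above.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - chunk_overlap
    return chunks

text = "abcdefghij" * 10  # 100 characters
chunks = chunk_text(text, 30, 0)

# With zero overlap, the chunks partition the text exactly.
assert "".join(chunks) == text
assert sum(len(c) for c in chunks) == len(text)
print(f"{len(chunks)} chunks, no characters repeated")  # prints: 4 chunks, no characters repeated
```

If either assertion fails after you modify the loop, the start index is being advanced by the wrong amount.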
Key Takeaways
- Set chunk_overlap to zero or a small value to prevent duplicate chunks.
- Adjust chunk_size to balance chunk granularity and context retention.
- Implement deduplication after chunking if some overlap is required for context.
- Verify chunking logic increments start index correctly to avoid repeated text.
- Use async chunking variations with the same overlap principles to avoid duplicates.