How-to · Beginner · 3 min read

How to deduplicate training data

Quick answer
To deduplicate training data for fine-tuning, use hashing or fingerprinting methods to identify and remove exact or near-duplicate entries. This yields a cleaner dataset, reduces memorization of repeated examples, and improves model generalization during fine-tuning.

PREREQUISITES

  • Python 3.8+
  • pip install pandas
  • pip install datasketch

Setup

Install necessary Python libraries for data manipulation and deduplication.

bash
pip install pandas datasketch

Step by step

This example shows how to deduplicate a dataset of text samples using hashing and pandas.

python
import pandas as pd
import hashlib

# Sample training data with duplicates
training_data = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate
    "An apple a day keeps the doctor away.",
    "An apple a day keeps the doctor away!",  # near duplicate
    "To be or not to be, that is the question."
]

# Create DataFrame
df = pd.DataFrame(training_data, columns=["text"])

# Function to hash text for exact duplicate detection
def hash_text(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Add hash column
df['hash'] = df['text'].apply(hash_text)

# Drop exact duplicates based on hash
df_dedup = df.drop_duplicates(subset=['hash'])

print("Deduplicated data (exact matches removed):")
print(df_dedup['text'].to_list())
output
Deduplicated data (exact matches removed):
['The quick brown fox jumps over the lazy dog.', 'An apple a day keeps the doctor away.', 'An apple a day keeps the doctor away!', 'To be or not to be, that is the question.']
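Note that hashing only catches byte-identical strings: the two apple sentences survive above because they differ by a single punctuation character. A lightweight extension is to normalize text before hashing. The sketch below is illustrative (the `normalized_hash` helper is not part of any library); it lowercases, collapses whitespace, and strips trailing punctuation, so trivially different copies collapse to the same hash:

```python
import hashlib
import re

def normalized_hash(text):
    # Lowercase, collapse runs of whitespace, and strip trailing
    # punctuation before hashing, so near-identical copies match
    norm = re.sub(r"\s+", " ", text.lower().strip())
    norm = norm.rstrip(".!?")
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

a = normalized_hash("An apple a day keeps the doctor away.")
b = normalized_hash("An apple a day  keeps the doctor away!")
print(a == b)  # True: both variants hash identically
```

How aggressively to normalize is a judgment call: stripping punctuation is usually safe for deduplication, but heavier normalization (e.g. removing all non-alphanumerics) risks merging genuinely different samples.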

Common variations

For near-duplicate detection, use fuzzy matching or MinHash techniques to catch minor text variations.
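For small datasets, Python's built-in `difflib` is a simple fuzzy-matching option before reaching for MinHash. The sketch below is one possible approach (the `similarity` and `dedup_fuzzy` helpers are illustrative, not from a library); its pairwise comparison is O(n²), so it only suits small lists:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Character-level similarity ratio in [0, 1]; 1.0 means identical
    return SequenceMatcher(None, a, b).ratio()

def dedup_fuzzy(texts, threshold=0.9):
    # O(n^2) pairwise comparison: fine for small lists,
    # too slow for large corpora (use MinHash/LSH there)
    kept = []
    for t in texts:
        if all(similarity(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

samples = [
    "An apple a day keeps the doctor away.",
    "An apple a day keeps the doctor away!",
    "To be or not to be, that is the question.",
]
print(dedup_fuzzy(samples))  # keeps 2 of the 3 sentences
```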

Example using MinHash from datasketch:

python
from datasketch import MinHash, MinHashLSH

# Function to create MinHash from text
def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode('utf8'))
    return m

texts = [
    "An apple a day keeps the doctor away.",
    "An apple a day keeps the doctor away!",
    "An apple a day keeps the doctor away",
    "To be or not to be, that is the question."
]

# Create MinHash objects
minhashes = [get_minhash(t) for t in texts]

# Use LSH to find near duplicates
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for i, m in enumerate(minhashes):
    lsh.insert(f"m{i}", m)

# Query near duplicates for first text
result = lsh.query(minhashes[0])

print("Near duplicates found for first text:", result)
output
Near duplicates found for first text: ['m0', 'm1', 'm2']

Troubleshooting

  • If exact duplicates remain, verify hashing function consistency and encoding.
  • For large datasets, use batch processing or scalable libraries like Apache Spark.
  • Near-duplicate detection may require tuning similarity thresholds to balance recall and precision.
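For the batch-processing point above, one pattern is to stream a CSV through pandas in chunks, keeping only the set of seen hashes in memory rather than the full dataset. A minimal sketch, assuming a file with a `text` column (`demo.csv` is a stand-in created here for demonstration):

```python
import hashlib

import pandas as pd

# Small demo file standing in for a dataset too large to load at once
pd.DataFrame({"text": ["a", "b", "a", "c", "b"]}).to_csv("demo.csv", index=False)

seen = set()
unique_rows = []
# Stream in chunks: memory usage scales with the hash set, not the file
for chunk in pd.read_csv("demo.csv", chunksize=2):
    for text in chunk["text"]:
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique_rows.append(text)

print(unique_rows)  # ['a', 'b', 'c']
```

Beyond a single machine, the same hash-and-drop logic maps directly onto distributed frameworks such as Apache Spark.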

Key Takeaways

  • Use hashing to efficiently remove exact duplicate training samples before fine-tuning.
  • Apply MinHash or fuzzy matching to detect and remove near-duplicates for cleaner data.
  • Deduplication improves model quality by reducing overfitting and redundant learning.