How to deduplicate training data
Quick answer
To deduplicate training data for fine-tuning, use hashing or fingerprinting methods to identify and remove exact or near-duplicate entries. This ensures cleaner datasets, reduces overfitting, and improves model generalization during fine-tuning.
Prerequisites
- Python 3.8+
- pip install pandas
- pip install datasketch
Setup
Install necessary Python libraries for data manipulation and deduplication.
```shell
pip install pandas datasketch
```

Step by step
This example shows how to deduplicate a dataset of text samples using hashing and pandas.
```python
import pandas as pd
import hashlib

# Sample training data with duplicates
training_data = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate
    "An apple a day keeps the doctor away.",
    "An apple a day keeps the doctor away!",  # near duplicate
    "To be or not to be, that is the question."
]

# Create DataFrame
df = pd.DataFrame(training_data, columns=["text"])

# Function to hash text for exact duplicate detection
def hash_text(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Add hash column
df['hash'] = df['text'].apply(hash_text)

# Drop exact duplicates based on hash
df_dedup = df.drop_duplicates(subset=['hash'])

print("Deduplicated data (exact matches removed):")
print(df_dedup['text'].to_list())
```

Output:

```
Deduplicated data (exact matches removed):
['The quick brown fox jumps over the lazy dog.', 'An apple a day keeps the doctor away.', 'An apple a day keeps the doctor away!', 'To be or not to be, that is the question.']
```
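Exact hashing misses entries that differ only trivially, such as the two apple sentences above. One common refinement (a sketch, not part of the pipeline above) is to normalize the text before hashing — lowercase it, strip punctuation, and collapse whitespace — so near-identical entries produce the same hash:

```python
import hashlib
import re

def normalized_hash(text):
    """Hash a canonical form of the text: lowercase, punctuation
    removed, whitespace collapsed."""
    canonical = re.sub(r"[^\w\s]", "", text.lower())
    canonical = " ".join(canonical.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Both sentences normalize to the same string, so their hashes match
# and drop_duplicates on this hash would remove one of them.
a = normalized_hash("An apple a day keeps the doctor away.")
b = normalized_hash("An apple a day keeps the doctor away!")
print(a == b)
```

How aggressive the normalization should be depends on the data: for code or case-sensitive text, lowercasing may merge samples that are genuinely distinct.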
Common variations
For near-duplicate detection, use fuzzy matching or MinHash techniques to catch minor text variations.
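For small datasets, fuzzy matching needs nothing beyond the standard library's difflib; this sketch compares pairs directly (the 0.9 cutoff is an illustrative choice, not a recommended default):

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.9):
    """True when the similarity ratio of the two strings meets the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(is_near_duplicate(
    "An apple a day keeps the doctor away.",
    "An apple a day keeps the doctor away!",
))  # True: the strings differ by a single character
```

Pairwise comparison is O(n²) in the number of samples, which is why larger datasets call for MinHash with LSH, as shown next.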
Example using MinHash from datasketch:
```python
from datasketch import MinHash, MinHashLSH

# Function to create a MinHash from text
def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode('utf8'))
    return m

texts = [
    "An apple a day keeps the doctor away.",
    "An apple a day keeps the doctor away!",
    "An apple a day keeps the doctor away",
    "To be or not to be, that is the question."
]

# Create MinHash objects
minhashes = [get_minhash(t) for t in texts]

# Use LSH to find near duplicates
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for i, m in enumerate(minhashes):
    lsh.insert(f"m{i}", m)

# Query near duplicates for the first text
result = lsh.query(minhashes[0])
print("Near duplicates found for first text:", result)
```

Output:

```
Near duplicates found for first text: ['m0', 'm1', 'm2']
```
Troubleshooting
- If exact duplicates remain, verify hashing function consistency and encoding.
- For large datasets, use batch processing or scalable libraries like Apache Spark.
- Near-duplicate detection may require tuning similarity thresholds to balance recall and precision.
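For datasets too large to hold in a single DataFrame, exact deduplication can be done in a streaming pass: keep a set of seen hashes and emit only unseen records. A sketch using the same SHA-256 hashing as above (at extreme scale, the seen-set itself must be sharded or moved to a framework like Spark):

```python
import hashlib

def stream_dedup(records):
    """Yield each record the first time its hash is seen."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record

data = ["a", "b", "a", "c", "b"]
print(list(stream_dedup(data)))  # ['a', 'b', 'c']
```

Storing 32-byte digests instead of full records keeps memory bounded by the number of unique samples rather than total text size.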
Key takeaways
- Use hashing to efficiently remove exact duplicate training samples before fine-tuning.
- Apply MinHash or fuzzy matching to detect and remove near-duplicates for cleaner data.
- Deduplication improves model quality by reducing overfitting and redundant learning.