Quality over quantity: the principle
Why this matters
You'll waste GPU time and money training on junk data. Understanding what constitutes 'quality' prevents you from spending days iterating on a fundamentally flawed dataset.
Explanation
Quality over quantity means that 10 expertly-written, diverse training examples often teach a model better than 10,000 repetitive or poorly-formatted ones. This is not just theory: it's measurable in downstream task performance and inference costs.
Mechanically, when you fine-tune a language model, each training step updates the model weights based on the gradient computed from the examples in that step. A high-quality example (clear instruction, correct output, realistic use case) produces a gradient that reinforces useful behavior. A low-quality example (ambiguous, contradictory, off-topic, or duplicated) produces a noisy gradient or reinforces the wrong thing. Random noise can partly average out over many steps, but systematic problems such as wrong outputs, contradictory answers, or heavy duplication do not: the model learns them along with the signal.
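To make the gradient argument concrete, here is a toy sketch, not a real fine-tuning run: a single weight w is fit by plain per-example SGD to the rule y = 2x, first on consistent examples and then on the same examples mixed with contradictory ones. The function name sgd_fit, the learning rate, the epoch count, and the data are all assumptions chosen for illustration.

```python
def sgd_fit(examples, lr=0.01, epochs=200):
    """Fit y = w * x with per-example SGD on squared error (toy model)."""
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            grad = 2 * (w * x - y) * x  # derivative of (w*x - y)**2 with respect to w
            w -= lr * grad
    return w

# Consistent examples: every pair agrees that y = 2x.
clean = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]

# Contradictory examples: same inputs, opposite targets (hypothetical bad labels).
contradictory = [(x, -2.0 * x) for x in (1.0, 2.0, 3.0)]

w_clean = sgd_fit(clean)
w_mixed = sgd_fit(clean + contradictory)

print(f"clean data only:       w = {w_clean:.3f}")
print(f"clean + contradictory: w = {w_mixed:.3f}")
```

On the clean set the weight settles at 2. On the mixed set it settles far from 2, and running more epochs does not pull it back, because the contradictory targets push against the correct ones on every pass.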
Use this principle as your first instinct: start with a small, curated dataset (50–500 examples) that you've personally validated. Train, evaluate, and measure performance. Only increase quantity if you have data quality guarantees or a clear performance plateau. This saves weeks of debugging 'why is my 50k-example model worse than the 500-example baseline?'
Analogy
Teaching a child: 10 clear, real-world lessons with immediate feedback beat 1,000 hours of reading from a poorly written textbook. One well-explained concept sticks; a thousand vague ones do nothing.
Code
import json
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
# High-quality curated dataset (SMALL)
high_quality_examples = [
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "Translate to French: What is your name?", "output": "Quel est votre nom?"},
    {"instruction": "Translate to French: I love learning languages.", "output": "J'adore apprendre les langues."},
]
# Low-quality dataset (LARGE but noisy)
low_quality_examples = [
    {"instruction": "translate french hello", "output": "hello is bonjour"},
    {"instruction": "Translate to French: Hello", "output": "Bonjour"},
    {"instruction": "hello french", "output": "bonjour"},
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour comment ca va"},
    {"instruction": "how u", "output": "comment vous"},
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "French hello what", "output": "bonjour quoi"},
    {"instruction": "Translate to French: I love languages", "output": "j'aime languages"},
] * 50  # Repeated 50 times to simulate junk dataset
# Create datasets
high_quality_dataset = Dataset.from_list(high_quality_examples)
low_quality_dataset = Dataset.from_list(low_quality_examples)
print(f"High-quality dataset: {len(high_quality_examples)} examples")
print(f"Low-quality dataset: {len(low_quality_dataset)} examples")
print(f"\nHigh-quality sample: {high_quality_examples[0]}")
print(f"Low-quality sample: {low_quality_examples[0]}")
print(f"\nDataset quality difference:")
print(f" - High-quality: expert-formatted, diverse, no duplicates")
print(f" - Low-quality: inconsistent formatting, wrong translations, 8 unique rows repeated 50 times")
Output
High-quality dataset: 3 examples
Low-quality dataset: 400 examples

High-quality sample: {'instruction': 'Translate to French: Hello, how are you?', 'output': 'Bonjour, comment allez-vous?'}
Low-quality sample: {'instruction': 'translate french hello', 'output': 'hello is bonjour'}

Dataset quality difference:
 - High-quality: expert-formatted, diverse, no duplicates
 - Low-quality: inconsistent formatting, wrong translations, 8 unique rows repeated 50 times
What just happened?
The code built two datasets side by side: one with 3 hand-curated, correctly formatted examples, and one with 400 rows that are really 8 distinct rows repeated 50 times, most of them malformed or mistranslated. No training happens here; the point is the contrast. The 3-example set gives a clean, consistent signal about the task and its format, while the 400-row set is mostly noise and duplication that would waste training compute and confuse the model.
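The duplication in the noisy set can be checked directly. This is a minimal sketch, assuming exact-match duplicates on the (instruction, output) pair are what you want to count; the helper name count_unique is hypothetical, and near-duplicates would need fuzzier matching than this.

```python
def count_unique(examples):
    """Count distinct (instruction, output) pairs using exact string match."""
    return len({(ex["instruction"], ex["output"]) for ex in examples})

# Two of the noisy rows from the example above, repeated the same way.
rows = [
    {"instruction": "translate french hello", "output": "hello is bonjour"},
    {"instruction": "hello french", "output": "bonjour"},
] * 50

total = len(rows)
unique = count_unique(rows)
print(f"{total} rows, {unique} unique, {total - unique} exact duplicates")
```

Running the same count on the full low_quality_examples list is a quick way to see how little distinct content a large-looking dataset really holds.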
Common gotcha
Developers often assume 'more data = better model' because that was true for pre-training at billion-example scale. Fine-tuning is different: you're teaching a model a specific skill, not general language understanding. One bad example repeated 100 times adds no new information; it just gets 100 times the weight and steals training steps from good ones. Measure your dataset quality before training, not after.
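One way to measure quality before training is a simple lint pass over every example. The sketch below assumes the task uses a fixed instruction template ("Translate to French: ") and that outputs must be non-empty; the name lint_example and both rules are illustrative assumptions, not a standard tool.

```python
EXPECTED_PREFIX = "Translate to French: "  # assumed template for this task

def lint_example(example):
    """Return a list of problems found in one example (empty list means it passed)."""
    problems = []
    instruction = example.get("instruction", "")
    output = example.get("output", "")
    if not instruction.startswith(EXPECTED_PREFIX):
        problems.append("instruction does not follow the template")
    if not output.strip():
        problems.append("empty output")
    return problems

samples = [
    {"instruction": "Translate to French: What is your name?", "output": "Quel est votre nom?"},
    {"instruction": "hello french", "output": "bonjour"},
]

for sample in samples:
    issues = lint_example(sample)
    status = "OK" if not issues else "; ".join(issues)
    print(f"{sample['instruction']!r}: {status}")
```

Checks like these catch format drift only. A wrong translation in a well-formatted row passes the lint, so human review of the outputs is still needed.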
Error recovery
RuntimeError: CUDA out of memory
Model loss increases during training
Experienced dev note
The shift from 'more data' thinking to 'better data' thinking is one of the biggest productivity jumps in fine-tuning work. As a rule of thumb, senior ML engineers spend around 80% of their time curating 500 examples and 20% training, not the reverse. Also: if your dataset is noisy, training longer usually makes things worse, not better, because the model fits the noise more closely. A 1-epoch run on good data will often beat 10 epochs on bad data. Budget accordingly.
Check your understanding
You have 10,000 customer support responses to fine-tune a model on. Half are from 2018 (old terminology, different product), half are from 2024 (current terminology). Your team says 'let's use all 10k to be safe.' Using the quality-first principle, what would you do differently and why?
Answer hint
A correct answer identifies that the 2018 data introduces noise/contradiction into training, and proposes curating or filtering to the 5,000 recent, current-terminology examples. The key insight is that 5,000 relevant examples will train a better model than 10,000 mixed-era examples, even though intuition says 'more is safer.'
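The filtering step in the hint could be sketched as follows. The record layout, the year field, and the helper name filter_recent are all hypothetical; real support data would carry its own timestamp or product-version field to filter on.

```python
def filter_recent(records, min_year):
    """Keep records whose 'year' field is at or after min_year."""
    return [r for r in records if r["year"] >= min_year]

# Hypothetical records standing in for the customer support responses.
records = [
    {"year": 2018, "response": "..."},
    {"year": 2024, "response": "..."},
    {"year": 2024, "response": "..."},
]

recent = filter_recent(records, min_year=2024)
print(f"kept {len(recent)} of {len(records)} records")
```

Filtering by era is only the first cut. The remaining examples still deserve the same validation as any other curated set.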