Code Beginner easy · 4 min

Quality over quantity: the principle

What you will learn

A smaller dataset of high-quality examples often produces better fine-tuned models than a large dataset of noisy or repetitive data.

Why this matters

You'll waste GPU time and money training on junk data. Understanding what constitutes 'quality' prevents you from spending days iterating on a fundamentally flawed dataset.

Skip if: Your task has data whose quality barely varies (e.g., structured format conversion with consistent, templated examples); there, quantity may matter more. Likewise, if you already have a genuinely massive high-quality dataset (>1M examples) and compute budget is not a constraint, quantity is less of a bottleneck.

Explanation

Quality over quantity means that 10 expertly written, diverse training examples often teach a model better than 10,000 repetitive or poorly formatted ones. This is not just theory: the difference shows up in downstream task performance and in training cost.

Mechanically, when you fine-tune a language model, each training step updates model weights based on the gradient signal from the examples in the current batch. A high-quality example (clear instruction, correct output, realistic use case) produces clean gradients that reinforce useful behavior. A low-quality example (ambiguous, contradictory, off-topic, or duplicated) produces noisy gradients or reinforces the wrong thing. Purely random noise partly averages out over many steps, but the noise in real datasets is usually systematic: wrong answers, conflicting formats, and duplicates all push in a consistent direction. Averaging over thousands of steps does not recover the true signal in that case; it pulls the model toward a compromise between the right and wrong behaviors.
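The gradient argument can be sketched with a toy analogy: fitting the slope of a line instead of fine-tuning a language model. Everything here (the true slope of 2.0, the contradictory label `y = -x`, the noise levels, the helper names) is an assumption chosen for illustration. It is not a fine-tuning run, only a minimal sketch of how contradictory labels differ from zero-mean noise.

```python
import random

TRUE_SLOPE = 2.0  # assumed "correct behavior" for this toy example

def make_points(n, wrong_fraction, noise_std, rng):
    """Sample n points from y = 2x; a fraction gets the contradictory label y = -x."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        slope = -1.0 if rng.random() < wrong_fraction else TRUE_SLOPE
        points.append((x, slope * x + rng.gauss(0.0, noise_std)))
    return points

def fit_slope(points):
    """Least-squares slope of a line through the origin: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = {
    "20 clean points": make_points(20, 0.0, 0.05, rng),
    "2000 points, zero-mean noise": make_points(2000, 0.0, 0.5, rng),
    "2000 points, half contradictory": make_points(2000, 0.5, 0.05, rng),
}

# Absolute distance between the fitted slope and the true slope
errors = {name: abs(fit_slope(pts) - TRUE_SLOPE) for name, pts in samples.items()}
for name, err in errors.items():
    print(f"{name}: slope error {err:.3f}")
```

In this sketch the two sets without contradictory labels should land close to the true slope, while the set with contradictory labels settles between the two slopes no matter how many points it has.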

Use this principle as your first instinct: start with a small, curated dataset (50–500 examples) that you've personally validated. Train, evaluate, and measure performance. Only increase quantity if you have data quality guarantees or a clear performance plateau. This saves weeks of debugging 'why is my 50k-example model worse than the 500-example baseline?'
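One way to make the "only increase quantity when it pays off" decision explicit is to log an evaluation score per dataset size and check whether the latest increase moved it. The helper name `more_data_helped`, the `min_gain` threshold, and the scores below are all hypothetical placeholders, not measurements.

```python
def more_data_helped(scores_by_size, min_gain=0.01):
    """Return True if the largest dataset beat the next-largest by at least min_gain."""
    sizes = sorted(scores_by_size)
    if len(sizes) < 2:
        return True  # not enough runs to judge yet; keep measuring
    previous, latest = sizes[-2], sizes[-1]
    return scores_by_size[latest] - scores_by_size[previous] >= min_gain

# Placeholder eval scores (higher is better), keyed by number of training examples
runs = {50: 0.61, 200: 0.72, 500: 0.78, 2000: 0.781}

if more_data_helped(runs):
    print("The last increase helped: adding validated data may be worth it.")
else:
    print("Score has plateaued: improve data quality before adding more rows.")
```

The threshold is a judgment call; pick it relative to the run-to-run variance of your own evaluation metric.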

Analogy

Teaching a child: 10 clear, real-world lessons with immediate feedback beats 1,000 hours of reading from a poorly-written textbook. One well-explained concept sticks; a thousand vague ones do nothing.

Code

Illustrative only: this builds and compares the two datasets and does not run training. It needs only the datasets package.

python

from datasets import Dataset

# High-quality curated dataset (SMALL)
high_quality_examples = [
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "Translate to French: What is your name?", "output": "Quel est votre nom?"},
    {"instruction": "Translate to French: I love learning languages.", "output": "J'adore apprendre les langues."},
]

# Low-quality dataset (LARGE but noisy)
low_quality_examples = [
    {"instruction": "translate french hello", "output": "hello is bonjour"},
    {"instruction": "Translate to French: Hello", "output": "Bonjour"},
    {"instruction": "hello french", "output": "bonjour"},
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour comment ca va"},
    {"instruction": "how u", "output": "comment vous"},
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "French hello what", "output": "bonjour quoi"},
    {"instruction": "Translate to French: I love languages", "output": "j'aime languages"},
] * 50  # Repeated 50 times to simulate junk dataset

# Create datasets
high_quality_dataset = Dataset.from_list(high_quality_examples)
low_quality_dataset = Dataset.from_list(low_quality_examples)

print(f"High-quality dataset: {len(high_quality_examples)} examples")
print(f"Low-quality dataset: {len(low_quality_dataset)} examples")
print(f"\nHigh-quality sample: {high_quality_examples[0]}")
print(f"Low-quality sample: {low_quality_examples[0]}")
print("\nDataset quality difference:")
print("  - High-quality: expert-formatted, diverse, no duplicates")
print("  - Low-quality: typos, inconsistent formatting, 8 unique rows repeated 50 times")

Output

High-quality dataset: 3 examples
Low-quality dataset: 400 examples

High-quality sample: {'instruction': 'Translate to French: Hello, how are you?', 'output': 'Bonjour, comment allez-vous?'}
Low-quality sample: {'instruction': 'translate french hello', 'output': 'hello is bonjour'}

Dataset quality difference:
  - High-quality: expert-formatted, diverse, no duplicates
  - Low-quality: typos, inconsistent formatting, 8 unique rows repeated 50 times

What just happened?

The code created two datasets side by side: one with 3 hand-curated, grammatically correct examples, and one with 400 noisy rows (8 distinct rows repeated 50 times). Three examples are far too few to teach translation on their own, but the contrast is the point: every row in the small set carries clean, consistent signal, while the large set is mostly duplication and conflicting targets (the same instruction appears with two different outputs), which would waste training compute and confuse the model.

Common gotcha

Developers often assume 'more data = better model' because that was true for pre-training at billion-example scale. Fine-tuning is different: you're teaching a model a specific skill, not general language understanding. One bad example repeated 100 times adds no new information; it pulls the weights toward the bad output 100 times and steals training steps from good examples. Measure your dataset quality before training, not after.
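A pre-training check along these lines can be sketched in plain Python. The `audit` and `deduplicate` helpers are hypothetical names, and the two signals they compute (exact duplicates, and instructions paired with more than one output) are only a starting point, not a complete definition of quality. The `instruction` and `output` keys follow the example datasets in this lesson.

```python
from collections import Counter, defaultdict

def audit(examples):
    """Count exact duplicate rows and find instructions with conflicting outputs."""
    pair_counts = Counter((ex["instruction"], ex["output"]) for ex in examples)
    outputs_by_instruction = defaultdict(set)
    for instruction, output in pair_counts:
        outputs_by_instruction[instruction].add(output)
    conflicting = sorted(
        instruction
        for instruction, outputs in outputs_by_instruction.items()
        if len(outputs) > 1
    )
    return {
        "total": len(examples),
        "unique": len(pair_counts),
        "duplicates": len(examples) - len(pair_counts),
        "conflicting_instructions": conflicting,
    }

def deduplicate(examples):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = {}
    for ex in examples:
        seen.setdefault((ex["instruction"], ex["output"]), ex)
    return list(seen.values())

# Small stand-in for a noisy dataset: 3 distinct rows repeated 50 times
noisy = [
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour comment ca va"},
    {"instruction": "Translate to French: Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "hello french", "output": "bonjour"},
] * 50

report = audit(noisy)
print(report)
print(f"Rows after deduplication: {len(deduplicate(noisy))}")
```

Deduplication removes the repeats but not the conflict: the instruction flagged in `conflicting_instructions` still needs a human to decide which output is correct.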

Error recovery

RuntimeError: CUDA out of memory

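Symptom of a batch that does not fit in GPU memory. This depends on model size, sequence length, and batch size, not on how many examples the dataset contains. Fix: lower per_device_train_batch_size in SFTConfig (raising gradient_accumulation_steps keeps the effective batch size the same), or truncate very long examples. A quality-first dataset does not prevent this error, but it keeps each retry short and cheap.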
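The batch-size fix can be sketched as a small helper that halves the per-device batch and doubles gradient accumulation, so the effective batch size stays the same. The helper `halve_batch` is hypothetical. The two keyword names are the Hugging Face `TrainingArguments` fields that `SFTConfig` inherits; passing the resulting dict to `SFTConfig` is assumed and not executed here.

```python
def halve_batch(train_kwargs):
    """Halve the per-device batch and double accumulation; their product is unchanged."""
    batch = train_kwargs["per_device_train_batch_size"]
    accumulation = train_kwargs.get("gradient_accumulation_steps", 1)
    if batch < 2 or batch % 2:
        raise ValueError("batch size must be an even number >= 2 to halve it")
    return {
        **train_kwargs,
        "per_device_train_batch_size": batch // 2,
        "gradient_accumulation_steps": accumulation * 2,
    }

# Starting values are placeholders; use whatever triggered the out-of-memory error
before = {"per_device_train_batch_size": 8, "gradient_accumulation_steps": 1}
after = halve_batch(before)

def effective(kwargs):
    """Examples contributing to each optimizer step on a single device."""
    return kwargs["per_device_train_batch_size"] * kwargs["gradient_accumulation_steps"]

print(f"before: {before} (effective batch {effective(before)})")
print(f"after:  {after} (effective batch {effective(after)})")
```

If the batch size is already 1, the remaining levers are shorter sequences or a smaller or more aggressively quantized model.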

Model loss increases during training

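Often caused by training on low-quality examples that contradict each other; a learning rate that is too high produces the same symptom. Fix: manually inspect 20 random examples from your dataset and remove contradictions, typos, and off-topic entries. If the sample looks clean, try a lower learning rate.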
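Drawing the spot-check sample reproducibly makes it easier to compare notes with a teammate. A minimal sketch, assuming the dataset is a list of dicts with `instruction` and `output` keys; the helper name `spot_check` and the placeholder rows are hypothetical.

```python
import random

def spot_check(examples, k=20, seed=0):
    """Return up to k examples drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(examples, min(k, len(examples)))

# Placeholder dataset standing in for your real training rows
dataset = [{"instruction": f"example {i}", "output": f"answer {i}"} for i in range(500)]

for row in spot_check(dataset):
    print(row["instruction"], "->", row["output"])
```

Changing the seed gives a fresh sample for a second pass after you have cleaned the first batch of problems.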

Experienced dev note

The shift from 'more data' thinking to 'better data' thinking is one of the biggest productivity jumps in fine-tuning work. A useful rule of thumb: spend roughly 80% of your time curating 500 examples and 20% training, not the reverse. Also: if your dataset is noisy, training longer tends to make it worse, not better, because the model starts memorizing the noise. A 1-epoch run on good data usually beats 10 epochs on bad data. Budget accordingly.

Check your understanding

You have 10,000 customer support responses to fine-tune a model on. Half are from 2018 (old terminology, different product), half are from 2024 (current terminology). Your team says 'let's use all 10k to be safe.' Using the quality-first principle, what would you do differently and why?

Answer hint

A correct answer identifies that the 2018 data introduces noise/contradiction into training, and proposes curating or filtering to the 5,000 recent, current-terminology examples. The key insight is that 5,000 relevant examples will train a better model than 10,000 mixed-era examples, even though intuition says 'more is safer.'
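The filtering step in this answer can be sketched as follows. The `year` field, the `keep_current` helper, and the synthetic records are assumptions for illustration; real support data would need its own date field and probably a manual review on top of the date cut.

```python
def keep_current(records, min_year=2024):
    """Keep only records whose (assumed) 'year' field is at or after min_year."""
    return [record for record in records if record["year"] >= min_year]

# Synthetic stand-in for the scenario: 5,000 responses from each era
responses = (
    [{"year": 2018, "text": f"old response {i}"} for i in range(5000)]
    + [{"year": 2024, "text": f"current response {i}"} for i in range(5000)]
)

current = keep_current(responses)
print(f"Kept {len(current)} of {len(responses)} responses")
```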

Version note: This principle is version-agnostic and applies to transformers 4.x and 5.x equally. However, with trl >= 1.0 and PEFT >= 0.11, training is efficient enough that you can experiment with small datasets on consumer GPUs (8GB VRAM), making quality-first iteration practical.

Now that you know quality matters, learn to measure it: how to structure evaluation sets and compute metrics that actually tell you if your fine-tuned model is getting better.
