Code Beginner easy · 5 min

Minimum viable dataset

What you will learn
Create the smallest valid dataset structure that works with SFTTrainer to understand what fine-tuning actually requires.

Why this matters

Before you download massive datasets or worry about data cleaning, you need to know the bare minimum format that makes fine-tuning work. This unblocks experimentation and builds confidence that your pipeline is correct before scaling.

Skip if: You don't need a minimum viable dataset if you're using pre-built datasets from Hugging Face Hub or if you already have a production dataset pipeline. But if you're experimenting with custom data or diagnosing a data loading issue, this is essential.

Explanation

A minimum viable dataset for fine-tuning is just a list of dictionaries with a text key containing prompt-response pairs. That's it. No labels, no splits, no complex structure: just raw text examples.
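To make the format concrete, here is a minimal sketch of a check you could run on hand-written examples before handing them to the trainer. The helper name validate_records is hypothetical (it is not part of trl or datasets); it only confirms that each record is a dict with a non-empty string under the text key.

```python
def validate_records(records):
    """Return a list of problems found; an empty list means the data looks usable.

    Hypothetical helper for illustration, not part of trl or datasets.
    """
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected dict, got {type(rec).__name__}")
        elif "text" not in rec:
            problems.append(f"record {i}: missing 'text' key")
        elif not isinstance(rec["text"], str) or not rec["text"].strip():
            problems.append(f"record {i}: 'text' must be a non-empty string")
    return problems


good = [{"text": "Question: What is 2+2? Answer: 4"}]
bad = [{"text": ""}, {"prompt": "hi"}, "just a string"]

print(validate_records(good))  # []
for problem in validate_records(bad):
    print(problem)
```

Running a check like this first separates format mistakes from trainer or model problems.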

Mechanically, SFTTrainer expects a Dataset object from the datasets library, not a plain Python list. When you pass it a dataset with a text column, it tokenizes each example and trains the model to predict the next token in that sequence. The trainer handles batching, tokenization, and masking automatically. You don't need to write custom collate functions or preprocessing: the SFTConfig defaults cover them.
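The "predict the next token" objective can be sketched without any libraries. The snippet below uses whitespace splitting as a stand-in for a real tokenizer (an assumption for illustration; real tokenizers produce subword IDs) and shows that the target at each position is simply the token that follows. The function name next_token_pairs is made up for this sketch.

```python
def next_token_pairs(text):
    """Pair each growing context with the token that follows it (toy tokenizer)."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("Question: What is 2+2? Answer: 4")
for context, target in pairs:
    print(f"{' '.join(context)!r} -> {target!r}")
```

In a real run this shift happens internally on token IDs, so you never build these pairs yourself.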

Start with 5–10 examples in a single file. This teaches you whether your environment works, your model loads, and your format is correct. Then scale to real data.

Analogy

Like writing a unit test before building the whole application: you're not testing the real scenario yet, you're just confirming the plumbing works.

Code

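Illustrative only - no API key is needed, but running it downloads the phi-2 weights and needs enough RAM to hold a 2.7B-parameter model in float32 (roughly 11 GB)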
python
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
import torch

# Create a minimum viable dataset: list of dicts with 'text' key
raw_data = [
    {"text": "Question: What is 2+2? Answer: 4"},
    {"text": "Question: What is the capital of France? Answer: Paris"},
    {"text": "Question: What is 10-3? Answer: 7"},
    {"text": "Question: What is the capital of Japan? Answer: Tokyo"},
    {"text": "Question: What is 5*6? Answer: 30"},
]

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict({"text": [ex["text"] for ex in raw_data]})

print(f"Dataset loaded: {len(dataset)} examples")
print(f"First example: {dataset[0]['text'][:60]}...")

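# Load a small model for demo (2.7B parameters)
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Reuse EOS as the padding token if the tokenizer does not define one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token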
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="cpu",
    trust_remote_code=True
)

# Configure SFT (Supervised Fine-Tuning)
training_args = SFTConfig(
    output_dir="./mvd_output",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    max_seq_length=256,
    learning_rate=5e-5,
    logging_steps=1,
)

# Create trainer with minimal dataset
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

print("\nTrainer initialized successfully")
print(f"Steps per epoch: {len(trainer.get_train_dataloader())}")
Output
Dataset loaded: 5 examples
First example: Question: What is 2+2? Answer: 4...

Trainer initialized successfully
Steps per epoch: 3

What just happened?

We created a list of 5 text examples, converted it to a Hugging Face Dataset object (which SFTTrainer expects), loaded a tokenizer and model, configured SFTConfig with basic hyperparameters, and instantiated the trainer. The trainer examined the dataset and calculated that 3 batches would be needed per epoch (5 examples ÷ batch size 2 = 2.5 batches, rounded up).
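The batch arithmetic above can be checked with a few lines of Python. This is a sketch that assumes a single device and no gradient accumulation; the helper name steps_per_epoch and its drop_last flag are illustrative, not a trl API.

```python
import math


def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of batches in one pass over the data."""
    if drop_last:
        # An incomplete final batch is discarded
        return num_examples // batch_size
    # An incomplete final batch still counts as one step
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(5, 2))                  # 3
print(steps_per_epoch(5, 2, drop_last=True))  # 2
print(steps_per_epoch(10, 2))                 # 5
```

The last batch in the 5-example case holds a single example, which is why 2.5 rounds up to 3.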

Common gotcha

Developers often try to pass raw Python lists directly to SFTTrainer instead of converting to a Dataset object first. SFTTrainer expects datasets.Dataset, not a list. Also, many assume they need to pre-tokenize: you don't. SFTTrainer does it for you automatically based on max_seq_length.
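The two Dataset constructors expect different shapes: Dataset.from_list takes a list of row dicts, while Dataset.from_dict takes a dict of columns. The pure-Python sketch below shows the reshaping that Dataset.from_dict needs. The helper name rows_to_columns is made up for illustration and assumes every row has the same keys; the commented lines show where the datasets calls would go.

```python
def rows_to_columns(rows):
    """Turn [{'text': a}, {'text': b}] into {'text': [a, b]}."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


raw_data = [
    {"text": "Question: What is 2+2? Answer: 4"},
    {"text": "Question: What is 10-3? Answer: 7"},
]
columns = rows_to_columns(raw_data)
print(columns)

# With the datasets library installed, either call builds the same Dataset:
# from datasets import Dataset
# dataset = Dataset.from_list(raw_data)   # list of row dicts
# dataset = Dataset.from_dict(columns)    # dict of columns
```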

Error recovery

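TypeError or AttributeError mentioning 'list' when the trainer is created (the exact message varies by trl version)
You passed a raw list to train_dataset instead of a Dataset object. Convert a list of dicts with Dataset.from_list(), or build a dict of columns for Dataset.from_dict() as in the code above.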
AssertionError: tokenizer.pad_token is None
SFTTrainer requires a padding token. Add tokenizer.pad_token = tokenizer.eos_token before creating the trainer.
CUDA out of memory
You're using a model too large or batch size too high for your hardware. Reduce per_device_train_batch_size (try 1), reduce max_seq_length, or use a smaller model like phi-2 instead of llama-2-7b.
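The padding-token fix can be wrapped in a small guard so it is safe to call whether or not the tokenizer already defines one. This is a sketch: ensure_pad_token is a hypothetical helper, and the SimpleNamespace object stands in for a real tokenizer so the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Fall back to the end-of-sequence token when no padding token is set."""
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in object for illustration; a real tokenizer comes from AutoTokenizer.
fake_tokenizer = SimpleNamespace(pad_token=None, eos_token="<|endoftext|>")
ensure_pad_token(fake_tokenizer)
print(fake_tokenizer.pad_token)  # <|endoftext|>
```

Call it once, right after loading the tokenizer and before creating the trainer.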

Experienced dev note

The biggest time sink for new fine-tuners is chasing dataset format issues that don't exist. Start with 5 hand-written examples and train for 1 epoch. If that works, your pipeline is solid: scale the data, not the complexity. Also: you cannot see actual training happen with 5 examples and 1 epoch (loss won't meaningfully decrease), but you'll know your code is correct, which is the entire point.

Check your understanding

Why can't you reduce max_seq_length to 32 tokens if your actual text examples are 50 tokens long, and what will happen if you try?

Answer hint

A correct answer explains that SFTTrainer will truncate examples longer than max_seq_length, losing the end of your training data. The size should be at least as long as your longest example, plus some buffer.
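The effect of truncation is easy to see on a toy example. The sketch below again uses whitespace splitting in place of a real tokenizer and assumes truncation keeps the start of the sequence and drops the end (the behavior the hint describes); the function truncate is illustrative only.

```python
def truncate(tokens, max_seq_length):
    """Keep only the first max_seq_length tokens."""
    return tokens[:max_seq_length]


text = "Question: What is the capital of France? Answer: Paris"
tokens = text.split()  # toy tokenizer: 9 whitespace-separated tokens

kept = truncate(tokens, max_seq_length=6)
print(" ".join(kept))
print(f"dropped: {tokens[6:]}")
```

With a limit of 6, the answer itself is cut off, so the model would never see the part of the example you most want it to learn.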

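VERSION The code shown here relies on SFTConfig, which trl added in its 0.9 series (mid-2024); trl 0.7.0 dates from 2023 and does not include it. It also requires transformers >= 4.36.0. Earlier trl versions used TrainingArguments in place of SFTConfig, so on those versions importing SFTConfig raises ImportError. Newer trl releases have renamed some of the arguments used here (for example, tokenizer became processing_class and max_seq_length became max_length), so check the documentation for your installed version.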
NEXT

Next, learn how to prepare real data files (CSV, JSONL) and split them into train/validation sets instead of hard-coding examples.
