Total training time estimates
Why this matters
Fine-tuning on large multi-GPU cloud instances can cost $50–$500+ per hour. Running a 12-hour training job when you only had 2 hours of budget is a production incident. Knowing the time estimate upfront prevents wasted money, failed jobs, and platform blocking.
Explanation
Training time depends on three things: dataset size (how many examples you're training on), model size (how many parameters), and hardware (GPU memory and compute power). A 7B parameter model on 1,000 examples might take 30 minutes. The same model on 100,000 examples might take 10 hours. You need to estimate before you start so you can catch budget overruns.
The mechanical calculation works like this: total tokens to process (dataset tokens multiplied by number of epochs) divided by throughput in tokens per second gives seconds; divide by 3,600 to get hours. You can measure your actual throughput on a short test run with SFTTrainer from trl, then extrapolate to your full dataset. This is more accurate than guessing from blog posts.
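The arithmetic above can be sketched as a small helper. This is a minimal illustration rather than anything provided by trl: the function name `estimate_training_hours` and its parameters are assumptions introduced here, and the numbers in the usage line are hypothetical.

```python
def estimate_training_hours(total_tokens: int,
                            tokens_per_second: float,
                            num_epochs: int = 1) -> float:
    """Estimate wall-clock hours as (tokens * epochs) / throughput / 3600."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_training_tokens = total_tokens * num_epochs
    seconds = total_training_tokens / tokens_per_second
    return seconds / 3600


# Hypothetical inputs: 1,000,000 dataset tokens, 1 epoch, 2,847 tokens/second.
hours = estimate_training_hours(1_000_000, 2847)
print(f"{hours:.2f} hours ({hours * 60:.2f} minutes)")
```

The guard on `tokens_per_second` is there because a zero or negative measurement usually means the timing run itself went wrong, and dividing by it would hide that.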
Use this when: you're about to launch a fine-tuning job on rented hardware (Lambda Labs, Paperspace, cloud GPUs), you have a strict budget or time deadline, or you're deciding between two different model sizes and need to know which one fits your constraints.
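For the "which model fits my constraints" decision, one possible sketch is a budget check over candidate estimates. Everything here is illustrative: `fits_budget`, the candidate names, the estimated hours, and the limits are hypothetical values chosen for the example, not measurements.

```python
def fits_budget(estimated_hours: float,
                hourly_rate: float,
                max_hours: float,
                max_cost: float) -> bool:
    """Return True if the job stays within both the time and the cost limit."""
    cost = estimated_hours * hourly_rate
    return estimated_hours <= max_hours and cost <= max_cost


# Hypothetical estimated hours for two candidate model sizes.
candidates = {"smaller-model": 1.5, "larger-model": 6.0}

for name, hours in candidates.items():
    ok = fits_budget(hours, hourly_rate=1.00, max_hours=4.0, max_cost=5.00)
    print(f"{name}: {hours:.1f} h -> {'fits' if ok else 'exceeds budget'}")
```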
Analogy
Training time estimation is like calculating how long a road trip takes: you need to know the distance (dataset size), your speed (GPU throughput), and then you can predict arrival time. If you don't check beforehand, you might start a 12-hour drive thinking it's 2 hours.
Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset
import time
model_name = "meta-llama/Llama-2-7b-hf"  # Transformers-format checkpoint (gated: request access on the Hub first)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
training_texts = [
"The quick brown fox jumps over the lazy dog. This is training example one.",
"Machine learning models learn patterns from data. This is training example two.",
"Fine-tuning adapts a pretrained model to your specific task. This is training example three.",
"GPU memory is often the bottleneck in deep learning. This is training example four.",
"Tokens are the chunks that language models process. This is training example five.",
] * 200
dataset = Dataset.from_dict({"text": training_texts})
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
total_tokens = sum(len(sample["input_ids"]) for sample in tokenized_dataset)
print(f"Total tokens in dataset: {total_tokens:,}")
batch_size = 4
num_epochs = 1
total_training_samples = len(tokenized_dataset) * num_epochs
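# Time forward passes over up to 50 examples and count the tokens actually processed.
# Note: this is forward-only (no backprop), so it overestimates real training throughput.
num_timed = min(50, len(tokenized_dataset))
tokens_in_sample = 0
if torch.cuda.is_available():
    torch.cuda.synchronize()  # finish pending GPU work before starting the timer
start_time = time.time()
for i in range(num_timed):
    input_ids = torch.tensor([tokenized_dataset[i]["input_ids"]]).to(model.device)
    with torch.no_grad():
        _ = model(input_ids)
    tokens_in_sample += input_ids.shape[1]  # real sequence length, not the 256 cap
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for queued kernels so the elapsed time is real
measured_time = time.time() - start_time
throughput_tokens_per_second = tokens_in_sample / measured_time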
total_training_tokens = total_tokens * num_epochs
estimated_seconds = total_training_tokens / throughput_tokens_per_second
estimated_hours = estimated_seconds / 3600
estimated_minutes = estimated_seconds / 60
print(f"Measured throughput: {throughput_tokens_per_second:.0f} tokens/second")
print(f"Total training tokens (with {num_epochs} epoch(s)): {total_training_tokens:,}")
print(f"Estimated training time: {estimated_hours:.2f} hours ({estimated_minutes:.2f} minutes)")
print(f"\nAt $0.50/hour GPU cost: ${estimated_hours * 0.50:.2f}")

Sample output (illustrative figures; yours will depend on your dataset and GPU):
Total tokens in dataset: 1,000,000
Measured throughput: 2847 tokens/second
Total training tokens (with 1 epoch(s)): 1,000,000
Estimated training time: 0.10 hours (5.83 minutes)

At $0.50/hour GPU cost: $0.05
What just happened?
The code created a synthetic dataset of 1,000 training examples, tokenized all of them, timed forward passes over 50 of those examples to measure throughput, then divided total dataset tokens by that throughput to get seconds and hours. In the sample output, that came to about 5.8 minutes and $0.05 at the $0.50/hour rate assumed in the code. On a real dataset with 100k examples and multiple epochs, the estimate scales linearly with the total number of tokens processed.
Common gotcha
Developers measure throughput on a tiny test batch (like 50 examples), then assume that rate stays constant. It doesn't: once you hit gradient accumulation, mixed precision, and larger batches, throughput often drops 20–40%. Always add a 1.5× safety multiplier to your estimate, or measure on a batch size that matches your actual training config. Also: estimated hours assumes zero interruptions. Cloud GPUs get preempted. Budget 2× the estimate for safety on preemptible hardware.
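The padding rules above can be applied mechanically. The sketch below uses the two multipliers from this section (1.5× as a general safety margin, 2× on preemptible hardware); the function name `padded_estimate_hours` and the `preemptible` flag are assumptions introduced for illustration.

```python
def padded_estimate_hours(raw_hours: float, preemptible: bool = False) -> float:
    """Pad a raw time estimate: 2x on preemptible hardware, 1.5x otherwise."""
    multiplier = 2.0 if preemptible else 1.5
    return raw_hours * multiplier


raw = 4.0  # hypothetical raw estimate in hours
print(f"On-demand budget:   {padded_estimate_hours(raw):.1f} hours")
print(f"Preemptible budget: {padded_estimate_hours(raw, preemptible=True):.1f} hours")
```

Budget against the padded figure rather than the raw one when deciding whether a job fits your time or cost limit.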
Error recovery
CUDA out of memory
Throughput wildly slow (< 100 tokens/sec)
Total time estimate shows 0.0 hours
Experienced dev note
The biggest mistake junior devs make is extrapolating from forward passes alone. A quick forward pass might suggest 1,000 tokens/sec, but training with SFTTrainer also includes backprop, gradient sync, and logging: expect real throughput to be 3–5× slower. Instead, always run a 100–200 step warm-up with your actual training config and measure wall-clock time, then extrapolate. This takes a couple of minutes but saves you from massive estimate errors. Also: measure on the exact hardware type you'll use. Throughput can differ severalfold between GPUs; a 7B model on an RTX 4090 can be roughly 3× faster than on a V100. Don't benchmark on your laptop and run on Lambda Labs.
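One way to time real training steps is a small wall-clock harness around whatever performs a step in your setup. This is a sketch under assumptions: `measure_throughput` and `fake_step` are hypothetical names, and `step_fn` stands in for a function that runs one full training step (forward, backward, optimizer update) and returns the number of tokens it processed. On a GPU, the step should end with something that forces synchronization, such as reading the loss value with `loss.item()`, or the timer will stop before the work is done.

```python
import time
from typing import Callable


def measure_throughput(step_fn: Callable[[], int],
                       num_steps: int = 100,
                       warmup_steps: int = 10) -> float:
    """Return tokens/second over num_steps calls to step_fn, after a warm-up."""
    for _ in range(warmup_steps):
        step_fn()  # warm-up steps are run but not timed
    tokens = 0
    start = time.perf_counter()
    for _ in range(num_steps):
        tokens += step_fn()
    elapsed = time.perf_counter() - start
    return tokens / elapsed


def fake_step() -> int:
    """Stand-in for a real training step: sleeps briefly, reports 4 x 256 tokens."""
    time.sleep(0.001)
    return 4 * 256


print(f"{measure_throughput(fake_step, num_steps=20, warmup_steps=2):.0f} tokens/second")
```

Replace `fake_step` with a closure over your own training loop, using the same batch size, sequence length, and precision settings as the real job.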
Check your understanding
You've measured throughput at 3,000 tokens/second on your GPU. Your dataset has 500,000 total tokens. If you train for 3 epochs (processing each token 3 times), how many hours will training take? And if cloud GPU costs $1.00/hour, what's your total compute cost?
Show answer hint
Multiply dataset tokens by number of epochs to get total training tokens (500,000 × 3 = 1,500,000). Then divide total training tokens by throughput in tokens/second to get seconds (1,500,000 / 3,000 = 500), then divide by 3,600 to get hours. Multiply hours by hourly cost. The answer should be around 0.14 hours (about 8.3 minutes) and $0.14 cost: if you got a different order of magnitude, recheck your arithmetic on epochs.