Code Beginner easy · 6 min

Total training time estimates

What you will learn
Calculate and estimate how long your LLM fine-tuning job will actually take before you commit compute resources.

Why this matters

Fine-tuning can cost $50–$500+ per hour in cloud compute. Running a 12-hour training job when you only had 2 hours of budget is a production incident. Knowing the time estimate upfront prevents wasted money, failed jobs, and platform blocking.

Skip if: You don't need time estimates when fine-tuning on your local machine for personal experimentation, or when your organization has unlimited compute and doesn't care about wall-clock time. You can also skip this if you're using a managed fine-tuning API (like OpenAI's fine-tuning endpoint) that charges per token, not per hour.

Explanation

Training time depends on three things: dataset size (how many examples you're training on), model size (how many parameters), and hardware (GPU memory and compute power). A 7B parameter model on 1,000 examples might take 30 minutes. The same model on 100,000 examples might take 10 hours. You need to estimate before you start so you can catch budget overruns.

The mechanical calculation works like this: total tokens in your dataset (multiplied by the number of epochs) divided by tokens processed per second (throughput) gives you seconds; divide that by 3,600 to get hours. A short test run with SFTTrainer from trl lets you measure your actual throughput, and you then extrapolate to your full dataset. This is more accurate than guessing from blog posts.
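
The calculation above can be sketched as a small helper. The function name `estimate_training_hours` and the sample inputs are illustrative assumptions, not part of trl:

```python
def estimate_training_hours(total_tokens: int, tokens_per_second: float, num_epochs: int = 1) -> float:
    """Return estimated wall-clock hours: (tokens * epochs) / throughput / 3600."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_training_tokens = total_tokens * num_epochs
    seconds = total_training_tokens / tokens_per_second
    return seconds / 3600


# Hypothetical inputs: 2,000,000 dataset tokens, 1,000 tokens/second, 2 epochs.
hours = estimate_training_hours(2_000_000, 1_000.0, num_epochs=2)
print(f"Estimated training time: {hours:.2f} hours")
```

Note that throughput sits in the denominator: doubling throughput halves the estimate, while doubling tokens or epochs doubles it.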

Use this when: you're about to launch a fine-tuning job on rented hardware (Lambda Labs, Paperspace, cloud GPUs), you have a strict budget or time deadline, or you're deciding between two different model sizes and need to know which one fits your constraints.
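
As a rough sketch of the model-comparison case, the snippet below checks two candidate configurations against a time and cost budget. All names, throughput figures, rates, and budget limits are hypothetical placeholders; substitute throughput you measured yourself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tokens_per_second: float  # measured throughput for this model on your hardware


def fits_budget(candidate, total_training_tokens, hourly_rate, max_hours, max_cost):
    """Return (hours, cost, ok) for one candidate under the given limits."""
    hours = total_training_tokens / candidate.tokens_per_second / 3600
    cost = hours * hourly_rate
    return hours, cost, hours <= max_hours and cost <= max_cost


# Hypothetical numbers for illustration only.
candidates = [Candidate("small-model", 4_000.0), Candidate("large-model", 500.0)]
for c in candidates:
    hours, cost, ok = fits_budget(c, total_training_tokens=18_000_000,
                                  hourly_rate=2.00, max_hours=4.0, max_cost=10.0)
    print(f"{c.name}: {hours:.2f} h, ${cost:.2f}, fits budget: {ok}")
```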

Analogy

Training time estimation is like calculating how long a road trip takes: you need to know the distance (dataset size), your speed (GPU throughput), and then you can predict arrival time. If you don't check beforehand, you might start a 12-hour drive thinking it's 2 hours.

Code

Illustrative only - not runnable without a GPU and a Hugging Face access token approved for the gated Llama 2 weights
python
import time

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# The "-hf" repository holds the transformers-format weights (gated; access must be approved).
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Synthetic dataset: 5 sentences repeated 200 times = 1,000 examples.
training_texts = [
    "The quick brown fox jumps over the lazy dog. This is training example one.",
    "Machine learning models learn patterns from data. This is training example two.",
    "Fine-tuning adapts a pretrained model to your specific task. This is training example three.",
    "GPU memory is often the bottleneck in deep learning. This is training example four.",
    "Tokens are the chunks that language models process. This is training example five.",
] * 200

dataset = Dataset.from_dict({"text": training_texts})

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

total_tokens = sum(len(sample["input_ids"]) for sample in tokenized_dataset)
print(f"Total tokens in dataset: {total_tokens:,}")

num_epochs = 1
gpu_cost_per_hour = 0.50  # example rate; replace with your provider's price
num_probe_samples = min(50, len(tokenized_dataset))

# Time forward passes over a few samples and count the tokens actually processed.
# Forward-only timing overstates training throughput (no backward pass or optimizer step).
tokens_in_sample = 0
start_time = time.perf_counter()
for i in range(num_probe_samples):
    input_ids = torch.tensor([tokenized_dataset[i]["input_ids"]]).to(model.device)
    with torch.no_grad():
        _ = model(input_ids)
    tokens_in_sample += input_ids.shape[1]
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock

measured_time = time.perf_counter() - start_time
throughput_tokens_per_second = tokens_in_sample / measured_time

# Extrapolate: total tokens / throughput = seconds.
total_training_tokens = total_tokens * num_epochs
estimated_seconds = total_training_tokens / throughput_tokens_per_second
estimated_hours = estimated_seconds / 3600
estimated_minutes = estimated_seconds / 60

print(f"Measured throughput: {throughput_tokens_per_second:.0f} tokens/second")
print(f"Total training tokens (with {num_epochs} epoch(s)): {total_training_tokens:,}")
print(f"Estimated training time: {estimated_hours:.2f} hours ({estimated_minutes:.0f} minutes)")
print(f"\nAt ${gpu_cost_per_hour:.2f}/hour GPU cost: ${estimated_hours * gpu_cost_per_hour:.2f}")
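Output (illustrative: these figures assume a 1,000,000-token dataset; the 1,000 short examples above tokenize to far fewer tokens, and your throughput will differ)
Total tokens in dataset: 1,000,000
Measured throughput: 2847 tokens/second
Total training tokens (with 1 epoch(s)): 1,000,000
Estimated training time: 0.10 hours (6 minutes)

At $0.50/hour GPU cost: $0.05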

What just happened?

The code created a synthetic dataset of 1,000 training examples, tokenized all of them, timed forward passes over 50 of them to measure throughput, then divided total dataset tokens by that throughput to get seconds and hours. With the figures shown, that comes to about 6 minutes and $0.05 at the example rate of $0.50/hour. Because the measurement is forward-only, treat it as an optimistic lower bound on real training time. On a real dataset with 100k examples and multiple epochs, the estimate scales linearly with total training tokens.

Common gotcha

Developers measure throughput on a tiny test batch (like 50 examples), then assume that rate stays constant. It doesn't: once your real training config is in play (backward passes, optimizer steps, gradient accumulation, padding in larger batches), effective throughput often drops 20–40%. Always add a 1.5× safety multiplier to your estimate, or measure with a batch size and settings that match your actual training config. The estimate also assumes zero interruptions, and cloud GPUs get preempted, so budget 2× the estimate on preemptible hardware.
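
A minimal sketch of the padding rule described above, using the 1.5× and 2× multipliers from this section. The function name, and the choice to use the larger multiplier on preemptible hardware rather than stacking both, are assumptions.

```python
def padded_estimate(raw_hours: float, preemptible: bool = False) -> float:
    """Apply a safety multiplier to a raw training-time estimate."""
    # 1.5x covers the throughput drop under the real training config;
    # 2x is the budget suggested for preemptible hardware.
    multiplier = 2.0 if preemptible else 1.5
    return raw_hours * multiplier


raw = 4.0  # hypothetical raw estimate in hours
print(f"On-demand budget:   {padded_estimate(raw):.1f} hours")
print(f"Preemptible budget: {padded_estimate(raw, preemptible=True):.1f} hours")
```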

Error recovery

CUDA out of memory
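Your batch size is too large for your GPU. Reduce per_device_train_batch_size in SFTConfig (for example, from 32 to 8) and, if you want to keep the same effective batch size, raise gradient_accumulation_steps to compensate. Each step handles fewer examples, but the total number of tokens processed is unchanged, so the overall time estimate shifts only slightly.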
Throughput wildly slow (< 100 tokens/sec)
The model is probably running on CPU instead of GPU. Check that device_map="auto" is passed to AutoModelForCausalLM.from_pretrained(), and verify that torch.cuda.is_available() returns True.
Total time estimate shows 0.0 hours
Your dataset is empty or tokenization failed silently. Print total_tokens to debug. Use tokenized_dataset[0] to inspect one sample and ensure input_ids is not empty.
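
The last two checks can be folded into a small guard that runs before you compute an estimate. This is a sketch: the function name is hypothetical, and the 100 tokens/second threshold is simply the symptom value quoted above.

```python
def check_estimate_inputs(total_tokens: int, tokens_per_second: float) -> list[str]:
    """Return human-readable warnings; an empty list means the inputs look sane."""
    warnings = []
    if total_tokens <= 0:
        warnings.append("Dataset has no tokens: inspect tokenized_dataset[0] for empty input_ids.")
    if tokens_per_second <= 0:
        warnings.append("Throughput is not positive: the timing run did not measure anything.")
    elif tokens_per_second < 100:
        warnings.append("Throughput below 100 tokens/sec: the model may be running on CPU.")
    return warnings


# Hypothetical bad inputs to show both warnings firing.
for message in check_estimate_inputs(total_tokens=0, tokens_per_second=42.0):
    print("WARNING:", message)
```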

Experienced dev note

The biggest mistake junior devs make is extrapolating from a single forward pass. A forward-only measurement looks fast, but SFTTrainer also runs backprop, gradient sync, and logging: expect real training to be 3–5× slower. Instead, always run a 100–200 step warm-up with your actual training config and measure wall-clock time, then extrapolate. This takes about 2 minutes but saves you from massive estimate errors. Also: measure on the exact hardware type you'll use. Throughput for the same 7B model can differ severalfold between GPU generations (an RTX 4090 versus a V100, for example). Don't benchmark on your laptop and run on Lambda Labs.
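
One way to sketch the warm-up approach is to time a fixed number of real training steps and scale up to the full run. Here `run_one_step` is a hypothetical stand-in for whatever performs a single optimizer step in your setup (it is not a trl function), and the step counts are placeholders.

```python
import time
from typing import Callable


def extrapolate_hours(run_one_step: Callable[[], None], warmup_steps: int, total_steps: int,
                      skip_first: int = 10) -> float:
    """Time warmup_steps steps, ignore the first skip_first, and extrapolate to total_steps."""
    if warmup_steps <= skip_first:
        raise ValueError("warmup_steps must exceed skip_first")
    for _ in range(skip_first):
        run_one_step()  # discard early steps (initialization, caching, kernel compilation)
    start = time.perf_counter()
    for _ in range(warmup_steps - skip_first):
        run_one_step()
    seconds_per_step = (time.perf_counter() - start) / (warmup_steps - skip_first)
    return seconds_per_step * total_steps / 3600


def fake_step() -> None:
    time.sleep(0.001)  # stand-in for one real training step


print(f"Estimated: {extrapolate_hours(fake_step, warmup_steps=100, total_steps=50_000):.3f} hours")
```

For this to be meaningful, `run_one_step` should block until the step has actually finished on the GPU, and `total_steps` should reflect your real schedule (examples × epochs divided by the effective batch size).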

Check your understanding

You've measured throughput at 3,000 tokens/second on your GPU. Your dataset has 500,000 total tokens. If you train for 3 epochs (processing each token 3 times), how many hours will training take? And if cloud GPU costs $1.00/hour, what's your total compute cost?

Answer hint

Multiply dataset tokens by number of epochs to get total training tokens (500,000 × 3 = 1,500,000). Then divide total training tokens by throughput in tokens/second to get seconds (1,500,000 / 3,000 = 500), then divide by 3,600 to get hours. Multiply hours by hourly cost. The answer should be around 0.14 hours (a little over 8 minutes) and about $0.14: if you got a different order of magnitude, recheck your arithmetic on epochs.

VERSION In trl < 0.8.0, SFTTrainer did not have a built-in progress tracker. Use trl >= 1.0.0 (current), which logs tokens/second automatically in the training loop, making estimation much easier without manual measurement.
NEXT

Once you know your training time, you'll want to set up early stopping to avoid training longer than necessary: learn how to monitor validation loss and halt when it plateaus.
