Code Beginner easy · 4 min

Ignoring latency and cost changes

What you will learn

Fine-tuning changes your model's inference speed and token usage, so you must measure both before and after to catch surprises.

Why this matters

A fine-tuned model that cuts your input token count by 30% but adds 3 seconds of latency per request can make your product worse, not better. Developers often measure only accuracy and miss the operational impact.

Skip if: you're fine-tuning purely for research or for a one-time batch job that will never serve production traffic, so latency doesn't matter. If you're replacing a production model, you must measure.

Explanation

When you fine-tune an LLM, three metrics change: accuracy (which everyone measures), latency (how long inference takes), and token efficiency (how many tokens your model uses for the same task). Most developers obsess over accuracy and ignore the other two.

Latency changes for a few reasons. Full fine-tuning keeps the architecture and parameter count the same, so the time per generated token barely moves; most of the difference comes from the fine-tuned model producing longer or shorter outputs. Serving choices matter too: LoRA adapters that are left unmerged add extra computation per layer, while quantization can speed inference up. Token efficiency changes because your fine-tuned model may learn to be more verbose or more concise than the base model. If your original model uses 100 tokens per response at $0.001/token and your fine-tuned model uses 150 tokens at the same price, your per-request cost has risen by 50%, even if accuracy improved by 5%.
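The cost arithmetic above can be checked in a few lines. This is a minimal sketch: `cost_per_request` is a hypothetical helper, and the flat $0.001/token price is the illustrative figure from the paragraph, not a real provider's rate.

```python
def cost_per_request(output_tokens: int, price_per_token: float) -> float:
    """Cost of one response, assuming a flat per-token price (illustrative)."""
    return output_tokens * price_per_token


PRICE = 0.001  # dollars per token, the example figure used in the text

base_cost = cost_per_request(100, PRICE)   # base model: 100 tokens per response
tuned_cost = cost_per_request(150, PRICE)  # fine-tuned model: 150 tokens per response
increase_pct = (tuned_cost - base_cost) / base_cost * 100

print(f"Base: ${base_cost:.2f}  Fine-tuned: ${tuned_cost:.2f}  Change: {increase_pct:+.0f}%")
```

Multiply the per-request difference by your monthly request volume to see whether the accuracy gain is worth the extra spend.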

Establish a baseline (latency and token count from the original model) before fine-tuning, then measure the same metrics on your fine-tuned model after training. If latency exceeds your SLA or token count grows beyond what your cost budget allows, you need to adjust: add quantization, cap output length, retrain on more concise examples, or reconsider whether fine-tuning is the right solution.
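One way to make that comparison explicit is a small budget check. The sketch below is an assumption about how you might structure it: `Measurement` and `check_budget` are hypothetical names, and the fine-tuned numbers, SLA, and allowed token increase are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    latency_ms: float     # wall-clock time per request
    output_tokens: float  # average generated tokens per request


def check_budget(baseline, tuned, sla_ms, max_token_increase_pct):
    """Return a list of budget violations; an empty list means the model passes."""
    problems = []
    if tuned.latency_ms > sla_ms:
        problems.append(
            f"latency {tuned.latency_ms:.0f}ms exceeds SLA of {sla_ms:.0f}ms"
        )
    token_change_pct = (
        (tuned.output_tokens - baseline.output_tokens) / baseline.output_tokens * 100
    )
    if token_change_pct > max_token_increase_pct:
        problems.append(
            f"output tokens up {token_change_pct:.0f}%, "
            f"budget allows {max_token_increase_pct:.0f}%"
        )
    return problems


baseline = Measurement(latency_ms=245.67, output_tokens=92)
tuned = Measurement(latency_ms=310.0, output_tokens=120)  # hypothetical result
problems = check_budget(baseline, tuned, sla_ms=400, max_token_increase_pct=20)
print(problems or "within budget")
```

Here the hypothetical fine-tuned model stays inside the latency SLA but fails the token budget, which is exactly the kind of regression an accuracy-only evaluation misses.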

Analogy

Upgrading a car engine for more power: you measure 0-60 times and fuel economy before and after. If the new engine is 10% faster but uses 40% more gas and your budget only allows 20% more fuel spend, the upgrade backfires despite being 'better.'

Code

Runs locally with the torch and transformers packages installed; no API key is needed. The first run downloads the GPT-2 weights from the Hugging Face Hub.

python

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

if torch.cuda.is_available():
    model = model.to("cuda")

test_prompt = "Explain quantum computing in simple terms:"
input_ids = tokenizer.encode(test_prompt, return_tensors="pt")
if torch.cuda.is_available():
    input_ids = input_ids.to("cuda")

print(f"Baseline model: {model_name}")
print(f"Input tokens: {input_ids.shape[1]}")

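start_time = time.perf_counter()  # monotonic, high-resolution timer
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=100,                       # total length: prompt + generated tokens
        num_return_sequences=1,
        do_sample=False,                      # greedy decoding keeps runs comparable;
                                              # temperature is ignored unless do_sample=True
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
    )
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
latency_ms = (time.perf_counter() - start_time) * 1000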

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
output_tokens = output.shape[1] - input_ids.shape[1]

print(f"Output tokens: {output_tokens}")
print(f"Latency: {latency_ms:.2f}ms")
print(f"Tokens per second: {output_tokens / (latency_ms / 1000):.1f}")
print(f"\nGenerated text (first 150 chars):\n{generated_text[:150]}...")

Output

Baseline model: gpt2
Input tokens: 8
Output tokens: 92
Latency: 245.67ms
Tokens per second: 374.5

Generated text (first 150 chars):
Explain quantum computing in simple terms: 

Quantum computing is a new form of computing that uses the properties of quantum mechanics to process information. It is based on the principles of superposition and entanglement...

What just happened?

We loaded GPT-2, encoded a test prompt into 8 tokens, generated 92 new tokens, and measured the wall-clock time and output token count. This baseline gives us numbers to compare against after fine-tuning. The actual latency on your hardware will differ, but the pattern is what matters: measure input tokens, output tokens, and milliseconds.
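A single timed call is noisy, so a baseline is usually more trustworthy when it comes from several runs after a warm-up. The helper below is a minimal sketch of that idea; `measure_latency` is a hypothetical name, the nearest-rank p95 is one of several reasonable percentile definitions, and the `time.sleep` workload is only a stand-in for your own generate() call.

```python
import statistics
import time


def measure_latency(fn, warmup=2, runs=10):
    """Time fn() over several runs and return median and p95 in milliseconds."""
    for _ in range(warmup):
        fn()  # untimed calls absorb one-time setup cost

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)

    samples.sort()
    # Nearest-rank style index into the sorted samples
    p95_index = min(len(samples) - 1, round(0.95 * (len(samples) - 1)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "runs": runs,
    }


# Stand-in workload so the sketch runs anywhere; swap in your generate() call
stats = measure_latency(lambda: time.sleep(0.01), warmup=1, runs=5)
print(stats)
```

On a GPU, make sure the function you pass in calls torch.cuda.synchronize() before it returns, otherwise the timer can stop while kernels are still running.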

Common gotcha

The most common mistake: developers fine-tune a model for 5 epochs, see a 3% accuracy improvement, ship it to production, and only then notice that API response time has gone from 500ms to 2000ms or that the monthly token bill has doubled. They never measured latency or token count before fine-tuning, so they have no baseline to compare against. By then they've already committed to the new model, and debugging is painful.
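A simple guard against this is to write the baseline to disk before training starts, so the comparison is still possible weeks later. This is a sketch under assumptions: `save_baseline` and `load_baseline` are hypothetical helpers, the JSON field names are arbitrary, and the numbers are the ones from the sample output above.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(path, model_name, latency_ms, output_tokens):
    """Persist baseline metrics as JSON so they survive until after training."""
    record = {
        "model": model_name,
        "latency_ms": latency_ms,
        "output_tokens": output_tokens,
    }
    Path(path).write_text(json.dumps(record, indent=2))
    return record


def load_baseline(path):
    """Read the metrics written by save_baseline."""
    return json.loads(Path(path).read_text())


# A temp directory keeps the example self-contained; use a versioned path in practice
baseline_path = Path(tempfile.gettempdir()) / "baseline_metrics.json"
save_baseline(baseline_path, "gpt2", latency_ms=245.67, output_tokens=92)
loaded = load_baseline(baseline_path)
print(loaded)
```

Checking the file into the same repository as your training config keeps the baseline tied to the model version it describes.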

Error recovery

RuntimeError: CUDA out of memory

You're running the model on GPU but it's too large. Move to CPU with model.to('cpu') or use a smaller model name like 'distilgpt2'. Measure latency on the same hardware you'll use in production.

TypeError: 'NoneType' object is not subscriptable

Tokenizer didn't encode the prompt properly. Ensure you're passing a string to tokenizer.encode(), not None. Add a check: assert test_prompt is not None and len(test_prompt) > 0.

KeyError when decoding

You used skip_special_tokens=True but the tokenizer config is missing. Verify AutoTokenizer.from_pretrained() downloaded correctly: print(tokenizer.special_tokens_map).

Experienced dev note

Before you spend weeks fine-tuning, run this baseline measurement on your production hardware (not your laptop). I've seen teams fine-tune a model that improved accuracy by 2% but increased latency by 5x because they only tested locally on CPU and then deployed on a GPU cluster. The latency difference came from memory allocation patterns they never profiled. Measure early, measure on prod-like hardware, and set a cost/latency budget before fine-tuning starts. It's the difference between a 2-week project and a 2-month rollback.

Check your understanding

You fine-tune a model and accuracy improves from 87% to 91%. Output token count stays the same, but latency increases from 200ms to 800ms per request. Your SLA requires <400ms latency and your token cost budget is $5000/month. Should you ship this model? Why or why not?

Show answer hint

A correct answer explains that latency violates the SLA (800ms > 400ms), so you should NOT ship without mitigation. It should mention that token efficiency being neutral means cost stays the same, so the problem is pure latency, not cost. It should suggest quantization, layer pruning, or distillation as recovery options, or ask whether the accuracy gain is worth the latency cost.

VERSION This pattern works across transformers 4.x and 5.x. The generate() API and model.to() for device placement are stable. trl 1.x fine-tuning may slightly change token efficiency depending on LoRA rank and alpha, so re-measure after any trainer configuration change.

Next, learn how to actually measure token efficiency changes by comparing your base model's token count to your fine-tuned model's token count on a held-out validation set.
