Ignoring latency and cost changes
Why this matters
A fine-tuned model that cuts your input token count by 30% but adds 3 seconds of latency per request can make your product worse, not better, and more expensive to serve if you pay for compute time. Developers often measure only accuracy and miss the operational impact.
Explanation
When you fine-tune an LLM, three metrics change: accuracy (which everyone measures), latency (how long inference takes), and token efficiency (how many tokens your model uses for the same task). Most developers obsess over accuracy and ignore the other two.
Latency rarely changes just because the weight values changed: a fully fine-tuned model with the same architecture and precision does roughly the same arithmetic per token as the base model. It changes because of what else changes with it. Adapter layers that are not merged back into the model add work per token, quantization can reduce it, and a model that writes longer responses takes longer to finish. Token efficiency changes because your fine-tuned model may learn to be more verbose or more concise than the base model. If your original model uses 100 tokens per response at $0.001/token, and your fine-tuned model uses 150 tokens at the same price, you've just increased your per-request cost by 50%, even if accuracy improved by 5%.
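The cost arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a flat price per output token; the helper names `cost_per_request` and `percent_change` are made up for illustration, and real pricing often charges input and output tokens at different rates.

```python
def cost_per_request(output_tokens: int, price_per_token: float) -> float:
    """Cost of one response, assuming a flat price per output token."""
    return output_tokens * price_per_token


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Numbers from the example in the text: 100 vs. 150 tokens at $0.001/token.
base_cost = cost_per_request(100, 0.001)
tuned_cost = cost_per_request(150, 0.001)
increase = percent_change(base_cost, tuned_cost)

print(f"Base: ${base_cost:.3f}  Fine-tuned: ${tuned_cost:.3f}  Change: {increase:+.0f}%")
```

Multiply the per-request figure by your monthly request volume to see whether the change fits your budget.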
You should establish a baseline (latency and token count from the original model) before fine-tuning, then measure the same metrics on your fine-tuned model after training. If latency increases beyond your SLA or token count increases more than your cost budget allows, you need to adjust: add quantization, reduce fine-tuning epochs, or reconsider whether fine-tuning is the right solution.
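One way to make the before/after comparison concrete is a small check that runs against budgets you set in advance. The sketch below is illustrative only: the `Metrics` class, the `check_budget` function, and every number in it are hypothetical, not taken from a real run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    latency_ms: float    # wall-clock time for one request
    output_tokens: int   # tokens generated for the same test prompt


def check_budget(baseline: Metrics, tuned: Metrics,
                 sla_ms: float, max_token_increase_pct: float) -> List[str]:
    """Return a list of budget violations; an empty list means the model fits."""
    problems = []
    if tuned.latency_ms > sla_ms:
        problems.append(
            f"latency {tuned.latency_ms:.0f}ms exceeds SLA of {sla_ms:.0f}ms"
        )
    token_change_pct = (
        (tuned.output_tokens - baseline.output_tokens) / baseline.output_tokens * 100
    )
    if token_change_pct > max_token_increase_pct:
        problems.append(
            f"output tokens up {token_change_pct:.0f}%, "
            f"budget allows {max_token_increase_pct:.0f}%"
        )
    return problems


# Hypothetical measurements and budgets, for illustration only.
baseline = Metrics(latency_ms=250.0, output_tokens=100)
tuned = Metrics(latency_ms=610.0, output_tokens=150)

for problem in check_budget(baseline, tuned, sla_ms=500.0, max_token_increase_pct=20.0):
    print("VIOLATION:", problem)
```

Writing the budgets down as arguments forces the decision about acceptable latency and cost to happen before training, not after deployment.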
Analogy
Upgrading a car engine for more power: you measure 0-60 times and fuel economy before and after. If the new engine is 10% faster but uses 40% more gas and your budget only allows 20% more fuel spend, the upgrade backfires despite being 'better.'
Code
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

test_prompt = "Explain quantum computing in simple terms:"
# Calling the tokenizer returns input_ids plus the attention_mask generate() expects.
inputs = tokenizer(test_prompt, return_tensors="pt").to(device)
input_tokens = inputs["input_ids"].shape[1]

print(f"Baseline model: {model_name}")
print(f"Input tokens: {input_tokens}")

if device == "cuda":
    torch.cuda.synchronize()  # finish pending GPU work before starting the clock
start_time = time.perf_counter()
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=100,  # counts prompt tokens plus generated tokens
        num_return_sequences=1,
        do_sample=False,  # greedy decoding keeps runs repeatable; temperature only applies when sampling
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
if device == "cuda":
    torch.cuda.synchronize()  # wait for the GPU before stopping the clock
latency_ms = (time.perf_counter() - start_time) * 1000

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
output_tokens = output.shape[1] - input_tokens

print(f"Output tokens: {output_tokens}")
print(f"Latency: {latency_ms:.2f}ms")
print(f"Tokens per second: {output_tokens / (latency_ms / 1000):.1f}")
print(f"\nGenerated text (first 150 chars):\n{generated_text[:150]}...")

Output:

Baseline model: gpt2
Input tokens: 8
Output tokens: 92
Latency: 245.67ms
Tokens per second: 374.5

Generated text (first 150 chars):
Explain quantum computing in simple terms: Quantum computing is a new form of computing that uses the properties of quantum mechanics to process information. It is based on the principles of superposition and entanglement...
What just happened?
We loaded GPT-2, encoded a test prompt into 8 tokens, ran generation to produce 92 new tokens, and measured the wall-clock time and the output token count. This baseline gives us numbers to compare against after fine-tuning. The actual latency on your hardware will differ, but the pattern (measuring input tokens, output tokens, and milliseconds) is what matters.
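A single timed run is noisy, since the first call often pays one-time setup costs. A common refinement is to discard a few warm-up runs and report the median and a high percentile over several repeats. The sketch below assumes you wrap your generation call in a zero-argument function; `measure_latency_ms`, `summarize`, and the `fake_generate` stand-in are hypothetical names, and the stand-in only exists so the example runs without a model. On a GPU you would also call `torch.cuda.synchronize()` inside the wrapped function so the clock does not stop before the work finishes.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> List[float]:
    """Call `fn` `warmup` times untimed, then time `runs` calls in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Median and nearest-rank 95th percentile of the latency samples."""
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "runs": len(ordered),
    }


def fake_generate() -> int:
    # Hypothetical stand-in for a call such as model.generate(...).
    return sum(i * i for i in range(50_000))


samples = measure_latency_ms(fake_generate, warmup=2, runs=10)
stats = summarize(samples)
print(f"median: {stats['median_ms']:.2f}ms  p95: {stats['p95_ms']:.2f}ms  runs: {stats['runs']}")
```

Run the same harness with the same prompt against the base and fine-tuned models so the two sets of numbers are directly comparable.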
Common gotcha
The most common mistake: developers fine-tune a model for 5 epochs, see a 3% accuracy improvement, ship it to production, and only then notice that their API response time rose from 500ms to 2000ms or that their monthly token bill doubled. They never measured latency or token count before fine-tuning, so they have no baseline to compare against. By then they have committed to the new model, and debugging is painful.
Error recovery
RuntimeError: CUDA out of memory
TypeError: 'NoneType' object is not subscriptable
KeyError when decoding
Experienced dev note
Before you spend weeks fine-tuning, run this baseline measurement on your production hardware (not your laptop). I've seen teams fine-tune a model that improved accuracy by 2% but increased latency by 5x, and they missed it because they only tested on CPU locally and then deployed to a GPU cluster. The latency difference came from memory allocation patterns they never profiled. Measure early, measure on prod-like hardware, and set a cost/latency budget before fine-tuning starts. It's the difference between a 2-week project and a 2-month rollback.
Check your understanding
You fine-tune a model and accuracy improves from 87% to 91%. Output token count stays the same, but latency increases from 200ms to 800ms per request. Your SLA requires <400ms latency and your token cost budget is $5000/month. Should you ship this model? Why or why not?
Answer hint
A correct answer explains that latency violates the SLA (800ms > 400ms), so you should NOT ship without mitigation. It should mention that token efficiency being neutral means cost stays the same, so the problem is pure latency, not cost. It should suggest quantization, layer pruning, or distillation as recovery options, or ask whether the accuracy gain is worth the latency cost.