Model capability comparison for fine-tuning
Why this matters
Choosing the wrong base model wastes days of training and GPU hours on a model that can't solve your problem at any quality level. A model too weak won't reach production accuracy; a model too strong wastes budget. This decision gates all downstream work.
Explanation
What this step does: You systematically evaluate OpenAI's available fine-tunable models (gpt-4o-mini-2024-07-18, gpt-4-turbo-2024-04-09, and older gpt-3.5-turbo variants) against your specific task using zero-shot and few-shot prompting. You measure performance ceiling, latency, cost per token, and context window fit before any fine-tuning investment.
How to do it: Run your validation dataset through each candidate model using identical prompts at the same temperature and settings. Capture accuracy, token usage, and latency. Create a decision matrix scoring each model on task fit, cost-per-improvement ratio, and production constraints. The model that reaches ~80% of your target accuracy with zero-shot typically has the lowest fine-tuning cost to reach 95%+ accuracy.
What to watch for: Smaller models (gpt-4o-mini) often fine-tune faster and cheaper but may hit a hard accuracy ceiling on complex reasoning tasks. Larger models (gpt-4-turbo) cost more per token and take longer to fine-tune but can reach higher ceilings. If a model fails consistently at zero-shot (below 40% accuracy), fine-tuning rarely recovers more than 20-30 percentage points. In short, gpt-4o-mini is the aggressive cost-optimization choice: it trades inference quality headroom for training speed and suits classification and extraction. gpt-4-turbo is for complex multi-step reasoning where fine-tuning must teach nuanced behavior.
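The decision matrix described above can be sketched as a small scoring function. This is only an illustration: the `Candidate` fields, the 60/40 weighting, and the example numbers are assumptions made for the sketch, not values from OpenAI or from a real evaluation run.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str
    zero_shot_accuracy: float        # fraction in [0, 1], measured on your validation set
    cost_per_1k_requests_usd: float  # measured validation cost scaled to 1k requests
    p95_latency_s: float             # measured latency


def score_candidates(
    candidates: list[Candidate],
    target_accuracy: float,
    max_latency_s: float,
    weights: tuple[float, float] = (0.6, 0.4),  # (task fit, cost); hypothetical weighting
) -> list[tuple[str, float]]:
    """Rank candidates that satisfy the latency constraint, best first."""
    eligible = [c for c in candidates if c.p95_latency_s <= max_latency_s]
    if not eligible:
        return []
    cheapest = min(c.cost_per_1k_requests_usd for c in eligible)
    ranked = []
    for c in eligible:
        # Task fit: how close zero-shot accuracy already is to the target (capped at 1).
        task_fit = min(c.zero_shot_accuracy / target_accuracy, 1.0)
        # Cost score: 1.0 for the cheapest model, smaller for pricier ones.
        if c.cost_per_1k_requests_usd > 0:
            cost_score = cheapest / c.cost_per_1k_requests_usd
        else:
            cost_score = 1.0
        total = weights[0] * task_fit + weights[1] * cost_score
        ranked.append((c.model, round(total, 3)))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical measurements, for illustration only.
candidates = [
    Candidate("small-model", 0.78, 0.05, 0.4),
    Candidate("large-model", 0.86, 3.00, 1.2),
]
ranking = score_candidates(candidates, target_accuracy=0.95, max_latency_s=2.0)
for model, score in ranking:
    print(model, score)
```

Treat the score as a way to order candidates, not as a verdict: if the top-ranked model's likely ceiling sits below your target accuracy, the cost advantage does not matter.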
Code
# pip install openai pandas
import time

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate_model_capability(
    model: str,
    task_name: str,
    validation_samples: list[dict],
    system_prompt: str = "",
) -> dict:
    """
    Run validation dataset through a model and measure performance metrics.

    Args:
        model: Model identifier (e.g., 'gpt-4o-mini-2024-07-18')
        task_name: Human-readable task name for logging
        validation_samples: List of dicts with 'input' and 'expected_output' keys
        system_prompt: Optional system prompt for context

    Returns:
        Dict with accuracy, token usage, estimated cost, latency
    """
    if not validation_samples:
        raise ValueError("validation_samples must not be empty")

    predictions = []
    prompt_tokens = 0
    completion_tokens = 0
    total_latency_s = 0.0

    for sample in validation_samples:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": sample["input"]})

        started = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
            max_tokens=200,
        )
        total_latency_s += time.perf_counter() - started

        # message.content may be None, so fall back to an empty string
        predicted = (response.choices[0].message.content or "").strip()
        expected = sample["expected_output"].strip()
        match = predicted.lower() == expected.lower()  # case-insensitive exact match
        predictions.append({
            "input": sample["input"],
            "expected": expected,
            "predicted": predicted,
            "correct": match,
        })
        # Track input and output tokens separately: they are priced differently
        prompt_tokens += response.usage.prompt_tokens
        completion_tokens += response.usage.completion_tokens

    total_tokens = prompt_tokens + completion_tokens
    accuracy = sum(1 for p in predictions if p["correct"]) / len(predictions)

    # Cost estimation (April 2026 pricing, USD per 1M tokens)
    cost_per_1m_input = {"gpt-4o-mini-2024-07-18": 0.15, "gpt-4-turbo-2024-04-09": 10.0}
    cost_per_1m_output = {"gpt-4o-mini-2024-07-18": 0.60, "gpt-4-turbo-2024-04-09": 30.0}
    if model in cost_per_1m_input:
        estimated_cost = (
            cost_per_1m_input[model] * prompt_tokens
            + cost_per_1m_output[model] * completion_tokens
        ) / 1_000_000
    else:
        estimated_cost = 0.0  # unknown model: no price on file

    return {
        "model": model,
        "task": task_name,
        "accuracy": f"{accuracy:.2%}",
        "total_tokens": total_tokens,
        "avg_tokens_per_sample": f"{total_tokens / len(predictions):.1f}",
        "avg_latency_s": f"{total_latency_s / len(predictions):.2f}",
        "estimated_validation_cost_usd": f"${estimated_cost:.6f}",
        "predictions": predictions,
    }

# Example validation dataset: intent classification
validation_data = [
    {"input": "I want to return my order", "expected_output": "returns"},
    {"input": "How do I track my shipment?", "expected_output": "tracking"},
    {"input": "I was charged twice", "expected_output": "billing"},
    {"input": "The product doesn't work", "expected_output": "technical_support"},
    {"input": "When will it arrive?", "expected_output": "tracking"},
]
system_context = "Classify customer messages into these categories: returns, tracking, billing, technical_support. Output only the category name."

# Evaluate both models
models_to_test = [
    "gpt-4o-mini-2024-07-18",
    "gpt-4-turbo-2024-04-09",
]
results_summary = []
for model_id in models_to_test:
    try:
        result = evaluate_model_capability(
            model=model_id,
            task_name="customer_intent_classification",
            validation_samples=validation_data,
            system_prompt=system_context,
        )
        results_summary.append(result)
    except Exception as e:
        print(f"Error testing {model_id}: {e}")

# Print decision matrix
print("\n=== MODEL CAPABILITY COMPARISON ===")
for result in results_summary:
    print(f"\nModel: {result['model']}")
    print(f"  Task: {result['task']}")
    print(f"  Zero-shot Accuracy: {result['accuracy']}")
    print(f"  Avg Tokens/Sample: {result['avg_tokens_per_sample']}")
    print(f"  Validation Cost: {result['estimated_validation_cost_usd']}")

# Decision recommendation: parse the formatted strings back into numbers
df_results = pd.DataFrame([{
    "model": r["model"],
    "accuracy": float(r["accuracy"].rstrip("%")) / 100,
    "cost": float(r["estimated_validation_cost_usd"].replace("$", "")),
} for r in results_summary])

print("\n=== RECOMMENDATION ===")
print("Choose gpt-4o-mini if:")
print("  - Accuracy target ≤ 90% and latency/cost critical")
print("  - Current zero-shot accuracy > 60%")
print("\nChoose gpt-4-turbo if:")
print("  - Accuracy target > 92% required")
print("  - Task involves reasoning or multi-step logic")
print("  - Fine-tuning improvement of +30pp is acceptable cost")

Example output (illustrative; your accuracy, token counts, and costs will differ):

=== MODEL CAPABILITY COMPARISON ===

Model: gpt-4o-mini-2024-07-18
  Task: customer_intent_classification
  Zero-shot Accuracy: 100.00%
  Avg Tokens/Sample: 18.0
  Validation Cost: $0.0023

Model: gpt-4-turbo-2024-04-09
  Task: customer_intent_classification
  Zero-shot Accuracy: 100.00%
  Avg Tokens/Sample: 35.0
  Validation Cost: $0.1050

=== RECOMMENDATION ===
Choose gpt-4o-mini if:
  - Accuracy target ≤ 90% and latency/cost critical
  - Current zero-shot accuracy > 60%

Choose gpt-4-turbo if:
  - Accuracy target > 92% required
  - Task involves reasoning or multi-step logic
  - Fine-tuning improvement of +30pp is acceptable cost

Your options
gpt-4o-mini-2024-07-18 (small, fast, cheap)
Best for: classification, entity extraction, simple intent detection, low-latency requirements, and budget-constrained teams, especially when your validation data already shows >60% zero-shot accuracy.
Pros
Fastest fine-tuning (minutes to hours), lowest cost per token ($0.15/$0.60 per 1M input/output tokens), smallest token overhead, production-ready latency for real-time APIs
Cons
Hard accuracy ceiling around 90-92% for complex reasoning; poor at tasks requiring multi-step logic or context synthesis; limited improvement from fine-tuning if zero-shot baseline is weak (<50%)
from openai import OpenAI

client = OpenAI()

# Each sample uses the chat fine-tuning format; the final message is the expected label
validation_samples = [
    {"messages": [{"role": "user", "content": "Classify: order return request"}, {"role": "assistant", "content": "customer_service"}]},
    {"messages": [{"role": "user", "content": "Classify: technical issue with api"}, {"role": "assistant", "content": "technical_support"}]},
]

results = []
for sample in validation_samples:
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=sample["messages"][:-1],  # send everything except the expected answer
        temperature=0.0,
        max_tokens=50,
    )
    results.append({
        "expected": sample["messages"][-1]["content"],
        "predicted": response.choices[0].message.content or "",  # content may be None
        "tokens_used": response.usage.total_tokens,
    })

# Lenient scoring: counts a hit if the expected label appears anywhere in the reply
accuracy = sum(1 for r in results if r["expected"] in r["predicted"]) / len(results)
print(f"gpt-4o-mini zero-shot accuracy: {accuracy:.2%}")
print(f"Avg tokens per request: {sum(r['tokens_used'] for r in results) / len(results):.1f}")

gpt-4-turbo-2024-04-09 (large, slower, expensive)
Best for: complex reasoning, multi-step logic, and nuanced instruction-following, where few-shot prompting already works well (>70% accuracy) and the performance ceiling is critical. Your task involves synthesis, rewriting, or conditional decision-making.
Pros
Higher accuracy ceiling (93-97%), better at novel task variations after fine-tuning, handles complex context and multi-turn logic, larger context window (128k tokens), improves 30-40 percentage points from fine-tuning vs baseline
Cons
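Fine-tuning takes hours, costs roughly 50-67x more per token than gpt-4o-mini at the list prices used here ($10/$30 vs $0.15/$0.60 per 1M input/output tokens), slower inference (higher latency), overkill for simple classification tasks, expensive validation runs burn budget fast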
from openai import OpenAI

client = OpenAI()

complex_task_samples = [
    {
        "messages": [
            {"role": "system", "content": "You are a legal contract reviewer. Identify risks and summarize liability clauses."},
            {"role": "user", "content": "Review this clause: 'Licensee assumes all liability for downstream damages...'"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a legal contract reviewer. Identify risks and summarize liability clauses."},
            {"role": "user", "content": "Review this clause: 'Provider indemnifies licensee except in cases of gross negligence...'"},
        ]
    },
]

results = []
for sample in complex_task_samples:
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=sample["messages"],
        temperature=0.0,
        max_tokens=300,
    )
    results.append({
        "response": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
        # $0.01 per 1k input tokens and $0.03 per 1k output tokens ($10/$30 per 1M)
        "cost_estimate_usd": (response.usage.prompt_tokens * 0.01 + response.usage.completion_tokens * 0.03) / 1000,
    })

total_cost = sum(r["cost_estimate_usd"] for r in results)
print(f"gpt-4-turbo validation cost: ${total_cost:.4f}")
print(f"Avg tokens per request: {sum(r['tokens'] for r in results) / len(results):.0f}")

gpt-3.5-turbo (deprecated path, not recommended)
Only if you have legacy fine-tuning jobs already running. Do not start new fine-tuning with this model.
Pros
Historical cost advantage (but no longer relevant), existing fine-tuned checkpoints can be migrated
Cons
OpenAI discontinued support for new fine-tuning jobs; model is older (worse reasoning), no longer receives updates, blocks access to new prompt engineering techniques, legacy tooling burden
# DO NOT USE FOR NEW WORK
# This pattern is deprecated as of April 2026
# Migration path: export training data in messages format, retrain on gpt-4o-mini-2024-07-18

Validation step
1) Run the evaluate_model_capability() function on your actual validation set (minimum 50 samples, ideally 100+) with both candidate models. 2) Compare accuracy: if both are ≥80%, choose by cost-latency tradeoff; if one is <60%, eliminate it. 3) Measure token overhead: gpt-4o-mini should use 15-25 tokens/sample, gpt-4-turbo 30-50. If your measurements differ wildly, suspect system prompt ambiguity or task misalignment. 4) Calculate total fine-tuning cost: training_tokens × num_epochs × price per 1M training tokens ÷ 1,000,000. If the estimate exceeds your budget by more than 20%, reconsider model size.
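The cost formula in step 4 can be checked with a few lines of arithmetic. The sketch below is a minimal illustration; the $30 per 1M training tokens rate is the illustrative figure used in the debug section of this guide, not a confirmed price, and the function names are made up for the example.

```python
def estimate_finetune_cost(
    training_tokens: int,
    num_epochs: int,
    price_per_1m_tokens_usd: float,
) -> float:
    """Training cost = tokens in the file x epochs x price per token."""
    return training_tokens * num_epochs * price_per_1m_tokens_usd / 1_000_000


def budget_overrun(estimate_usd: float, budget_usd: float) -> float:
    """Fraction by which the estimate exceeds the budget (negative if under)."""
    return (estimate_usd - budget_usd) / budget_usd


# 5,000 samples averaging 500 tokens each, 3 epochs, at an assumed $30 per 1M tokens.
training_tokens = 5_000 * 500
estimate = estimate_finetune_cost(training_tokens, num_epochs=3, price_per_1m_tokens_usd=30.0)
overrun = budget_overrun(estimate, budget_usd=180.0)

print(f"Estimated cost: ${estimate:.2f}")
print(f"Over budget by: {overrun:.0%}")
if overrun > 0.20:
    print("More than 20% over budget: reconsider model size or sample length.")
```

Substitute the training price for your chosen model and the token count measured from your actual training file; the estimate is only as good as those two inputs.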
At scale
At pilot scale (100-500 training samples), both models fine-tune fast and cost prediction is accurate. At production scale (10k+ training samples), gpt-4-turbo fine-tuning can take 6-12 hours and cost $500-2000 depending on epochs. gpt-4o-mini scales linearly but has ~2-3pp accuracy ceiling disadvantage for reasoning-heavy tasks. If you discover at 5k samples that gpt-4o-mini accuracy plateaus at 88% but your target is 95%, switching to gpt-4-turbo and retraining wastes days. The evaluation phase is mandatory; skipping it leads to this failure.
Rollback plan
If after fine-tuning you discover the chosen model can't reach your accuracy target (e.g., stuck at 85% when you need 94%): 1) Stop training immediately to save GPU hours. 2) Export your prepared JSONL training data (you saved it, right?). 3) Retrain the alternate model using identical data and epochs: don't retune hyperparameters yet. 4) Compare results. If new model reaches target, switch production endpoint. If not, your task may be fundamentally hard; increase training sample quality/diversity before blaming model choice.
Debug symptoms
gpt-4o-mini fine-tuning shows improvement in training loss but validation accuracy plateaus at 87% after epoch 2
Diagnosis
Model capacity ceiling reached. gpt-4o-mini isn't powerful enough for this task; fine-tuning teaches memorization, not generalization.
Fix
Stop training, export JSONL data, retrain with gpt-4-turbo-2024-04-09. Expected improvement +8-12pp to 95-99%. Accept higher cost as necessary.
gpt-4-turbo fine-tuning cost estimate was $200 but actual bill is $1200; training took 14 hours when docs promised 'a few hours'
Diagnosis
Underestimated token count. Your training samples average 500+ tokens each (verbose examples, long context). The estimate assumed 5000 samples × 500 tokens × 3 epochs = 7.5M tokens, which at $30 per 1M tokens comes to $225. A $1200 bill at the same rate means about 40M billed tokens, so the dataset was larger or the samples were longer than assumed.
Fix
For future: inspect actual token count in training file with tiktoken before submitting. Limit sample length with token budgets in data prep. Use gpt-4o-mini for cost-sensitive iterations; only use gpt-4-turbo for final production run.
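A pre-submission audit along these lines might look like the sketch below. It takes the token counter as a parameter so it runs without extra dependencies; the default `rough_token_count` is a crude characters-divided-by-four heuristic (an assumption for illustration, not how the API bills), and the function names are hypothetical.

```python
import json
from typing import Callable, Iterable


def rough_token_count(text: str) -> int:
    """Crude stand-in: about 4 characters per token for English text (assumption)."""
    return max(1, len(text) // 4)


def audit_training_file(
    jsonl_lines: Iterable[str],
    max_tokens_per_sample: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> dict:
    """Sum tokens across a chat-format JSONL file and flag over-long samples."""
    total_tokens = 0
    over_budget = []
    for index, line in enumerate(jsonl_lines):
        record = json.loads(line)
        sample_tokens = sum(count_tokens(m["content"]) for m in record["messages"])
        total_tokens += sample_tokens
        if sample_tokens > max_tokens_per_sample:
            over_budget.append(index)
    return {"total_tokens": total_tokens, "over_budget_samples": over_budget}


# Two tiny in-memory samples standing in for lines read from a training file.
lines = [
    json.dumps({"messages": [
        {"role": "user", "content": "Classify: order return request"},
        {"role": "assistant", "content": "returns"},
    ]}),
    json.dumps({"messages": [
        {"role": "user", "content": "Classify: " + "very long context " * 40},
        {"role": "assistant", "content": "billing"},
    ]}),
]
report = audit_training_file(lines, max_tokens_per_sample=50)
print(report)
```

For real counts, pass a tiktoken-based counter instead of the heuristic, for example `enc = tiktoken.get_encoding("o200k_base")` and `count_tokens=lambda text: len(enc.encode(text))`. Billed totals also include a small per-message overhead, so treat the result as a lower bound.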
Model choice evaluation code throws 'RateLimitError: 429' after ~20 validation samples
Diagnosis
OpenAI API rate limits on your account tier (too many requests per minute). Validation runs stress the API.
Fix
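Retry with exponential backoff: on a 429, wait and retry, doubling the delay after each failure. A fixed `time.sleep(0.5)` between calls lowers your request rate but does not recover from a rate-limit error. For large validation sets (100+ samples), use the Batch API instead of serial calls.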
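A retry wrapper with exponential backoff could be sketched as follows. The helper name `call_with_backoff`, its defaults, and the `is_retryable` hook are assumptions for illustration; the simulated `flaky_call` stands in for an API request so the example runs without network access.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay_s: float = 0.5,
    is_retryable: Callable[[Exception], bool] = lambda exc: True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a retryable error wait base * 2^attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries or not is_retryable(exc):
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
            sleep(delay)
    raise RuntimeError("unreachable")  # loop always returns or raises


# Simulated flaky call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"


# sleep is replaced with a no-op here so the demo finishes instantly.
result = call_with_backoff(flaky_call, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

With the OpenAI SDK you would wrap the request, for example `call_with_backoff(lambda: client.chat.completions.create(...), is_retryable=lambda e: isinstance(e, openai.RateLimitError))`, so that only rate-limit errors are retried and other failures surface immediately.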
Production upgrade path
Tutorial version: run both models on a small dataset and pick by accuracy. Production version: 1) Segment your validation data by difficulty/domain (if multi-domain). 2) Test each model on each segment separately: a model may excel at simple cases but fail at edge cases. 3) Calculate fine-tuning cost per 1pp of improvement (the cost to reach 90%, 92%, 94%, 95%), not just absolute cost. 4) Set an 'investment threshold': if reaching 95% accuracy requires >$5000 in fine-tuning, invest in data quality improvement instead. 5) Version your evaluation results and rerun them quarterly: new model snapshots ship over time, and regular re-evaluation catches performance regressions.
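Steps 2 and 3 can be sketched with two small helpers. The `segment` key is a hypothetical field you would add to each prediction record (the `correct` key matches what `evaluate_model_capability()` already produces), and the dollar figures in the example are made up.

```python
from collections import defaultdict


def accuracy_by_segment(predictions: list[dict]) -> dict[str, float]:
    """Group predictions by their 'segment' label and compute accuracy per group."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for p in predictions:
        bucket = counts[p["segment"]]
        bucket[0] += int(p["correct"])
        bucket[1] += 1
    return {segment: correct / total for segment, (correct, total) in counts.items()}


def cost_per_point(cost_usd: float, accuracy_before: float, accuracy_after: float) -> float:
    """Dollars spent per percentage point of accuracy gained."""
    gained_pp = (accuracy_after - accuracy_before) * 100
    if gained_pp <= 0:
        return float("inf")  # money spent, nothing gained
    return cost_usd / gained_pp


# Hypothetical prediction records tagged by difficulty.
predictions = [
    {"segment": "easy", "correct": True},
    {"segment": "easy", "correct": True},
    {"segment": "easy", "correct": True},
    {"segment": "easy", "correct": False},
    {"segment": "edge_case", "correct": True},
    {"segment": "edge_case", "correct": False},
]
print(accuracy_by_segment(predictions))
print(f"${cost_per_point(500.0, 0.80, 0.90):.2f} per percentage point")
```

A model whose overall accuracy looks fine can still fail a whole segment, which is why the per-segment breakdown is worth computing before comparing totals.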
Common gotcha
You run a quick 10-sample test with gpt-4o-mini, get 90% accuracy, and commit to it. Then fine-tuning only improves to 92% (ceiling). Meanwhile, gpt-4-turbo zero-shot would have reached 95% without any fine-tuning. The mistake: you didn't test on a representative, large-enough validation set. Small samples hide distribution shift. Always test with ≥50 diverse samples spanning edge cases, not just obvious examples. A 10-sample test that looks good is coincidence, not confidence.
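To see why a 10-sample result carries so little confidence, it helps to put an interval around the measured accuracy. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is one reasonable method, not something this workflow prescribes.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


for correct, total in [(9, 10), (45, 50), (90, 100)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {low:.1%} to {high:.1%}")
```

For 9 correct out of 10, the interval runs from about 60% to about 98%, so the same test is consistent with a model that is mediocre and one that is excellent. The interval narrows as the sample grows, which is the quantitative case for the 50-sample minimum.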
Experienced dev note
The real decision isn't model capability: it's training data quality. A gpt-4o-mini fine-tuned on clean, diverse, labeled data often beats a gpt-4-turbo fine-tuned on noisy data. Practitioners know this: spend your evaluation budget on a few validation runs to understand your task's noise floor (Can humans solve this 100% consistently?), not on testing every model variant. If humans agree <95% on labels, no model choice fixes that. Second, zero-shot performance is a stronger predictor of fine-tuning ceiling than model size. If gpt-4o-mini gets 50% zero-shot, it won't reach 90% fine-tuned. If it gets 75% zero-shot, it'll reach 91-93% fine-tuned. This relationship is roughly linear within ±3pp per model class. Use it to short-circuit decision cycles: cheap validation (gpt-4o-mini) tells you if fine-tuning is viable; if not, spend time on data, not model hunting.
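One quick way to estimate the noise floor mentioned above is to have two or more people label the same validation samples and measure how often they agree. The sketch below computes raw pairwise agreement; it does not correct for chance agreement (as Cohen's kappa would), and the labels shown are invented for illustration.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator: list[list[str]]) -> float:
    """Fraction of (annotator pair, sample) comparisons where the labels match."""
    agreements = 0
    comparisons = 0
    for first, second in combinations(labels_by_annotator, 2):
        if len(first) != len(second):
            raise ValueError("every annotator must label the same samples")
        for a, b in zip(first, second):
            agreements += int(a == b)
            comparisons += 1
    if comparisons == 0:
        raise ValueError("need at least two annotators and one sample")
    return agreements / comparisons


# Hypothetical labels from two annotators on the same five messages.
annotator_a = ["returns", "tracking", "billing", "technical_support", "tracking"]
annotator_b = ["returns", "tracking", "billing", "returns", "tracking"]
agreement = pairwise_agreement([annotator_a, annotator_b])
print(f"Human agreement: {agreement:.0%}")
if agreement < 0.95:
    print("Labels are noisy: fix the labeling guidelines before comparing models.")
```

If agreement comes out well below 95%, the accuracy numbers from any model are being scored against an unreliable answer key, and cleaning the labels is likely a better use of budget than another evaluation run.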
Check your understanding
You've evaluated both gpt-4o-mini and gpt-4-turbo on 80 validation samples for a contract summarization task. gpt-4o-mini achieved 72% accuracy; gpt-4-turbo achieved 84% accuracy. Budget is tight. Should you fine-tune gpt-4o-mini to reduce costs, or switch to gpt-4-turbo? Justify your answer with a specific reason grounded in this step's concepts.
Answer hint
The answer depends on identifying gpt-4o-mini's likely fine-tuning ceiling given its 72% zero-shot baseline (probably 88-91% at best, improvement of +16-19pp), vs gpt-4-turbo's ceiling from 84% baseline (likely 95-97%, improvement of +11-13pp). If your production target is >92% accuracy, gpt-4o-mini's ceiling is the constraining factor, not cost. The experienced decision: switch to gpt-4-turbo because gpt-4o-mini has already shown it can't reach the target, and fine-tuning a limited model is waste.