Model capability comparison for fine-tuning

What you will learn

Evaluate which OpenAI model to fine-tune by measuring performance ceiling, cost efficiency, and task-specific capability trade-offs before committing training resources.

Step 2: Pre-training assessment. Run this after defining your task but before collecting and preparing training data.

Why this matters

Choosing the wrong base model wastes days of training time and training spend on a model that can't solve your problem at any quality level. A model that is too weak won't reach production accuracy; a model that is more capable than the task needs wastes budget. This decision gates all downstream work.

Explanation

What this step does: You systematically evaluate candidate OpenAI models (gpt-4o-mini-2024-07-18, gpt-4-turbo-2024-04-09, and older gpt-3.5-turbo variants) against your specific task using zero-shot and few-shot prompting. You measure performance ceiling, latency, cost per token, and context window fit before any fine-tuning investment. Confirm in OpenAI's fine-tuning documentation that each candidate is currently available for fine-tuning on your account, because the supported list changes over time.

How to do it: Run your validation dataset through each candidate model using identical prompts at the same temperature and settings. Capture accuracy, token usage, and latency. Create a decision matrix scoring each model on task fit, cost-per-improvement ratio, and production constraints. The model that reaches ~80% of your target accuracy with zero-shot typically has the lowest fine-tuning cost to reach 95%+ accuracy.

What to watch for: Smaller models (gpt-4o-mini) often fine-tune faster and more cheaply but may hit a hard accuracy ceiling on complex reasoning tasks. Larger models (gpt-4-turbo) cost more per token and take longer to fine-tune but can reach higher ceilings. If a model fails consistently at zero-shot (below 40% accuracy), fine-tuning it rarely adds more than 20-30 percentage points. In short, gpt-4o-mini is the aggressive cost-optimization choice: it trades inference quality headroom for training speed and suits classification and extraction. gpt-4-turbo is for complex multi-step reasoning where fine-tuning must teach nuanced behavior.
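The decision matrix described above could be sketched as follows. This is a minimal illustration rather than a prescribed formula: the `Candidate` class, the `score_models` helper, the weights, and the sample measurements are all hypothetical, and you would substitute your own numbers and priorities.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str
    zero_shot_accuracy: float   # fraction in [0, 1], measured on your validation set
    cost_per_sample_usd: float  # measured validation cost divided by sample count
    avg_latency_s: float        # measured mean latency per request


def score_models(candidates, target_accuracy, max_latency_s, weights=(0.6, 0.3, 0.1)):
    """Rank candidates by a weighted score; higher is better.

    Hypothetical scoring: task fit is accuracy relative to the target (capped at 1);
    cost and latency are normalised against the worst candidate.
    """
    w_fit, w_cost, w_latency = weights
    worst_cost = max(c.cost_per_sample_usd for c in candidates) or 1.0
    worst_latency = max(c.avg_latency_s for c in candidates) or 1.0
    rows = []
    for c in candidates:
        if c.avg_latency_s > max_latency_s:
            continue  # violates a hard production constraint, so drop it
        fit = min(c.zero_shot_accuracy / target_accuracy, 1.0)
        cost_score = 1.0 - c.cost_per_sample_usd / worst_cost
        latency_score = 1.0 - c.avg_latency_s / worst_latency
        score = w_fit * fit + w_cost * cost_score + w_latency * latency_score
        rows.append((round(score, 3), c.model))
    return sorted(rows, reverse=True)


# Made-up measurements, for illustration only
candidates = [
    Candidate("gpt-4o-mini-2024-07-18", 0.78, 0.00002, 0.6),
    Candidate("gpt-4-turbo-2024-04-09", 0.86, 0.00100, 1.8),
]
ranking = score_models(candidates, target_accuracy=0.95, max_latency_s=3.0)
for score, model in ranking:
    print(f"{model}: {score}")
```

The weights encode how much you value task fit against cost and latency; shifting weight toward task fit favours the larger model.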

Code

python

# pip install openai pandas
import time

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prices quoted in this step (April 2026), in USD per 1M tokens: (input, output).
# Verify them against the current OpenAI pricing page before relying on them.
PRICE_PER_1M_TOKENS = {
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
    "gpt-4-turbo-2024-04-09": (10.0, 30.0),
}

def evaluate_model_capability(
    model: str,
    task_name: str,
    validation_samples: list[dict],
    system_prompt: str = ""
) -> dict:
    """
    Run validation dataset through a model and measure performance metrics.
    
    Args:
        model: Model identifier (e.g., 'gpt-4o-mini-2024-07-18')
        task_name: Human-readable task name for logging
        validation_samples: List of dicts with 'input' and 'expected_output' keys
        system_prompt: Optional system prompt for context
    
    Returns:
        Dict with accuracy, token usage, estimated cost, latency
    """
    if not validation_samples:
        raise ValueError("validation_samples must not be empty")

    predictions = []
    prompt_tokens = 0
    completion_tokens = 0
    latencies = []
    
    for sample in validation_samples:
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": sample["input"]})
        
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
            max_tokens=200
        )
        latencies.append(time.perf_counter() - start)
        
        # content can be None, so fall back to an empty string
        predicted = (response.choices[0].message.content or "").strip()
        expected = sample["expected_output"].strip()
        
        # Exact match, ignoring case
        match = predicted.lower() == expected.lower()
        predictions.append({
            "input": sample["input"],
            "expected": expected,
            "predicted": predicted,
            "correct": match
        })
        prompt_tokens += response.usage.prompt_tokens
        completion_tokens += response.usage.completion_tokens
    
    total_tokens = prompt_tokens + completion_tokens
    accuracy = sum(1 for p in predictions if p["correct"]) / len(predictions)
    
    # Bill input and output tokens at their own rates; unknown models report 0.0
    input_price, output_price = PRICE_PER_1M_TOKENS.get(model, (0.0, 0.0))
    estimated_cost = (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
    
    return {
        "model": model,
        "task": task_name,
        "accuracy": f"{accuracy:.2%}",
        "total_tokens": total_tokens,
        "avg_tokens_per_sample": f"{total_tokens / len(predictions):.1f}",
        "avg_latency_s": f"{sum(latencies) / len(latencies):.2f}",
        "estimated_validation_cost_usd": f"${estimated_cost:.4f}",
        "predictions": predictions
    }

# Example validation dataset: intent classification
validation_data = [
    {"input": "I want to return my order", "expected_output": "returns"},
    {"input": "How do I track my shipment?", "expected_output": "tracking"},
    {"input": "I was charged twice", "expected_output": "billing"},
    {"input": "The product doesn't work", "expected_output": "technical_support"},
    {"input": "When will it arrive?", "expected_output": "tracking"},
]

system_context = "Classify customer messages into these categories: returns, tracking, billing, technical_support. Output only the category name."

# Evaluate both models
models_to_test = [
    "gpt-4o-mini-2024-07-18",
    "gpt-4-turbo-2024-04-09"
]

results_summary = []
for model_id in models_to_test:
    try:
        result = evaluate_model_capability(
            model=model_id,
            task_name="customer_intent_classification",
            validation_samples=validation_data,
            system_prompt=system_context
        )
        results_summary.append(result)
    except Exception as e:
        print(f"Error testing {model_id}: {e}")

# Print decision matrix
print("\n=== MODEL CAPABILITY COMPARISON ===")
for result in results_summary:
    print(f"\nModel: {result['model']}")
    print(f"  Task: {result['task']}")
    print(f"  Zero-shot Accuracy: {result['accuracy']}")
    print(f"  Avg Tokens/Sample: {result['avg_tokens_per_sample']}")
    print(f"  Validation Cost: {result['estimated_validation_cost_usd']}")

# Decision recommendation
df_results = pd.DataFrame([{
    "model": r["model"],
    "accuracy": float(r["accuracy"].rstrip("%")) / 100,
    "cost": float(r["estimated_validation_cost_usd"].replace("$", ""))
} for r in results_summary])

print("\n=== RECOMMENDATION ===")
print("Choose gpt-4o-mini if:")
print("  - Accuracy target ≤ 90% and latency/cost critical")
print("  - Current zero-shot accuracy > 60%")
print("\nChoose gpt-4-turbo if:")
print("  - Accuracy target > 92% required")
print("  - Task involves reasoning or multi-step logic")
print("  - Fine-tuning improvement of +30pp is acceptable cost")

Output (illustrative; your accuracy, token counts, and costs will differ)

=== MODEL CAPABILITY COMPARISON ===

Model: gpt-4o-mini-2024-07-18
  Task: customer_intent_classification
  Zero-shot Accuracy: 100.00%
  Avg Tokens/Sample: 18.0
  Validation Cost: $0.0023

Model: gpt-4-turbo-2024-04-09
  Task: customer_intent_classification
  Zero-shot Accuracy: 100.00%
  Avg Tokens/Sample: 35.0
  Validation Cost: $0.1050

=== RECOMMENDATION ===
Choose gpt-4o-mini if:
  - Accuracy target ≤ 90% and latency/cost critical
  - Current zero-shot accuracy > 60%

Choose gpt-4-turbo if:
  - Accuracy target > 92% required
  - Task involves reasoning or multi-step logic
  - Fine-tuning improvement of +30pp is acceptable cost

Your options

Recommended

gpt-4o-mini-2024-07-18 (small, fast, cheap)

Classification, entity extraction, simple intent detection, low-latency requirements, budget-constrained teams. Your validation data shows >60% zero-shot accuracy already.

Pros

Fastest fine-tuning (minutes to hours), lowest cost per token ($0.15/$0.60 per 1M input/output tokens), smallest token overhead, production-ready latency for real-time APIs

Cons

Hard accuracy ceiling around 90-92% for complex reasoning; poor at tasks requiring multi-step logic or context synthesis; limited improvement from fine-tuning if zero-shot baseline is weak (<50%)

from openai import OpenAI

client = OpenAI()

# Samples use the fine-tuning "messages" format; the final assistant turn is the expected label
validation_samples = [
    {"messages": [{"role": "user", "content": "Classify: order return request"}, {"role": "assistant", "content": "customer_service"}]},
    {"messages": [{"role": "user", "content": "Classify: technical issue with api"}, {"role": "assistant", "content": "technical_support"}]},
]

results = []
for sample in validation_samples:
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=sample["messages"][:-1],  # send everything except the expected answer
        temperature=0.0,
        max_tokens=50
    )
    results.append({
        "expected": sample["messages"][-1]["content"],
        "predicted": response.choices[0].message.content or "",
        "tokens_used": response.usage.total_tokens
    })

# Lenient match: the expected label appears anywhere in the reply, ignoring case
accuracy = sum(1 for r in results if r["expected"].lower() in r["predicted"].lower()) / len(results)
print(f"gpt-4o-mini zero-shot accuracy: {accuracy:.2%}")
print(f"Avg tokens per request: {sum(r['tokens_used'] for r in results) / len(results):.1f}")

gpt-4-turbo-2024-04-09 (large, slower, expensive)

Complex reasoning, multi-step logic, nuanced instruction-following, few-shot prompting already works well (>70% accuracy), performance ceiling is critical. Your task involves synthesis, rewriting, or conditional decision-making.

Pros

Higher accuracy ceiling (93-97%), better at novel task variations after fine-tuning, handles complex context and multi-turn logic, larger context window (128k tokens), improves 30-40 percentage points from fine-tuning vs baseline

Cons

Fine-tuning takes hours, costs roughly 50-67x more per token at the prices quoted in this step ($10/$30 vs $0.15/$0.60 per 1M input/output tokens), slower inference (higher latency), overkill for simple classification tasks, expensive validation runs burn budget fast

from openai import OpenAI

client = OpenAI()

complex_task_samples = [
    {
        "messages": [
            {"role": "system", "content": "You are a legal contract reviewer. Identify risks and summarize liability clauses."},
            {"role": "user", "content": "Review this clause: 'Licensee assumes all liability for downstream damages...'"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a legal contract reviewer. Identify risks and summarize liability clauses."},
            {"role": "user", "content": "Review this clause: 'Provider indemnifies licensee except in cases of gross negligence...'"}
        ]
    },
]

results = []
for sample in complex_task_samples:
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",
        messages=sample["messages"],
        temperature=0.0,
        max_tokens=300
    )
    results.append({
        "response": response.choices[0].message.content,
        "tokens": response.usage.total_tokens,
        "cost_estimate_usd": (response.usage.prompt_tokens * 0.01 + response.usage.completion_tokens * 0.03) / 1000
    })

total_cost = sum(r["cost_estimate_usd"] for r in results)
print(f"gpt-4-turbo validation cost: ${total_cost:.4f}")
print(f"Avg tokens per request: {sum(r['tokens'] for r in results) / len(results):.0f}")

gpt-3.5-turbo (deprecated path, not recommended)

Only if you have legacy fine-tuning jobs already running. Do not start new fine-tuning with this model.

Pros

Historical cost advantage (but no longer relevant), existing fine-tuned checkpoints can be migrated

Cons

OpenAI discontinued support for new fine-tuning jobs; model is older (worse reasoning), no longer receives updates, blocks access to new prompt engineering techniques, legacy tooling burden

# DO NOT USE FOR NEW WORK
# This pattern is deprecated as of April 2026
# Migration path: export training data in messages format, retrain on gpt-4o-mini-2024-07-18

Validation step

1) Run the evaluate_model_capability() function on your actual validation set (minimum 50 samples, ideally 100+) with both candidate models.
2) Compare accuracy: if both are ≥80%, choose by cost-latency tradeoff; if one is <60%, eliminate it.
3) Measure token overhead: gpt-4o-mini should use 15-25 tokens/sample, gpt-4-turbo 30-50. If your measurements differ wildly, suspect system prompt ambiguity or task misalignment.
4) Calculate total fine-tuning cost: (training_tokens / 1000) × cost_per_1k × num_epochs. If the estimate exceeds your budget by more than 20%, reconsider model size.
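The cost calculation in the last step can be sketched as below. The helper names are hypothetical, the price is expressed per 1M tokens, and the $30 per 1M training rate is only the figure assumed in this step's debug example, not a confirmed price.

```python
def estimate_finetune_cost(training_tokens: int, price_per_1m_usd: float, num_epochs: int) -> float:
    """Billed training tokens are the tokens in the file multiplied by the epoch count."""
    return training_tokens * num_epochs * price_per_1m_usd / 1_000_000


def over_budget_ratio(estimate_usd: float, budget_usd: float) -> float:
    """Fraction by which the estimate exceeds the budget (negative if under budget)."""
    return (estimate_usd - budget_usd) / budget_usd


# 5000 samples averaging 500 tokens, 3 epochs, assumed $30 per 1M training tokens
estimate = estimate_finetune_cost(training_tokens=5000 * 500, price_per_1m_usd=30.0, num_epochs=3)
ratio = over_budget_ratio(estimate, budget_usd=180.0)

print(f"Estimated fine-tuning cost: ${estimate:.2f}")
if ratio > 0.20:
    print(f"Over budget by {ratio:.0%}: reconsider model size")
```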

At scale

At pilot scale (100-500 training samples), both models fine-tune fast and cost prediction is accurate. At production scale (10k+ training samples), gpt-4-turbo fine-tuning can take 6-12 hours and cost $500-2000 depending on epochs. gpt-4o-mini scales linearly but has ~2-3pp accuracy ceiling disadvantage for reasoning-heavy tasks. If you discover at 5k samples that gpt-4o-mini accuracy plateaus at 88% but your target is 95%, switching to gpt-4-turbo and retraining wastes days. The evaluation phase is mandatory; skipping it leads to this failure.

Rollback plan

If after fine-tuning you discover the chosen model can't reach your accuracy target (e.g., stuck at 85% when you need 94%):
1) Stop training immediately to save training spend.
2) Export your prepared JSONL training data (you saved it, right?).
3) Retrain the alternate model using identical data and epochs; don't retune hyperparameters yet.
4) Compare results. If the new model reaches the target, switch the production endpoint. If not, your task may be fundamentally hard; increase training sample quality/diversity before blaming model choice.

Debug symptoms

gpt-4o-mini fine-tuning shows improvement in training loss but validation accuracy plateaus at 87% after epoch 2

Diagnosis

Model capacity ceiling reached. gpt-4o-mini isn't powerful enough for this task; fine-tuning teaches memorization, not generalization.

Fix

Stop training, export JSONL data, retrain with gpt-4-turbo-2024-04-09. Expected improvement +8-12pp to 95-99%. Accept higher cost as necessary.

gpt-4-turbo fine-tuning cost estimate was $200 but actual bill is $1200; training took 14 hours when docs promised 'a few hours'

Diagnosis

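Underestimated token count. Your training samples average 500+ tokens each (verbose examples, long context). Calculation: 5000 samples × 500 tokens × 3 epochs = 7.5M tokens, which at $30 per 1M tokens comes to $225. A $1200 bill at that rate corresponds to about 40M billed tokens, so the actual dataset was larger, the samples were longer, or more epochs ran than you assumed.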

Fix

For future runs: count the actual tokens in your training file with tiktoken before submitting. Cap sample length with a token budget during data prep. Use gpt-4o-mini for cost-sensitive iterations; only use gpt-4-turbo for the final production run.
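A pre-submission token check along these lines might look like the sketch below. The `training_file_token_stats` helper and its `count_tokens` parameter are hypothetical; the default counter is a crude 4-characters-per-token approximation so the example runs without extra dependencies. To get real counts, pass a tiktoken-based counter such as `lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))`, assuming that encoding matches your model. Chat formatting adds a few tokens per message that this sketch ignores.

```python
import json


def rough_token_count(text: str) -> int:
    """Crude fallback: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)


def training_file_token_stats(jsonl_lines, count_tokens=rough_token_count, max_tokens_per_sample=500):
    """Summarise token usage for chat-format JSONL records ({"messages": [...]})."""
    totals = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        totals.append(sum(count_tokens(m["content"]) for m in record["messages"]))
    return {
        "samples": len(totals),
        "total_tokens": sum(totals),
        "max_tokens": max(totals, default=0),
        "over_budget": sum(1 for n in totals if n > max_tokens_per_sample),
    }


# In practice, iterate over the lines of your training file instead of this inline list
lines = [
    json.dumps({"messages": [
        {"role": "user", "content": "I was charged twice"},
        {"role": "assistant", "content": "billing"},
    ]}),
    json.dumps({"messages": [
        {"role": "user", "content": "How do I track my shipment?"},
        {"role": "assistant", "content": "tracking"},
    ]}),
]
stats = training_file_token_stats(lines)
print(stats)
```

Multiply `total_tokens` by the planned epoch count and the training price to get the cost estimate before you submit the job.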

Model choice evaluation code throws 'RateLimitError: 429' after ~20 validation samples

Diagnosis

OpenAI API rate limits on your account tier (too many requests per minute). Validation runs stress the API.

Fix

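Retry failed calls with exponential backoff: wait after a 429, then double the wait on each further failure. A fixed `time.sleep(0.5)` between API calls slows the run down but is not backoff. For large validation sets (100+ samples), use the Batch API instead of serial calls.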
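A minimal sketch of exponential backoff with jitter follows. `call_with_backoff` is a hypothetical helper, and `FakeRateLimit` stands in for the real error so the example runs offline; with the OpenAI SDK you would likely pass `retry_on=(openai.RateLimitError,)` and wrap the `client.chat.completions.create(...)` call in a lambda.

```python
import random
import time


def call_with_backoff(fn, *, retry_on=(Exception,), max_retries=5,
                      base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries, surface the original error
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


class FakeRateLimit(Exception):
    """Stand-in for a 429 error, used only in this demo."""


attempts = {"n": 0}


def flaky():
    # Fails twice, then succeeds
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeRateLimit("429")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, retry_on=(FakeRateLimit,), sleep=waits.append)
print(result, [round(w, 2) for w in waits])
```

Injecting `sleep` keeps the helper testable; in real use, leave it at the default `time.sleep`.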

Production upgrade path

Tutorial version: run both models on a small dataset and pick by accuracy. Production version:
1) Segment your validation data by difficulty/domain (if multi-domain).
2) Test each model on each segment separately: a model may excel at simple cases but fail at edge cases.
3) Calculate fine-tuning cost per 1pp improvement (cost to reach 90%, 92%, 94%, 95%), not just absolute cost.
4) Set an 'investment threshold': if reaching 95% accuracy requires >$5000 in fine-tuning, invest in data quality improvement instead.
5) Version your evaluation results and rerun them quarterly: model capabilities change between releases, and regular re-evaluation catches performance regressions.
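Per-segment accuracy and cost per percentage point could be computed as in this sketch. The `segment` key, both helper names, and the sample figures are assumptions for illustration; the evaluation function in this step does not record a segment unless you add one to each sample.

```python
from collections import defaultdict


def accuracy_by_segment(predictions):
    """predictions: dicts with a 'segment' label (hypothetical key) and a boolean 'correct'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in predictions:
        totals[p["segment"]] += 1
        hits[p["segment"]] += int(p["correct"])
    return {segment: hits[segment] / totals[segment] for segment in totals}


def cost_per_pp(baseline_accuracy, tuned_accuracy, finetune_cost_usd):
    """Fine-tuning cost divided by the accuracy gain in percentage points."""
    gain_pp = (tuned_accuracy - baseline_accuracy) * 100
    if gain_pp <= 0:
        return float("inf")  # no improvement, so any spend is unbounded per point
    return finetune_cost_usd / gain_pp


# Made-up predictions, for illustration only
predictions = [
    {"segment": "simple", "correct": True},
    {"segment": "simple", "correct": True},
    {"segment": "simple", "correct": True},
    {"segment": "simple", "correct": False},
    {"segment": "edge_case", "correct": True},
    {"segment": "edge_case", "correct": False},
]
by_segment = accuracy_by_segment(predictions)
for segment, acc in by_segment.items():
    print(f"{segment}: {acc:.0%}")

print(f"Cost per pp: ${cost_per_pp(0.72, 0.90, 360.0):.2f}")
```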

Common gotcha

You run a quick 10-sample test with gpt-4o-mini, get 90% accuracy, and commit to it. Then fine-tuning only improves to 92% (ceiling). Meanwhile, gpt-4-turbo zero-shot would have reached 95% without any fine-tuning. The mistake: you didn't test on a representative, large-enough validation set. Small samples hide distribution shift. Always test with ≥50 diverse samples spanning edge cases, not just obvious examples. A 10-sample test that looks good is coincidence, not confidence.
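One way to see why a 10-sample result is weak evidence is to put a confidence interval around it. The sketch below uses the Wilson score interval, a standard choice for proportions; the `wilson_interval` helper name and the sample counts are illustrative.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# The same 90% observed accuracy at three different sample sizes
for correct, total in [(9, 10), (45, 50), (180, 200)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {low:.1%} to {high:.1%}")
```

With 9 of 10 correct, the interval spans roughly 60% to 98%, so the true accuracy could sit far below the observed 90%. The interval narrows as the sample grows.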

Experienced dev note

The real decision isn't model capability; it's training data quality. A gpt-4o-mini fine-tuned on clean, diverse, labeled data often beats a gpt-4-turbo fine-tuned on noisy data. Spend your evaluation budget on a few validation runs that reveal your task's noise floor (can humans solve this 100% consistently?), not on testing every model variant. If humans agree on fewer than 95% of labels, no model choice fixes that.

Zero-shot performance is also a stronger predictor of fine-tuning ceiling than model size. If gpt-4o-mini gets 50% zero-shot, it won't reach 90% fine-tuned. If it gets 75% zero-shot, it'll reach 91-93% fine-tuned. This relationship is roughly linear, within ±3pp per model class. Use it to short-circuit decision cycles: a cheap validation run on gpt-4o-mini tells you whether fine-tuning is viable; if not, spend time on data, not model hunting.
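The human noise-floor check mentioned above can be approximated by having two annotators label the same items and measuring how often they agree. This is a minimal sketch with a hypothetical `percent_agreement` helper and made-up labels; a chance-corrected statistic such as Cohen's kappa would be a stricter alternative.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Made-up labels from two annotators on the same five messages
annotator_1 = ["returns", "tracking", "billing", "technical_support", "tracking"]
annotator_2 = ["returns", "tracking", "billing", "returns", "tracking"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"Human agreement: {agreement:.0%}")
if agreement < 0.95:
    print("Label noise is likely the bottleneck: fix the labeling guidelines before comparing models")
```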

Check your understanding

You've evaluated both gpt-4o-mini and gpt-4-turbo on 80 validation samples for a contract summarization task. gpt-4o-mini achieved 72% accuracy; gpt-4-turbo achieved 84% accuracy. Budget is tight. Should you fine-tune gpt-4o-mini to reduce costs, or switch to gpt-4-turbo? Justify your answer with a specific reason grounded in this step's concepts.

Answer hint

The answer depends on identifying gpt-4o-mini's likely fine-tuning ceiling given its 72% zero-shot baseline (probably 88-91% at best, an improvement of +16-19pp) versus gpt-4-turbo's ceiling from its 84% baseline (likely 95-97%, an improvement of +11-13pp). If your production target is >92% accuracy, gpt-4o-mini's ceiling is the constraining factor, not cost. The experienced decision: switch to gpt-4-turbo, because gpt-4o-mini's projected ceiling falls short of the target and fine-tuning a model that cannot get there is wasted spend.
