ModelSelectionDecision
whisper.load_model(): model size selection (not an exception, a configuration decision)
Stack trace
# Not an exception — this is a decision point in your code.
# Common symptom: slow inference on edge devices or poor accuracy on production audio
# Example incorrect choice:
model = whisper.load_model('large-v3') # Crashes on 2GB RAM or takes 30+ sec per audio file
# vs.
model = whisper.load_model('tiny') # Fast but 10-15% word error rate on noisy audio
Why it happens
Whisper ships in five main sizes (tiny, base, small, medium, large), each a different point on the accuracy-speed-memory spectrum. The tiny model has 39M parameters (about 1 GB of VRAM per the project README) and runs on CPU in roughly 1-2 seconds per minute of audio, but has 10-15% WER (word error rate) on challenging audio. The large-v3 model has 1550M parameters (a roughly 2.9 GB checkpoint needing about 10 GB of VRAM) and requires a GPU for reasonable speed (roughly 3-5 seconds per minute), but achieves 3-5% WER on the same audio. Choosing blindly without understanding your constraints (latency budget, hardware, accuracy requirements, cost) leads to either unacceptable quality or unacceptable latency in production.
Detection
Profile inference latency and accuracy on a representative audio sample before deploying. Time model.transcribe() (for example with time.perf_counter()) and compare the output against a ground-truth transcript for your use case. If latency exceeds your budget or WER exceeds your acceptable threshold, the model size is wrong for the job.
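The document quotes latency in seconds per minute of audio, so raw wall-clock timings need normalizing by clip length before they can be compared with a budget. The following is a minimal sketch of that normalization, not part of Whisper itself: `wav_duration_seconds`, `seconds_per_audio_minute`, and `within_budget` are hypothetical helper names, and `transcribe_fn` stands for any callable that takes a file path, such as `model.transcribe`.

```python
import time
import wave
from typing import Callable


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def seconds_per_audio_minute(transcribe_fn: Callable[[str], object],
                             audio_path: str,
                             audio_seconds: float) -> float:
    """Time one call to transcribe_fn and normalize to one minute of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe_fn(audio_path)  # assumption: e.g. model.transcribe
    elapsed = time.perf_counter() - start
    return elapsed * 60.0 / audio_seconds


def within_budget(measured: float, budget: float) -> bool:
    """True when the measured value does not exceed the budget."""
    return measured <= budget
```

With a loaded model this could be called as `seconds_per_audio_minute(model.transcribe, path, wav_duration_seconds(path))`. Running it a few times and discarding the first call is a reasonable precaution, since the first inference often includes warm-up cost.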
Causes & fixes
Production audio is noisy or accented, but you chose 'tiny' or 'base' for speed: accuracy is 10-15% WER instead of required <5%
Move to 'medium' (769M parameters, about 5 GB VRAM, 5-10% WER) or 'large-v3' (1550M parameters, about 10 GB VRAM, 3-5% WER). If no GPU is available, use the OpenAI API (model whisper-1, a hosted Whisper large-v2 model), or invest in a GPU for local inference.
You chose 'large-v3' for maximum accuracy, but edge device has only 2GB RAM or latency SLA is <2 seconds per minute: inference times out or OOM crashes
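Drop to 'medium' (769M parameters, about 5 GB VRAM, close to large-v3 accuracy on clean audio) if memory allows, or to 'small' (244M parameters, about 2 GB) or 'base' (74M parameters, about 1 GB) on a 2 GB device. Test on target hardware. For near-real-time transcription, use 'base' on short audio chunks, or faster-whisper, whose transcribe() returns segments lazily as they are decoded.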
You deployed 'base' or 'small' locally without measuring accuracy first: users report poor quality on background noise, accents, or technical jargon
Benchmark WER on 10-20 representative audio samples from your actual use case before going live. If WER >5%, upgrade to 'medium' or use OpenAI API. For specialized audio (medical, legal, code), large-v3 is often required.
Inference pipeline is CPU-bound on 'large-v3' and takes 30+ seconds per minute of audio: unacceptable latency in production
Add GPU acceleration (CUDA via torch, or Metal where supported) or switch to faster-whisper (a CTranslate2 reimplementation that its README reports as up to 4x faster). For real-time use, consider the OpenAI API whisper-1 model, which runs on OpenAI's GPUs; measure end-to-end time including upload. For local-only deployments, drop to 'medium' on a GPU.
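Several of the fixes above come down to whether a model fits in memory at all. A minimal sketch of that check follows, assuming the approximate parameter counts and VRAM figures published in the openai/whisper README; `MODEL_SPECS`, `largest_model_that_fits`, and the `headroom_gb` default are hypothetical names and values chosen for illustration, and real memory use depends on precision, batch size, and runtime.

```python
from typing import Optional

# (name, parameters in millions, approximate VRAM in GB), smallest to largest.
# Rough figures taken from the openai/whisper README; treat them as guides only.
MODEL_SPECS = [
    ("tiny", 39, 1.0),
    ("base", 74, 1.0),
    ("small", 244, 2.0),
    ("medium", 769, 5.0),
    ("large-v3", 1550, 10.0),
]


def largest_model_that_fits(available_gb: float,
                            headroom_gb: float = 0.5) -> Optional[str]:
    """Return the largest model whose approximate memory need fits, or None."""
    usable = available_gb - headroom_gb  # leave room for audio buffers and the OS
    fitting = [name for name, _params, need_gb in MODEL_SPECS if need_gb <= usable]
    return fitting[-1] if fitting else None


print(largest_model_that_fits(2.0))   # a 2 GB device
print(largest_model_that_fits(12.0))  # a 12 GB GPU
```

This only answers "does it fit"; whether the fitting model is accurate and fast enough still has to be measured on real audio.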
Code: broken vs fixed
import whisper
import time
import os
# WRONG: Choosing model size without benchmarking
model_choice = 'tiny' # Assumes tiny is fast enough and accurate enough
model = whisper.load_model(model_choice)
audio_file = 'sample_call_center.wav'
start = time.time()
result = model.transcribe(audio_file)
latency = time.time() - start
print(f'Model: {model_choice}')
print(f'Transcription: {result["text"]}')
print(f'Latency: {latency:.2f}s') # Assume this is acceptable
# PROBLEM: tiny model achieves 15% WER on noisy call center audio (unacceptable for compliance)
# PROBLEM: no comparison to other sizes, no measurement against ground truth
import whisper
import time
import os
import json
from typing import Dict
# CORRECT: Benchmark all model sizes on your actual audio before committing
AUDIO_FILE = 'sample_call_center.wav' # Representative audio from your use case
GROUND_TRUTH_TRANSCRIPT = """The customer called to report billing issues.
Please escalate to the billing department."""
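def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0
    prev = list(range(len(hyp) + 1))  # distances against the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)

def benchmark_model(model_name: str, audio_file: str) -> Dict:
    """Benchmark a single model: latency, output, and WER against the ground truth."""
    model = whisper.load_model(model_name)
    start = time.time()
    result = model.transcribe(audio_file)
    latency = time.time() - start  # transcription time only; model load is excluded
    # Punctuation differences count as word errors here; normalize both texts if that matters
    wer = word_error_rate(GROUND_TRUTH_TRANSCRIPT, result['text'])
    segments = result.get('segments') or []
    return {
        'model': model_name,
        'latency_sec': round(latency, 2),
        'output': result['text'],
        'wer_approx': round(wer, 3),
        # Whisper segments have no 'confidence' field; avg_logprob is the closest signal
        'confidence': segments[0].get('avg_logprob', 'N/A') if segments else 'N/A',
    }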
# FIXED: Benchmark all candidates on YOUR audio
models_to_test = ['tiny', 'base', 'small', 'medium']
results = []
print(f'Benchmarking on: {AUDIO_FILE}')
print(f'Ground truth: {GROUND_TRUTH_TRANSCRIPT}')
print()
for model_name in models_to_test:
    print(f'Testing {model_name}...')
    result = benchmark_model(model_name, AUDIO_FILE)
    results.append(result)
    print(f' Latency: {result["latency_sec"]}s, WER: {result["wer_approx"]}, Output: {result["output"][:60]}...')
    print()
print('Summary:')
for r in results:
    print(f'{r["model"]:10} | Latency: {r["latency_sec"]:6.2f}s | WER: {r["wer_approx"]:.3f}')
print()
print('Recommendation:')
print('- If WER <0.05 AND latency <2s: choose smallest model that meets criteria')
print('- If latency >5s: use GPU (torch.cuda) or OpenAI API whisper-1')
print('- If WER >0.10: upgrade to medium or large-v3')
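# FIXED CHOICE: lowest WER among models that transcribed the sample in under 2 s;
# if none did, every WER key is 999 and the fastest model wins the tie-break
best = min(results, key=lambda x: (x['wer_approx'] if x['latency_sec'] < 2.0 else 999, x['latency_sec']))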
print(f'\nRecommended model for your use case: {best["model"]}')
model = whisper.load_model(best['model'])
final_result = model.transcribe(AUDIO_FILE)
print(f'Final output: {final_result["text"]}')
Workaround
If you cannot afford to benchmark before deploying, start with 'base' (74M parameters, a reasonable accuracy and speed tradeoff) for the initial launch. Monitor real user feedback and track WER on a small validation set. If users report poor quality on more than 1% of transcripts, upgrade to 'medium' or the OpenAI API. If latency exceeds 2 seconds per minute of audio on your hardware, drop to 'tiny', add a GPU, or use the OpenAI API for cloud-hosted inference.
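The 1% complaint threshold above is easier to act on if it is tracked over a rolling window rather than judged by eye. The following is a minimal sketch under stated assumptions: `QualityMonitor`, its window size, and the way a "complaint" is recorded are all hypothetical, and a real deployment would feed it from whatever feedback channel the product already has.

```python
from collections import deque


class QualityMonitor:
    """Track the share of recent transcripts that users flagged as poor."""

    def __init__(self, window: int = 1000, max_complaint_rate: float = 0.01):
        self._flags = deque(maxlen=window)  # oldest entries fall off automatically
        self.max_complaint_rate = max_complaint_rate

    def record(self, user_reported_problem: bool) -> None:
        self._flags.append(bool(user_reported_problem))

    @property
    def complaint_rate(self) -> float:
        return sum(self._flags) / len(self._flags) if self._flags else 0.0

    def should_upgrade(self) -> bool:
        # Wait for a full window so one early complaint does not trigger an upgrade
        window_full = len(self._flags) == self._flags.maxlen
        return window_full and self.complaint_rate > self.max_complaint_rate
```

When `should_upgrade()` returns True, that is the signal to move from 'base' to 'medium' or to the hosted API, as the workaround describes.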
Prevention
Build model selection into your pre-deployment pipeline: (1) Define your accuracy SLA (WER threshold, e.g., <5%) and latency SLA (e.g., <2s per minute). (2) Benchmark the 3-4 most plausible model sizes on 20+ representative audio samples from your actual use case. (3) Measure on your target hardware (CPU/GPU, memory). (4) Choose the smallest model meeting both SLAs. (5) Re-benchmark quarterly as your audio distribution changes. (6) For production audio with accents, noise, or jargon: bias toward 'medium' or 'large-v3' until proven otherwise by data.
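Steps (1) to (4) can be sketched as a small selection routine over benchmark results collected from many samples. This is an illustrative sketch, not a prescribed method: `SampleResult`, `SIZE_ORDER`, and `smallest_model_meeting_slas` are hypothetical names, the choice of mean WER and worst-case latency as the aggregates is an assumption, and the numbers in the usage lines are made up for demonstration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

SIZE_ORDER = ["tiny", "base", "small", "medium", "large-v3"]  # smallest first


@dataclass
class SampleResult:
    model: str
    wer: float                # word error rate on one sample, 0.0 to 1.0
    sec_per_audio_min: float  # processing seconds per minute of audio


def smallest_model_meeting_slas(results: List[SampleResult],
                                max_wer: float,
                                max_sec_per_min: float) -> Optional[str]:
    """Return the smallest benchmarked model that meets both SLAs, or None."""
    by_model: Dict[str, List[SampleResult]] = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)
    for name in SIZE_ORDER:
        samples = by_model.get(name)
        if not samples:
            continue  # this size was not benchmarked
        mean_wer = mean(s.wer for s in samples)
        worst_latency = max(s.sec_per_audio_min for s in samples)
        if mean_wer <= max_wer and worst_latency <= max_sec_per_min:
            return name
    return None


# Made-up measurements, for illustration only
demo = [
    SampleResult("tiny", 0.12, 0.5), SampleResult("tiny", 0.14, 0.6),
    SampleResult("base", 0.07, 0.9), SampleResult("base", 0.05, 1.0),
    SampleResult("small", 0.04, 1.5), SampleResult("small", 0.05, 1.8),
    SampleResult("medium", 0.03, 4.0),
]
print(smallest_model_meeting_slas(demo, max_wer=0.05, max_sec_per_min=2.0))
```

A None result means no benchmarked size satisfies both SLAs on the target hardware, which points to faster hardware, a hosted API, or relaxed SLAs rather than a different model size.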