Structured Course

Model Selection

From first install to production patterns. Every lesson is standalone — jump to what you need, or work through from beginner to advanced.

147 lessons · 3 levels · Beginner → Advanced

Beginner

49 lessons · 7 chapters

Why Model Selection Matters 7

The Model Landscape 2026 7

Evaluation Dimensions 7

+4 more chapters

Intermediate

49 lessons · 7 chapters

Reasoning Models Selection 7

Open Source Model Selection 7

Vision Model Selection 7

+4 more chapters

Advanced

49 lessons · 7 chapters

Capability Assessment Framework 7

Cost and Latency Matrix 7

Open Source vs Proprietary Decision 7

+4 more chapters

Full Course Contents

Beginner

49 lessons

1 Why Model Selection Matters 7

No single best model for all tasks The model that wins on benchmarks often fails in production because the benchmark doesn't match your domain, data, latency requirements, or regulatory constraints.

Cost, quality, and latency triangle Every production model decision is a forced tradeoff between three constraints that pull against each other: you cannot optimize all three simultaneously.

Feature availability differences The model you want may not support the inference framework, deployment region, or cost model your business needs.

Vendor lock-in considerations Choosing a vendor's proprietary model now can cost you 2-3x more and 6+ months of rework later when you need to switch.

The model landscape: 2026 The model you choose is determined by your constraints (cost, latency, regulation, and data access), not by which model is "best."

Selection as an Ongoing Process Your first model choice is never your last: model selection is a continuous cycle driven by data drift, business changes, and new competitive models, not a one-time decision.

The cost of wrong model choice Choosing the wrong model early locks you into months of wasted compute, compliance rework, and architectural debt that no amount of fine-tuning will fix.

2 The Model Landscape 2026 7

OpenAI: GPT-4.1, GPT-4.1-mini, o1, o3 Model selection isn't about picking the most powerful option: it's about matching inference cost, latency budget, and reasoning depth to your specific problem.

Anthropic: Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 Claude models differ by reasoning depth and speed, not just cost: choose based on whether your task needs extended thinking or real-time response.

Google: Gemini 2.5 Pro, Flash, Flash-Lite Gemini's three-tier lineup trades cost and latency against reasoning depth: choose based on whether you need thinking or throughput.

Meta: Llama 4 Scout, Maverick, Llama 3.3 70B Llama's open-weight models trade proprietary model moats for deployment flexibility and cost predictability: a fundamentally different business model that changes where you can run inference and who controls your data.

Mistral: Mistral Large, Small, open source Mistral offers a middle ground between proprietary models and pure open-source: you must choose based on deployment constraints (cloud vs. on-prem), cost sensitivity, and latency requirements, not just capability.

DeepSeek: R1 reasoning, V3 efficiency DeepSeek R1 targets complex reasoning tasks at a lower price than proprietary reasoning models; V3 prioritizes speed: choose based on latency tolerance, not just capability.

Specialized models: code, vision, audio, embedding Different domains require fundamentally different model architectures: picking the wrong one wastes months of engineering and budget.

3 Evaluation Dimensions 7

Quality: benchmark scores and real tests A model that scores 95% on a benchmark can fail catastrophically in production because benchmarks measure the wrong thing.

Cost: per million tokens input and output Token pricing directly determines whether your AI system is economically viable, and input/output price asymmetry means your cost model breaks if you ignore it.

Latency: time to first token Time to first token (TTFT) determines whether your AI product feels interactive or broken, and it is largely fixed by your model choice before you write a single line of code.

Context window size Context window size determines what information your model can see at once: pick wrong and you either burn money or miss critical data.

Feature set: tools, vision, structured output Model capability selection is not about picking the smartest AI: it's about matching model features to your domain's data format, compliance constraints, and operational reality.

Rate limits and availability The model you choose is only useful if you can call it at the scale and frequency your application demands.

Data Privacy and Compliance Your model choice is legally locked before you write any code: compliance requirements eliminate 70% of vendor options before technical evaluation begins.
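
As a rough illustration of the input/output pricing asymmetry mentioned in the cost lesson above, the sketch below computes per-request cost from per-million-token prices. The function name and the $2 / $8 prices are hypothetical placeholders, not any vendor's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
summarize = request_cost(10_000, 500, 2.0, 8.0)   # long input, short output
generate = request_cost(500, 10_000, 2.0, 8.0)    # short input, long output

print(f"summarize: ${summarize:.3f}  generate: ${generate:.3f}")
```

Under these assumed prices, the same 10,500 tokens cost more than three times as much when the output side dominates, which is why a cost model based on total tokens alone can mislead.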

4 Task-Model Matching 7

Coding: GPT-4.1, Claude Sonnet, Gemini 2.5 Pro The model you pick determines your cost, latency, reasoning quality, and vendor lock-in risk: choose based on your actual workload, not hype.

Reasoning and math: o3, DeepSeek R1, QwQ Reasoning models solve math, code, and logic problems that standard language models fail on, but they're slower and more expensive: use them only when you actually need step-by-step reasoning.

Creative writing: Claude Opus, GPT-4.1 Claude Opus excels at long-form narrative consistency; GPT-4.1 excels at stylistic variety: choose based on whether your application demands coherence or creative range.

Document analysis: long context models Long context models let you process entire contracts, medical records, or regulatory filings at once, but the cost and latency trade-offs depend heavily on your document type and compliance requirements.

Classification: small models (Haiku, GPT-4o-mini) Small models handle 80% of classification tasks at 1/10th the cost and latency, but you must understand their actual boundaries before choosing them.

Vision tasks: GPT-4.1, Gemini 2.5 Pro, Llama 4 Vision models have moved from research to production, but model choice depends on image resolution, latency budget, and whether you need reasoning or just classification.

Structured extraction: top performers Structured extraction is where AI proves immediate ROI in regulated industries, but only certain models handle the compliance and reliability constraints required.

5 Cost Optimization Strategy 7

Routing: different models per task Route different tasks to different models based on cost, latency, and domain constraints: not every task needs GPT-4.1, and not every domain allows closed-source APIs.

Small models for simple tasks Small models (3B–7B parameters) solve 70% of business problems at 10% of the cost and latency of frontier models, but only if the task is genuinely simple.

Large models for complex tasks Large language models solve genuinely hard problems (document classification, contract analysis, clinical reasoning), but they cost 10–100x more per token than small models, so the business case requires either high-value outputs, regulatory requirements, or both.

Caching identical requests Caching prevents redundant API calls to the same model for identical inputs, cutting costs by 40–70% and latency by 80% in production systems.

Batch API discounts Batch APIs offer 50% cost savings for non-real-time workloads, but require fundamental changes to your architecture and latency expectations.

Open source for high-volume Open source models at scale require infrastructure investment upfront but eliminate per-token costs that become catastrophic at high volume.

Total cost of ownership The cheapest model on API pricing is almost never the cheapest model in production.
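
The "caching identical requests" lesson above can be sketched as an in-memory cache keyed on a hash of the model name and prompt. This is a minimal sketch, not a production design: `ResponseCache` and `fake_model` are hypothetical names, the model call is a stub, and it assumes deterministic settings where returning a stored response is acceptable.

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """Caches responses for identical (model, prompt) pairs."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model      # hypothetical provider call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Serialize deterministically so identical requests hash identically.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt)
        self._store[key] = result
        return result


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call."""
    return f"[{model}] {prompt.upper()}"


cache = ResponseCache(fake_model)
first = cache.complete("small-model", "hello")
second = cache.complete("small-model", "hello")   # served from the cache
```

The second call never reaches the stub, so `cache.hits` is 1 and `cache.misses` is 1. A real deployment would also need expiry and a shared store, which are omitted here.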

6 Evaluating Models for Your Use Case 7

Benchmark vs real-task evaluation A model that scores 95% on a benchmark can fail catastrophically on your actual data, and you won't know until production hits.

Creating a Domain Test Set A domain test set is not a random sample of your data: it's a deliberate snapshot of the real-world conditions your model will face, built with your domain experts, not your data scientists alone.

Blind A/B evaluation Blind A/B evaluation keeps evaluator bias out of model comparisons: neither evaluators nor data scientists know which model produced which output.

LLM-as-judge comparison LLM-as-judge (using an LLM to score outputs from another LLM) works well for preference rankings but fails catastrophically for objective correctness in regulated domains.

Human preference testing Human preference testing is how you validate that a model actually produces outputs humans want before you deploy it to real users.

Statistical Significance A model metric that looks good in isolation is worthless if you can't prove the improvement wasn't random luck.

Continuous re-evaluation Models degrade in production faster than you expect: you need systematic monitoring and governance, not just deployment.

7 Vendor Risk and Lock-in 7

API Changes and Deprecations in Model Selection Selecting an AI model based only on current API availability guarantees technical debt in production: you must architect for API instability as your baseline assumption.

Price changes over time Model pricing changes continuously across vendors, APIs, and deployment modes: selecting a model locks you into cost assumptions that may not survive production.

Model updates changing behavior Model providers push updates that silently change inference behavior: you must version-lock models in production and test before upgrading.

Multi-provider strategy No single AI vendor should own your production system: architect for portability and redundancy from day one.

LiteLLM for abstraction LiteLLM is a vendor abstraction layer that lets you swap between competing LLM APIs without rewriting application code: critical when your preferred model is unavailable, too expensive, or blocked by compliance.

OpenAI-compatible APIs OpenAI-compatible APIs let you swap models without rewriting code, but you still have to choose which model to use, and that choice determines cost, latency, and reliability in ways the API abstraction hides.

Fallback configuration Every production model selection decision requires a documented fallback: what happens when your primary model fails, is unavailable, or produces unreliable output.
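
A documented fallback, as described in the lesson above, can be as simple as an ordered list of providers tried in turn. The sketch below uses stub functions in place of real API clients; `complete_with_fallback`, `flaky_primary`, and `steady_secondary` are hypothetical names, and a real system would catch narrower exception types and add timeouts and retries.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str,
                           providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Stub that simulates an unavailable primary model."""
    raise TimeoutError("primary timed out")


def steady_secondary(prompt: str) -> str:
    """Stub that simulates a working fallback model."""
    return f"echo: {prompt}"


used, text = complete_with_fallback(
    "hi", [("primary", flaky_primary), ("secondary", steady_secondary)]
)
```

Returning the provider name alongside the response makes it possible to log how often the fallback path is actually taken.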

Intermediate

49 lessons

1 Reasoning Models Selection 7

When reasoning models are worth the cost Reasoning models (o3, Claude Opus extended thinking) cost 10–40x more per token but solve specific high-stakes problems where traditional models fail, and the domain determines whether that ROI exists.

o1 vs o3: capability and cost comparison o1 and o3 differ in both price and reasoning capability, and list prices have changed since release; choose based on task complexity and current pricing, not brand loyalty.

DeepSeek R1: open source reasoning DeepSeek R1 shifts reasoning workloads from proprietary inference APIs to self-hosted open models, reducing vendor lock-in and enabling compliance-sensitive deployments, but reasoning tokens cost 4-6x standard inference.

QwQ-32B: local reasoning option QwQ-32B enables on-premise reasoning workflows for regulated industries where API calls create compliance friction, but only if your infrastructure can handle 32GB memory and latency isn't a constraint.

Gemini 2.5 Pro thinking mode Gemini 2.5 Pro's thinking mode trades latency for reasoning depth: understand when that tradeoff wins in your domain and when it costs you.

Reasoning models for math and code Reasoning models (o3, Claude Opus extended thinking) solve symbolic problems that standard models fail on, but cost 10–100x more and require architectural redesign around latency.

When standard models beat reasoning models Reasoning models (o3, extended thinking) cost 5–50x more per token and add 10–60 second latencies: standard models win in most production systems when speed, cost, or user experience matter more than perfect reasoning.

2 Open Source Model Selection 7

Llama 4 Scout: MoE, 10M context Llama 4 Scout's mixture-of-experts architecture trades inference speed for cost efficiency, but 10M context windows demand careful orchestration of memory, latency, and regulatory compliance in production.

Llama 3.3 70B: production quality open source Llama 3.3 70B is the first open-source model that matches closed-source performance at enterprise scale, eliminating vendor lock-in as a business risk, but only if you deploy it on infrastructure you control.

Qwen3: multilingual and coding Qwen3 is the only open-weight model with genuine parity on non-English code and documentation, making it the strategic choice for global engineering teams, but only if you can run it yourself or negotiate vendor pricing.

Mistral open source family Mistral's open models let you control inference costs and data residency, but you inherit deployment complexity that closed APIs hide from you.

DeepSeek V3: efficiency leader DeepSeek V3 achieves GPT-4-class reasoning at 1/10th the inference cost, forcing architects to rethink the economics of model selection, but latency and vendor lock-in create new tradeoffs.

Gemma 3: Google open weight Gemma 3 is production-grade for cost-sensitive inference, but requires your own infrastructure, so you avoid vendor lock-in while accepting the operational burden.

Choosing open source by use case Open source model selection depends on regulatory requirements, inference latency, cost constraints, and whether you can run inference on your own infrastructure, not just model benchmarks.

3 Vision Model Selection 7

GPT-4.1 vision capabilities GPT-4.1 vision solves document intelligence and visual inspection at enterprise scale, but fails on real-time video, medical imaging diagnostics, and tasks requiring spatial reasoning beyond 2D layout.

Gemini 2.5 Pro: When Multimodal Inference Changes Architecture Gemini 2.5 Pro's native video understanding and document processing capabilities eliminate entire pipeline stages, but only if you architect for concurrent multimodal input, not sequential image extraction.

Claude Sonnet Vision: When to Use Multi-Modal Analysis in Production Claude Sonnet's vision capability is production-ready for document analysis and compliance workflows, but requires careful integration planning around latency, cost per image, and human review checkpoints that teams consistently underestimate.

Llama 4 native multimodal Llama 4's native multimodal capabilities eliminate the need for separate vision encoders, but production deployment requires careful consideration of cost, latency, and token efficiency trade-offs.

Document Understanding: Comparing OCR, LLM Extraction, and Specialized Models Document understanding requires choosing between three fundamentally different approaches, each with hard limits in accuracy, cost, and compliance that no amount of engineering can overcome.

Chart and graph analysis Chart and graph analysis requires different model architectures than text or images alone, and the compliance burden varies dramatically by industry.

Vision model cost comparison Vision model costs vary 1000x by vendor and deployment method: pick wrong and burn your budget before you prove value.

4 Embedding Model Selection 7

text-embedding-3-large vs small: When to choose each model text-embedding-3-large handles semantic complexity and rare use cases; text-embedding-3-small is production-grade for 99% of retrieval systems and costs 5x less.

Cohere embed-v3: When to Use It Over OpenAI & Alternatives Cohere embed-v3 is the right choice when you need multilingual semantic search at scale without vendor lock-in to OpenAI, but only if your infrastructure can handle non-US data residency requirements.

Voyage AI embeddings Voyage AI embeddings are purpose-built for semantic search and RAG in enterprise contexts, but vendor lock-in and cost-per-token trade-offs require deliberate architecture decisions.

BGE-M3: When to Use Open Source Embeddings Instead of Proprietary APIs BGE-M3 gives you production-grade multilingual embeddings you can self-host, eliminating API costs and data residency concerns but requiring infrastructure ownership.

Multilingual embedding requirements Multilingual embeddings require architectural decisions about tokenization, alignment, and vector space quality that depend on your language pairs and compliance context: not all embedding models are equal across languages.

Embedding dimension vs quality Higher embedding dimensions improve semantic fidelity but multiply inference cost, latency, and storage, a tradeoff you must get right before scaling to production.

Embedding model benchmarks: MTEB MTEB scores are necessary but insufficient: your embedding model must match your retrieval architecture and query distribution, not just benchmark leaderboards.

5 Building a Model Router 7

Intent-based routing implementation Intent-based routing is how you route incoming requests to the right specialized model, but the routing decision itself is often more expensive and fragile than the downstream model.

Complexity-based routing Route requests to different models based on input complexity to optimize cost and latency without sacrificing quality.

Cost-based routing logic Route inference requests to different models based on cost-per-token and latency requirements, not model capability alone; it can be the difference between a $500/month and a $50,000/month LLM bill.

A/B testing routing decisions A/B testing model routing requires statistical rigor and careful traffic allocation, not just picking the winner after 100 requests.

Fallback chain configuration A fallback chain isn't optional insurance: it's the difference between a pilot that scales and a production incident that kills user trust.

Monitoring routing decisions You cannot optimize what you don't measure: routing decisions require instrumentation before they become intelligent.

Router evaluation methodology Routing isn't about picking the best model: it's about matching request complexity to cost and latency constraints in production, which requires benchmarking against your actual traffic distribution, not synthetic tests.
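
To make the complexity-based routing lesson above concrete, here is a deliberately crude sketch that scores a prompt by length and keywords and maps the score to a model tier. The heuristic, the thresholds, and the tier labels (`small-model`, `mid-model`, `reasoning-model`) are all illustrative assumptions; a real router would be tuned and evaluated against actual traffic.

```python
# Hypothetical keywords that hint at multi-step reasoning.
REASONING_HINTS = ("prove", "step by step", "debug", "derive")

# Hypothetical tier labels, not real model identifiers.
ROUTES = {0: "small-model", 1: "mid-model", 2: "reasoning-model"}


def estimate_complexity(prompt: str) -> int:
    """Return a score from 0 to 2 based on length and keyword hints."""
    score = 0
    if len(prompt.split()) > 200:          # long inputs score higher
        score += 1
    lowered = prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        score += 1
    return score


def route(prompt: str) -> str:
    """Pick a model tier for the prompt."""
    return ROUTES[estimate_complexity(prompt)]


simple = route("Classify this ticket as billing or technical.")
medium = route("Prove that the sum of two even numbers is even.")
hard = route(" ".join(["word"] * 250) + " debug this")
```

Substring matching like this misfires easily (for example, "improve" contains "prove"), which is one reason the routing step itself needs monitoring and evaluation.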

6 Advanced Selection Criteria 7

Fine-tuning availability comparison Not all production models support fine-tuning, and the ones that do have wildly different compliance, cost, and latency profiles: understanding which model families support it is a prerequisite to architecture decisions, not an afterthought.

Batch API availability by provider Batch APIs are the production workhorse for cost-sensitive AI workloads, but availability and latency guarantees vary drastically by provider and use case.

Streaming reliability comparison Streaming LLM outputs are faster to perceive but architecturally fragile: you must choose between user experience and guaranteed consistency based on your domain's tolerance for partial failures.

Rate limit scalability Selecting a model means selecting a rate limit ceiling, and that ceiling determines your entire infrastructure and cost model.

SLA and uptime guarantees Your model's availability SLA must match or exceed the SLA of the system it serves, and almost no AI vendor guarantees what your business actually needs.

Audit logging features Audit logging isn't optional infrastructure: it's the enforceability mechanism that determines whether your model deployment survives regulatory scrutiny.

Compliance certifications Your model choice is not just technical: it is a regulatory decision that determines which industries you can serve and how long deployment takes.

7 Evaluation Methodology 7

Golden test set construction A golden test set is your unflinching record of ground truth, built before model selection, owned by domain experts, and never touched by training pipelines.

Task-specific metrics selection Choosing the wrong metric for your task will silently ship a model that optimizes for the wrong outcome, and your business won't know until it's in production.

Human preference evaluation Preference data is expensive to collect and easily biased: you must architect evaluation systems that surface whose preferences are actually being measured before your model learns them.

LLM-as-judge Setup LLM-as-judge systems require architectural safeguards, oracle selection, and human oversight chains that differ fundamentally from standard inference workloads.

Statistical Significance in Production Model Selection A model with 94% accuracy on your test set might be statistically indistinguishable from a 91% baseline, and shipping it anyway is how you waste millions.

Production traffic evaluation Evaluating model performance on live traffic is fundamentally different from test sets: you must measure what users actually experience, not what your validation data promised.

Continuous comparison framework You cannot pick the best model once: you must compare continuously across production, and your framework must survive model drift, data distribution shift, and regulatory audits.
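
The 94% versus 91% comparison in the statistical significance lesson above can be checked with a pooled two-proportion z-test. This is a minimal sketch using only the standard library. The test-set sizes (200 and 2,000 examples) are assumptions chosen for illustration, and the test assumes the two models were scored on independent samples. When both models are scored on the same examples, a paired test such as McNemar's is usually more appropriate.

```python
import math


def two_proportion_p_value(correct_a: int, n_a: int,
                           correct_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two accuracy rates."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# 94% vs 91% accuracy on 200 examples each (assumed sizes).
small = two_proportion_p_value(188, 200, 182, 200)
# The same accuracies on 2,000 examples each.
large = two_proportion_p_value(1880, 2000, 1820, 2000)

print(f"n=200: p={small:.3f}  n=2000: p={large:.5f}")
```

With 200 examples per model the difference is not significant at the 0.05 level, while with 2,000 it is, so the same three-point gap can mean very different things depending on test-set size.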

Advanced

49 lessons

1 Capability Assessment Framework 7

Task specification: what exactly Most model selection failures happen before you ever touch code: they start with stakeholders and engineers using different definitions of 'success.'

Benchmark-informed selection Benchmarks alone don't predict production performance: you must benchmark against your actual data distribution, latency requirements, and cost constraints, not leaderboard results.

Real task evaluation: not just benchmarks A model's benchmark score means nothing until you measure it against your actual production task, latency budget, and regulatory constraints, and that measurement is harder than the model selection itself.

Qualitative evaluation methodology Qualitative evaluation is not about gut feel: it's a structured methodology for assessing whether a model solves the actual business problem before you measure accuracy.

Side-by-side comparison: How domain compliance reshapes model choice The model that works best technically often cannot be deployed legally: compliance constraints eliminate options before benchmarks even matter.

Context window requirements Context window size is not a feature: it's an architectural constraint that determines whether your entire system works or fails in production.

Multimodal requirements Multimodal models solve a specific architectural problem, but only if your data pipeline, latency budget, and compliance boundary can support them.

2 Cost and Latency Matrix 7

Price comparison: input and output tokens 2026 Token pricing asymmetry and per-request minimums now dominate model selection more than raw capability, and you must account for this in your cost model before committing to a vendor.

Latency comparison by task type Different task types have fundamentally different latency budgets, and choosing the wrong model for your task type burns money and kills user experience faster than any other selection mistake.

Throughput requirements Throughput determines model choice more than accuracy: most engineers optimize the wrong metric and select models that fail under production load.

Cost at scale calculation The model that wins in a lab costs 10–100x more at production scale: you must calculate true per-inference cost before committing to architecture.

Latency SLA requirements Your model choice is determined by latency SLA first, accuracy second: violate SLA once in production and your system fails regardless of F1 score.

Cost-quality tradeoff curves The model you can afford is rarely the model that performs best, and mapping that tradeoff requires understanding your domain's cost structure, latency constraints, and where human review is mandatory.

Total cost of ownership model The model you choose is not the largest cost; infrastructure, compliance, and human review are, and they scale non-linearly.

3 Open Source vs Proprietary Decision 7

When open source wins Open source models beat proprietary APIs when you need latency guarantees, cost predictability at scale, or legal certainty over data residency, not because they're cheaper upfront, but because they're the only option that fits the constraint.

When proprietary is required Proprietary models aren't a luxury choice: they're a compliance and liability requirement in regulated industries, and choosing open-source when you need proprietary creates legal and operational risk that no engineering excellence can fix.

Self-hosted cost analysis Self-hosting only makes financial sense if your inference volume is predictable, your latency SLA is sub-100ms, and you've modeled the true cost of ops ownership.

Inference infrastructure for OSS Open-source model inference requires fundamentally different infrastructure decisions than closed-API models: compliance, cost, and latency trade-offs are not optional.

Fine-tuning OSS vs proprietary Fine-tuning proprietary models locks you into vendor ecosystems and compliance frameworks; OSS gives control but saddles you with infrastructure, security, and regulatory certification responsibility.

Data privacy advantage of OSS Running open-source models on your own infrastructure eliminates data residency violations and gives you legal control that closed APIs never can, but only if you architect it correctly.

Support and Community: The Hidden Cost of Model Selection Your model choice locks you into a vendor's support ecosystem, and that ecosystem becomes your technical ceiling when things break in production.
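
For the self-hosted cost analysis lesson above, a first-pass break-even estimate compares a fixed monthly infrastructure cost against per-token API pricing. The sketch below is a simplification under stated assumptions: the $6,000 monthly cost and $3 per million tokens are made-up figures, the function names are hypothetical, and ops staffing, utilization, and compliance costs are left out.

```python
def break_even_tokens_per_month(monthly_infra_cost: float,
                                api_price_per_m: float) -> float:
    """Monthly token volume at which self-hosting matches API spend."""
    return monthly_infra_cost / api_price_per_m * 1_000_000


def cheaper_option(tokens_per_month: float, monthly_infra_cost: float,
                   api_price_per_m: float) -> str:
    """Return which option costs less at the given monthly volume."""
    api_cost = tokens_per_month / 1_000_000 * api_price_per_m
    return "self-hosted" if monthly_infra_cost < api_cost else "api"


# Assumed figures: $6,000/month infrastructure, $3 per million tokens via API.
threshold = break_even_tokens_per_month(6_000, 3.0)
low_volume = cheaper_option(500_000_000, 6_000, 3.0)
high_volume = cheaper_option(5_000_000_000, 6_000, 3.0)
```

Under these assumptions the threshold is two billion tokens per month. Below it the API is cheaper, above it self-hosting is, before the omitted costs are counted.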

4 Enterprise Model Governance 7

Approved Model Registry In regulated industries, you don't pick models: your compliance and procurement teams do, and the registry is the contract that binds all of you.

Model risk assessment framework Model risk assessment is not a compliance checkbox: it's the framework that determines whether your model can be deployed, who approves it, and when it can fail without destroying the organization.

Legal review for model terms Model selection is constrained by vendor terms-of-service, not just capability, and your legal team must review before architecture decisions are locked in.

Data Processing Agreements A Data Processing Agreement (DPA) is the legal contract that determines whether your model can legally touch customer data at all, and which vendor you can use to run it.

Compliance mapping per model Not every model can legally or technically run in every industry: compliance constraints eliminate choices before performance benchmarks matter.

Model change management process In regulated industries, deploying a new model is a change control event, not a software deployment, and the approval gate happens before engineering, not after.

Vendor assessment criteria Vendor selection is not about model performance: it's about compliance, data residency, audit trails, and whether they'll still exist in 18 months.

5 Multi-Provider Architecture 7

API abstraction with LiteLLM LiteLLM solves the vendor lock-in problem that kills production AI systems, but only if you abstract at the right architectural layer.

Provider-agnostic prompting Build prompt abstractions that swap LLM providers without rewriting your application logic: critical when vendor lock-in risks breaching your SLA or compliance posture.

Migration complexity assessment Migrating from legacy rule-based systems to ML models is not a technical problem: it's an organizational and compliance problem that engineers almost always underestimate.

Monitoring for better models Model monitoring isn't telemetry: it's the feedback loop that determines whether your production model is still solving the business problem it was trained for.

Quarterly review cadence Model drift and regulatory exposure compound monthly; quarterly reviews are the minimum governance cadence to avoid catastrophic failure and compliance violations.

Contractual protection clauses Your model selection must satisfy vendor liability, indemnification, and data handling clauses before deployment, or you inherit unlimited legal exposure.

Building provider-agnostic systems Designing systems that swap AI providers without rewriting inference logic saves you from single-vendor lock-in when regulations, costs, or capabilities shift.

6 Selection for Specific Industries 7

Healthcare: HIPAA, accuracy requirements In healthcare, model accuracy is not a performance metric: it's a liability surface that HIPAA, FDA oversight, and malpractice law make you personally responsible for.

Finance: compliance, explainability In regulated finance, model selection is constrained by explainability requirements and regulatory approval timelines, not just accuracy.

Legal: citation accuracy, hallucination risk LLMs hallucinate case citations and statutory references with confidence: this creates malpractice liability that no architecture pattern fully eliminates.

Code generation: benchmark-driven selection Code generation models must be evaluated on your codebase's actual patterns and compliance boundaries, not generic benchmarks, and this requires building a custom evaluation framework before choosing vendors.

Customer support: cost and latency focus In customer support, the model you choose is determined by your SLA (response time) and cost-per-interaction, not by accuracy alone, and those constraints eliminate most frontier models.

Creative: quality over cost In creative domains (copywriting, design, strategy), model quality directly impacts brand value and client retention: cost optimization often destroys the output you're trying to monetize.

Research: frontier model access Frontier model access requires vendor contracts, SLA negotiation, and inference cost modeling before you can even evaluate whether a cutting-edge model is actually the right choice for your problem.

7 Future-Proofing Model Selection 7

Tracking model release cadence Model release cadence is a business, legal, and technical constraint that determines which foundation models you can deploy in production, and when you can update them without breaking compliance.

Evaluating new models systematically Model evaluation isn't about benchmark scores: it's about measuring performance on your actual data distribution under your actual constraints, with explicit governance for when to switch.

Migration playbook for model upgrades Model migrations are operational events with regulatory, financial, and reputational consequences: they require staged rollouts, baseline metrics, and explicit sign-off from non-technical stakeholders before touching production.

Avoiding over-optimization for one model Optimizing a model for one vendor's infrastructure or API creates technical debt that resurfaces when models change, regulations tighten, or costs spike.

Building abstraction layers Abstraction layers isolate business logic from model volatility, turning model selection from a technical dead-end into a runtime decision.

Community intelligence: following benchmarks Public benchmarks are a starting point, not a destination: your domain constraints will disqualify 80% of top-ranked models.

Long-term provider relationship strategy Your model provider choice locks you into their roadmap, pricing, and compliance posture for 3–5 years; choose based on regulatory trajectory and contractual escape routes, not current API performance.