How to route queries to cheaper models
Quick answer
Use conditional logic in your application to route queries to different models based on cost or complexity. For example, send simple queries to a cheaper model like gpt-4o-mini and complex ones to a premium model like gpt-4o by dynamically selecting the model in your API calls.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- `pip install openai>=1.0`
Setup
Install the OpenAI Python SDK and set your API key as an environment variable for secure authentication.
```shell
pip install openai
```
output
```
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (50 kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
```
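To set the API key as an environment variable, export it in your shell before running the examples (the key value below is a placeholder, not a real key):

```shell
# Set the API key for the current shell session (placeholder value)
export OPENAI_API_KEY="sk-your-key-here"

# Confirm the variable is set without printing the secret
test -n "$OPENAI_API_KEY" && echo "OPENAI_API_KEY is set"
```

Add the export line to your shell profile (e.g. `~/.bashrc`) if you want it to persist across sessions.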
Step by step
Use Python to route queries based on your criteria. This example routes short queries to gpt-4o-mini (cheaper) and longer queries to gpt-4o (premium).
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def route_query(prompt: str) -> str:
    # Route based on prompt length
    if len(prompt) < 50:
        model = "gpt-4o-mini"  # Cheaper model
    else:
        model = "gpt-4o"  # Premium model
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Example usage
print(route_query("Hello, how are you?"))
print(route_query("Explain the theory of relativity in detail."))
```
output
Hello! I'm doing well, thank you. The theory of relativity, developed by Albert Einstein, consists of two main parts: special relativity and general relativity...
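Character length is a crude proxy for complexity. A slightly richer routing heuristic might also check for keywords that usually signal a complex request; the keyword set and thresholds below are illustrative assumptions, not part of the OpenAI SDK:

```python
def choose_model(prompt: str) -> str:
    """Pick a model using simple, illustrative heuristics.

    Routes to the premium model when the prompt is long or contains
    keywords that tend to indicate a complex request.
    """
    complex_keywords = {"explain", "analyze", "compare", "derive", "prove"}
    words = prompt.lower().split()
    is_long = len(words) > 30
    is_complex = any(w.strip(".,?!") in complex_keywords for w in words)
    return "gpt-4o" if (is_long or is_complex) else "gpt-4o-mini"

print(choose_model("Hello, how are you?"))                          # gpt-4o-mini
print(choose_model("Explain the theory of relativity in detail."))  # gpt-4o
```

Swap `choose_model` into `route_query` in place of the length check; since the heuristic is pure Python, you can tune it without touching the API call.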
Common variations
You can route queries based on other factors like user subscription level, query complexity, or cost budget. The OpenAI SDK also supports async calls (via the AsyncOpenAI client) and streaming responses.
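Routing by subscription level can be as simple as a tier-to-model lookup; the tier names and mapping below are illustrative, not a prescribed scheme:

```python
# Map subscription tiers to models; tiers and mapping are illustrative.
TIER_MODELS = {
    "free": "gpt-4o-mini",
    "pro": "gpt-4o",
}

def model_for_user(tier: str) -> str:
    # Unknown tiers fall back to the cheapest model.
    return TIER_MODELS.get(tier, "gpt-4o-mini")

print(model_for_user("pro"))   # gpt-4o
print(model_for_user("free"))  # gpt-4o-mini
```

Keeping the mapping in a dict (or a config file) lets you change routing policy without redeploying code that calls the API.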
```python
import asyncio
import os

from openai import AsyncOpenAI

# The async client is required for await-based calls
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_route_query(prompt: str) -> str:
    model = "gpt-4o-mini" if len(prompt) < 50 else "gpt-4o"
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    # Accumulate streamed tokens into a single string
    result = ""
    async for chunk in stream:
        result += chunk.choices[0].delta.content or ""
    return result

# Run async example
async def main():
    short_response = await async_route_query("Short question?")
    long_response = await async_route_query("Explain quantum computing in detail.")
    print(short_response)
    print(long_response)

asyncio.run(main())
```
output
Yes, I can help with that. Quantum computing is a field of computing focused on developing computer technology based on the principles of quantum theory...
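The cost-budget variation mentioned above can be sketched as a small stateful router. The per-token prices and the rough 4-characters-per-token estimate below are illustrative assumptions, not current OpenAI pricing:

```python
# Illustrative prices in USD per 1M input tokens, not real pricing data.
PRICE_PER_1M_TOKENS = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

class BudgetRouter:
    """Route to the premium model only while a spending budget remains."""

    def __init__(self, budget_usd: float):
        self.remaining = budget_usd

    def route(self, prompt: str) -> str:
        # Rough token estimate: about 4 characters per token.
        est_tokens = max(1, len(prompt) // 4)
        cost_premium = est_tokens / 1_000_000 * PRICE_PER_1M_TOKENS["gpt-4o"]
        # Use the premium model only if it still fits the budget.
        model = "gpt-4o" if cost_premium <= self.remaining else "gpt-4o-mini"
        self.remaining -= est_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]
        return model

router = BudgetRouter(budget_usd=0.01)
print(router.route("Explain quantum computing in detail."))  # gpt-4o while budget lasts
```

In production you would track actual token usage from the API response (`response.usage`) rather than estimating from character counts.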
Troubleshooting
- If you get authentication errors, ensure your `OPENAI_API_KEY` environment variable is set correctly.
- If responses are slow, check your model choice and consider routing more queries to cheaper, faster models.
- For unexpected errors, verify you are using the latest OpenAI SDK and correct model names like `gpt-4o` and `gpt-4o-mini`.
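A quick pre-flight check catches the most common authentication problem before any API call is made; the `sk-` prefix check reflects the typical OpenAI key format and is an assumption you may need to adjust:

```python
import os
from typing import Optional

def check_api_key() -> Optional[str]:
    """Return a human-readable problem description, or None if the key looks OK."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        return "OPENAI_API_KEY is not set; export it before running."
    if not key.startswith("sk-"):
        return "OPENAI_API_KEY does not look like a typical OpenAI key."
    return None

problem = check_api_key()
print(problem or "API key looks OK.")
```

Running this at startup turns a confusing mid-request authentication error into an immediate, actionable message.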
Key Takeaways
- Use conditional logic to select cheaper or premium models dynamically based on query needs.
- Cheaper models like `gpt-4o-mini` reduce cost for simple queries without sacrificing basic quality.
- The OpenAI SDK supports both synchronous and asynchronous routing with streaming responses.
- Always verify environment variables and model names to avoid authentication or usage errors.