Azure OpenAI token limits explained

Quick answer

Azure OpenAI enforces a token limit on every request, set by the context window of the deployed model. That window ranges from a few thousand tokens on older models to 128,000 tokens on gpt-4o and gpt-4o-mini. Prompt and completion tokens both count toward it, and a request that exceeds it fails with an error. Use the max_tokens parameter to cap output length and keep token usage predictable.
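As a rough illustration of that arithmetic, the sketch below works out how many completion tokens are left once the prompt has been counted. The function name `completion_budget` and the numbers in the usage lines are illustrative assumptions, not values taken from Azure; look up the real context window and output cap for your deployment.

```python
def completion_budget(context_window: int, prompt_tokens: int, max_output: int) -> int:
    """Return the largest safe max_tokens value for one request.

    context_window: total tokens the model accepts (prompt + completion).
    prompt_tokens:  tokens already used by the messages being sent.
    max_output:     the model's own cap on completion tokens.
    """
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(remaining, max_output)


# Illustrative numbers only; substitute the limits of your own deployment.
print(completion_budget(context_window=8000, prompt_tokens=7500, max_output=4000))  # 500
print(completion_budget(context_window=8000, prompt_tokens=1000, max_output=4000))  # 4000
```

The smaller of the two limits wins: a nearly full context window leaves little room for the reply, while a short prompt is still bounded by the model's output cap.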

Prerequisites

Python 3.8+
Azure OpenAI API key
pip install "openai>=1.0"
Azure OpenAI endpoint and deployment name

Setup

Install the official openai Python package and set environment variables for your Azure OpenAI API key, endpoint, and deployment name.

Set AZURE_OPENAI_API_KEY to your Azure OpenAI key.
Set AZURE_OPENAI_ENDPOINT to your Azure OpenAI endpoint URL.
Set AZURE_OPENAI_DEPLOYMENT to your model deployment name.

bash

# Quote the requirement so the shell does not treat ">" as a redirect
pip install "openai>=1.0"

Step by step

This example demonstrates how to call Azure OpenAI's chat completion API with token limits in mind. The max_tokens parameter controls the maximum tokens in the response, while the total tokens (prompt + completion) must stay within the model's context window.

python

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01"
)

def chat_with_token_limit():
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        messages=[{"role": "user", "content": "Explain token limits in Azure OpenAI."}],
        max_tokens=500  # Limit output tokens to avoid exceeding context window
    )
    print("Response:", response.choices[0].message.content)

if __name__ == "__main__":
    chat_with_token_limit()

output

Response: Azure OpenAI token limits depend on the model's context window size, which includes both prompt and completion tokens. For example, gpt-4o supports a context window of up to 128,000 tokens, while older models support far fewer. Use the max_tokens parameter to control output length and avoid exceeding these limits.
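To see how many tokens a call actually consumed, the response object reports prompt, completion, and total counts in its `usage` field, and each choice carries a `finish_reason`. The sketch below reads those fields; `summarize_usage` is a hypothetical helper name, and the stand-in object exists only so the example runs without a network call.

```python
from types import SimpleNamespace


def summarize_usage(response) -> dict:
    """Collect token counts and the stop reason from a chat completion response."""
    usage = response.usage
    choice = response.choices[0]
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        # finish_reason == "length" means the reply stopped because it hit
        # a token limit rather than ending naturally.
        "truncated": choice.finish_reason == "length",
    }


# Stand-in object with the same attribute names, so the sketch runs offline.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=500, total_tokens=512),
    choices=[SimpleNamespace(finish_reason="length")],
)
print(summarize_usage(fake_response))
```

With a real client, pass the object returned by `client.chat.completions.create(...)` instead of `fake_response`.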

Common variations

You can change the output cap by adjusting max_tokens, or switch to a deployment with a larger context window. gpt-4o and gpt-4o-mini accept up to 128,000 tokens of context, while older models such as gpt-4 come in 8k and 32k variants. Async calls and streaming are also supported; with either, check finish_reason to detect a reply that was cut off at max_tokens.

python

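import asyncio
import os
from openai import AsyncAzureOpenAI

# The async client exposes the same create() method as an awaitable;
# openai>=1.0 has no acreate().
async_client = AsyncAzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)

async def async_chat():
    response = await async_client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        messages=[{"role": "user", "content": "Explain token limits asynchronously."}],
        max_tokens=300,
    )
    print("Async response:", response.choices[0].message.content)

if __name__ == "__main__":
    asyncio.run(async_chat())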

output

Async response: Azure OpenAI token limits apply to both prompt and completion tokens combined. Always check your model's max context window and set max_tokens accordingly to prevent errors.
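For the streaming variation mentioned above, one way to catch truncation is to accumulate the text deltas and keep the last `finish_reason` seen. This is a minimal sketch, not a full streaming client: `collect_stream` and `_chunk` are hypothetical helpers, and the fake chunks only mimic the attribute layout of real streamed chunks so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed text deltas and report the final finish_reason."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        # Some chunks carry no choices (Azure can send such a chunk first),
        # so skip them instead of indexing into an empty list.
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def _chunk(text=None, finish_reason=None):
    # Hypothetical helper that mimics the shape of a streamed chunk.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)]
    )


fake_stream = [
    SimpleNamespace(choices=[]),
    _chunk("Token "),
    _chunk("limits"),
    _chunk(None, "length"),
]
text, reason = collect_stream(fake_stream)
print(text, "| truncated:", reason == "length")
```

With a real client, the iterable would come from `client.chat.completions.create(..., stream=True)`.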

Troubleshooting

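Error: Token limit exceeded: The prompt plus max_tokens is larger than the model's context window. Shorten the prompt messages or lower max_tokens.
Unexpected truncation: A reply that stops mid-sentence usually hit max_tokens (finish_reason is "length"). Raise max_tokens, keeping prompt + max_tokens within the model's context window.
Check deployment model: Different Azure deployments have different token limits; verify your model's specs in the Azure portal.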
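When long conversations trigger the token-limit error, one common remedy is to drop the oldest turns before sending. The sketch below is one possible approach under stated assumptions: `trim_history` and `estimate_tokens` are hypothetical helpers, and the four-characters-per-token estimate is a crude heuristic, so swap in a real tokenizer if you need accurate counts.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token for English text).
    # An assumption for illustration; use a real tokenizer for accuracy.
    return max(1, len(text) // 4)


def trim_history(messages, max_prompt_tokens, count=estimate_tokens):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count(m["content"]) for m in msgs)

    # Always keep the newest message so the request still asks something.
    while len(rest) > 1 and total(system + rest) > max_prompt_tokens:
        rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "What is a context window?"},
]
trimmed = trim_history(history, max_prompt_tokens=50)
print([m["role"] for m in trimmed])  # ['system', 'user']
```

System messages are kept because they usually carry instructions the model needs on every turn; only the oldest user and assistant turns are discarded.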

Key Takeaways

Azure OpenAI token limits include both prompt and completion tokens within the model's context window.
Use the max_tokens parameter to control output length and avoid exceeding token limits.
Different Azure OpenAI deployments have varying token limits; verify your model's max tokens in the Azure portal.

Verified 2026-04 · gpt-4o, gpt-4o-mini
