Azure OpenAI token limits explained
Prerequisites
- Python 3.8+
- Azure OpenAI API key
- pip install "openai>=1.0"
- Azure OpenAI endpoint and deployment name
Setup
Install the official openai Python package and set environment variables for your Azure OpenAI API key, endpoint, and deployment name.
- Set AZURE_OPENAI_API_KEY with your Azure OpenAI key.
- Set AZURE_OPENAI_ENDPOINT with your Azure OpenAI endpoint URL.
- Set AZURE_OPENAI_DEPLOYMENT with your model deployment name.
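Before making any API call, it can help to confirm that Python actually sees the three variables listed above. The following is a minimal sketch that only reads the environment; the helper name `missing_settings` is hypothetical and not part of the openai package.

```python
import os

# Variable names taken from the setup list above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All Azure OpenAI settings are present.")
```

Running the check first turns a confusing `KeyError` raised later in the script into a readable message.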
pip install "openai>=1.0"
Step by step
This example demonstrates how to call Azure OpenAI's chat completion API with token limits in mind. The max_tokens parameter controls the maximum tokens in the response, while the total tokens (prompt + completion) must stay within the model's context window.
import os

from openai import AzureOpenAI

# Credentials and endpoint come from the environment variables set during setup.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)


def chat_with_token_limit():
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],  # deployment name, not the base model name
        messages=[{"role": "user", "content": "Explain token limits in Azure OpenAI."}],
        max_tokens=500,  # Limit output tokens to avoid exceeding the context window
    )
    print("Response:", response.choices[0].message.content)


if __name__ == "__main__":
    chat_with_token_limit()

Example output (model responses vary):
Response: Azure OpenAI token limits depend on the model's context window size, which includes both prompt and completion tokens. For example, gpt-4 supports up to 8,192 tokens, while gpt-4-32k supports up to 32,768 tokens. Use the max_tokens parameter to control output length and avoid exceeding these limits.
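The rule that prompt and completion tokens must fit the context window together is simple arithmetic, sketched below. The function name `completion_budget` and the numbers are illustrative assumptions; in a real call the prompt token count is reported afterwards in `response.usage.prompt_tokens`, or it can be estimated beforehand with a tokenizer.

```python
def completion_budget(context_window, prompt_tokens, requested_max_tokens):
    """Largest max_tokens value that keeps prompt + completion in the window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt alone does not fit in the context window")
    remaining = context_window - prompt_tokens
    return min(requested_max_tokens, remaining)


# Illustrative numbers: an 8,192-token window.
print(completion_budget(8192, 7900, 500))  # → 292 (only 292 tokens remain)
print(completion_budget(8192, 1000, 500))  # → 500 (the request fits as-is)
```

Passing the returned value as max_tokens keeps the request inside the window even when the prompt grows.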
Common variations
You can adjust token limits by changing the max_tokens parameter or by switching to a model with a larger context window. For example, gpt-4 supports 8,192 tokens and gpt-4-32k supports 32,768, while newer models such as gpt-4o support 128,000. Async calls and streaming are also supported, but the same limits apply, so budget tokens carefully to avoid truncated replies. In openai>=1.0, async calls use the separate AsyncAzureOpenAI client.
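import asyncio
import os

from openai import AsyncAzureOpenAI

# openai>=1.0 has no acreate method; use the async client and await create().
async_client = AsyncAzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)


async def async_chat():
    response = await async_client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        messages=[{"role": "user", "content": "Explain token limits asynchronously."}],
        max_tokens=300,
    )
    print("Async response:", response.choices[0].message.content)


if __name__ == "__main__":
    asyncio.run(async_chat())

Example output (model responses vary):
Async response: Azure OpenAI token limits apply to both prompt and completion tokens combined. Always check your model's max context window and set max_tokens accordingly to prevent errors.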
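For streaming, passing `stream=True` to `client.chat.completions.create` yields chunks whose `choices[0].delta.content` holds a piece of text, and the last chunk carries a `finish_reason`. The sketch below joins the pieces and surfaces that reason so a reply cut off by max_tokens can be detected. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the shape of real ones so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed text pieces and report the final finish_reason."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices; skip them
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(text, finish_reason=None):
    """Hypothetical stand-in shaped like a streamed chat completion chunk."""
    delta = SimpleNamespace(content=text)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


demo = [
    SimpleNamespace(choices=[]),
    fake_chunk("Token limits "),
    fake_chunk("apply"),
    fake_chunk(None, "length"),
]
text, reason = collect_stream(demo)
print(text)    # → Token limits apply
print(reason)  # → length
```

A finish_reason of "length" means the output stopped at the max_tokens cap rather than at a natural end.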
Troubleshooting
- Error: Token limit exceeded: Reduce max_tokens or shorten the prompt messages.
- Unexpected truncation: A finish_reason of "length" means the reply hit the max_tokens cap. Raise max_tokens, and ensure total tokens (prompt + max_tokens) fit within the model's context window.
- Check deployment model: Different Azure deployments have different token limits; verify your model's specs in the Azure portal.
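One way to shorten prompt messages, as the first item suggests, is to drop the oldest turns of a conversation until the prompt fits a budget. The sketch below is an assumption-laden illustration, not a library feature: `trim_history` and `estimate_tokens` are hypothetical names, and the 4-characters-per-token estimate is only a rough rule of thumb that a real tokenizer should replace.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages, max_prompt_tokens, count=estimate_tokens):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    kept = list(messages)
    while sum(count(m["content"]) for m in kept) > max_prompt_tokens:
        # Oldest non-system message, never the most recent one.
        removable = [i for i, m in enumerate(kept[:-1]) if m["role"] != "system"]
        if not removable:
            break  # nothing left to drop; the caller must shorten the prompt
        del kept[removable[0]]
    return kept


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
print([m["role"] for m in trim_history(history, 120)])
# → ['system', 'assistant', 'user']
```

The system message and the latest message are always kept, so the model still sees its instructions and the current question.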
Key Takeaways
- Azure OpenAI token limits include both prompt and completion tokens within the model's context window.
- Use the max_tokens parameter to control output length and avoid exceeding token limits.
- Different Azure OpenAI deployments have varying token limits; verify your model's max tokens in the Azure portal.