Azure OpenAI Cheat Sheet: API & Models Reference
    from openai import AzureOpenAI
    import os

Azure OpenAI is an OpenAI API wrapper that routes requests to Azure infrastructure instead of to OpenAI directly.
Like renting an apartment (OpenAI) vs. buying a house in your neighborhood (Azure). Same appliances, your infrastructure, your compliance controls.
Common Patterns
    from openai import AzureOpenAI
    import os

    # Basic chat completion with API-key authentication
    client = AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-01-preview",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]
    )
    response = client.chat.completions.create(
        model="gpt-4o-deployment",  # your deployment name, not the model name
        messages=[
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Explain RAG."}
        ]
    )
    print(response.choices[0].message.content)

Output:

    Retrieval-Augmented Generation (RAG) combines...

    # Streaming: print tokens as they arrive
    response = client.chat.completions.create(
        model="gpt-4o-deployment",
        messages=[{"role": "user", "content": "Write a haiku"}],
        stream=True,
        temperature=0.7
    )
    for chunk in response:
        # Azure can send chunks with an empty choices list, so check before indexing
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

Output:

    Golden leaves fall fast
    Autumn whispers in the wind
    Winter sleeps below

    # Embeddings
    response = client.embeddings.create(
        model="text-embedding-3-small",  # your deployment name
        input="The quick brown fox"
    )
    embedding = response.data[0].embedding
    print(f"Embedding dimensions: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")

Output:

    Embedding dimensions: 1536
    First 5 values: [0.0234, -0.156, 0.899, ...]

    # Keyless authentication with Azure AD (Microsoft Entra ID)
    import os
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # The SDK expects a callable that returns a bearer token string
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(),
        "https://cognitiveservices.azure.com/.default"
    )
    client = AzureOpenAI(
        api_version="2024-10-01-preview",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        azure_ad_token_provider=token_provider
    )
    response = client.chat.completions.create(
        model="gpt-4o-deployment",
        messages=[{"role": "user", "content": "Hello"}]
    )

Result: authentication succeeds via Azure AD, with no API key.

    # Vision: send an image as a base64 data URL
    import base64
    import httpx

    image_url = "https://example.com/image.jpg"
    image_data = base64.standard_b64encode(httpx.get(image_url).content).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-deployment",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}"
                        }
                    }
                ]
            }
        ]
    )
    print(response.choices[0].message.content)

Output:

    The image shows a sunset over mountains with...

Chat Completions Parameters
client.chat.completions.create()
| Parameter | Type | Default | Notes |
|---|---|---|---|
| model | str | required | Azure deployment name, not the underlying model name. Use the exact deployment name from the Azure portal. |
| messages | list[dict] | required | List of role/content dicts. Roles: 'system', 'user', 'assistant', 'tool' ('function' is deprecated). |
| temperature | float | 1.0 | 0.0-2.0. Lower = more deterministic, higher = more creative. 0.0 = near-deterministic output. |
| top_p | float | 1.0 | 0.0-1.0. Nucleus sampling. Lower = narrower token choices. |
| max_tokens | int | none | Max output tokens. If unset, output is bounded by the model's output limit and the remaining context window. |
| stream | bool | False | True = streaming tokens. False = full response at once. |
| frequency_penalty | float | 0.0 | -2.0 to 2.0. Higher = discourages repeating tokens. |
| presence_penalty | float | 0.0 | -2.0 to 2.0. Higher = encourages new topics by penalizing tokens that have already appeared. |
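The ranges in the table can be checked before a request is sent, so a bad value fails locally instead of as an API error. The following is a minimal sketch; `validate_sampling_params` and `PARAM_RANGES` are hypothetical names introduced here and are not part of the openai SDK.

```python
# Documented ranges from the parameter table: (min, max), inclusive.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}


def validate_sampling_params(**params):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration, not an SDK function.
    """
    problems = []
    for name, value in params.items():
        if name == "max_tokens":
            if not isinstance(value, int) or value < 1:
                problems.append(f"max_tokens must be a positive int, got {value!r}")
            continue
        if name in PARAM_RANGES:
            low, high = PARAM_RANGES[name]
            if not low <= value <= high:
                problems.append(f"{name}={value} is outside [{low}, {high}]")
    return problems


print(validate_sampling_params(temperature=0.7, top_p=0.9, max_tokens=500))  # []
print(validate_sampling_params(temperature=2.5, presence_penalty=-3))
```

Parameters the helper does not know about are ignored, so it can be called with the same keyword arguments that are later passed to create().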
Core API Methods
| Method / Property | Description | Returns |
|---|---|---|
| client.chat.completions.create() | Generate chat completions. Supports streaming, vision, function calling. | ChatCompletion, or Iterator[ChatCompletionChunk] if stream=True |
| client.embeddings.create() | Generate text embeddings for semantic search and RAG. | CreateEmbeddingResponse with data[0].embedding (list of floats) |
| client.completions.create() | Legacy text completion (not recommended). Use chat.completions instead. | Completion |
| client.images.generate() | Generate images from text. Requires dall-e-3 deployment. | ImagesResponse with data[0].url |
Common Errors & Fixes
AuthenticationError: Invalid credentials
Cause: Missing or invalid AZURE_OPENAI_API_KEY or AZURE_OPENAI_ENDPOINT.
Verify env vars are set:

    import os
    # Report only whether the key is set, to keep the secret out of logs
    print("AZURE_OPENAI_API_KEY set:", bool(os.environ.get("AZURE_OPENAI_API_KEY")))
    print(os.environ.get("AZURE_OPENAI_ENDPOINT"))

Or explicitly pass in:

    client = AzureOpenAI(
        api_key="your-key",
        api_version="2024-10-01-preview",
        azure_endpoint="https://your-resource.openai.azure.com/"
    )

NotFoundError: Model 'gpt-4o' not found
Cause: Using the model name instead of the deployment name, or the deployment doesn't exist.
Use your Azure deployment name:

    # WRONG: model='gpt-4o'
    # RIGHT:
    client.chat.completions.create(
        model="my-gpt4o-deployment",  # Name from Azure portal
        messages=[...]
    )

To find deployments:

    from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

RateLimitError: 429 Quota exceeded
Cause: Exceeded deployment token-per-minute (TPM) quota or request limits.
Implement exponential backoff:

    import time
    from openai import RateLimitError

    for attempt in range(5):
        try:
            response = client.chat.completions.create(...)
            break
        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            print(f'Rate limited. Waiting {wait}s')
            time.sleep(wait)

Or increase deployment quota in Azure portal > Quotas > Increase.

BadRequestError (InvalidRequestError in SDK versions before 1.0): context_length_exceeded
Cause: Input + output tokens exceed the model's context window.
Check token count before sending:

    import tiktoken
    # gpt-4o models use o200k_base; use cl100k_base for gpt-4 and gpt-3.5-turbo
    enc = tiktoken.get_encoding('o200k_base')
    # messages_text: the concatenated text of your messages
    tokens = enc.encode(messages_text)
    print(f'Token count: {len(tokens)}')

Or use max_tokens to limit output:

    response = client.chat.completions.create(
        model='gpt-4o-deployment',
        messages=[...],
        max_tokens=500  # Limit output
    )

Azure OpenAI vs OpenAI API
| Feature | Azure OpenAI | OpenAI API |
|---|---|---|
| Cost Model | Pay-per-token, regional pricing | Pay-per-token, global pricing |
| Compliance | FedRAMP, SOC 2, ISO, HIPAA-eligible | Standard SLAs, no FedRAMP |
| VPC/Network | VNet integration, private endpoints | Public API only |
| Quota Control | Per-deployment TPM limits in Azure | Account-level rate limits |
| Model Selection | Deploy specific model versions | Model aliases track the latest version; dated snapshots can be pinned |
| Authentication | API key or Azure AD managed identity | API key only |
| Data Residency | Data stays in Azure region | Data stored in OpenAI US infrastructure |
Production Gotchas
Azure rotates API versions quarterly. Function calling, vision, and tool use require specific api_version strings. Using api_version='2024-02-15-preview' on a feature that requires '2024-10-01-preview' will silently fail or return errors. Always pin to the exact version your feature needs, and test after Azure updates.
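One way to catch a stale api_version early is to compare it against the minimum your feature needs at startup. The sketch below assumes version strings follow the YYYY-MM-DD or YYYY-MM-DD-preview shape shown above; `parse_api_version`, `meets_minimum`, and `REQUIRED_API_VERSION` are hypothetical names, and the required version is simply the one quoted in the paragraph above. A later date does not guarantee feature support (a GA version can lack preview features), so treat this as a guard, not a substitute for testing.

```python
from datetime import date


def parse_api_version(version):
    """Split 'YYYY-MM-DD' or 'YYYY-MM-DD-preview' into (date, is_preview)."""
    is_preview = version.endswith("-preview")
    date_part = version[: -len("-preview")] if is_preview else version
    year, month, day = (int(piece) for piece in date_part.split("-"))
    return date(year, month, day), is_preview


def meets_minimum(configured, minimum):
    """True if the configured api_version is dated on or after the minimum."""
    return parse_api_version(configured)[0] >= parse_api_version(minimum)[0]


# Hypothetical requirement, taken from the example versions quoted in the text.
REQUIRED_API_VERSION = "2024-10-01-preview"

print(meets_minimum("2024-02-15-preview", REQUIRED_API_VERSION))  # False
print(meets_minimum("2024-10-01-preview", REQUIRED_API_VERSION))  # True
```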
Mixing up deployment names and model names is the #1 confusion. Your Azure deployment might be named 'gpt-4o-prod' but deploy the 'gpt-4o' model. When calling create(), use model='gpt-4o-prod' (deployment name), not model='gpt-4o' (model name). Swapping these causes a 404 NotFoundError.
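Keeping the model-to-deployment mapping in one place makes it harder to pass a model name by accident. This is a minimal sketch: the `DEPLOYMENTS` table, the `deployment_for` helper, the 'embeddings-prod' name, and the AZURE_OPENAI_DEPLOYMENT_<MODEL> override variable are all assumptions made for illustration, not SDK conventions.

```python
import os

# Hypothetical mapping from model name to the deployment name chosen in the Azure portal.
DEPLOYMENTS = {
    "gpt-4o": "gpt-4o-prod",
    "text-embedding-3-small": "embeddings-prod",
}


def deployment_for(model_name, env=os.environ):
    """Look up the deployment for a model, allowing an environment override.

    The override variable name is an assumption made for this sketch.
    """
    suffix = model_name.upper().replace("-", "_").replace(".", "_")
    env_key = "AZURE_OPENAI_DEPLOYMENT_" + suffix
    if env_key in env:
        return env[env_key]
    try:
        return DEPLOYMENTS[model_name]
    except KeyError:
        raise KeyError(f"No deployment configured for model {model_name!r}") from None


print(deployment_for("gpt-4o", env={}))  # gpt-4o-prod
```

The result is what goes into the model argument of create(), for example model=deployment_for("gpt-4o").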
TPM (tokens-per-minute) quotas are region-specific. East US may have 40K TPM while Central US has 10K. Deploying to a quota-limited region under production load causes 429 RateLimitError. Monitor quota utilization and scale regions accordingly.
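Scaling across regions can be sketched as an ordered fallback: try the primary region and move to the next one when it reports a rate limit. The example below is an illustration under stated assumptions, not a production router. `QuotaExceeded` stands in for openai.RateLimitError so the sketch runs without the SDK, and `call_with_regional_fallback` and the region names are hypothetical.

```python
class QuotaExceeded(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs without the SDK."""


def call_with_regional_fallback(regional_calls, rate_limit_error=QuotaExceeded):
    """Try each (region, callable) pair in order; move on when one is rate limited.

    Returns (region, result) from the first region that succeeds.
    Raises the last rate-limit error if every region is out of quota.
    """
    last_error = None
    for region, call in regional_calls:
        try:
            return region, call()
        except rate_limit_error as error:
            last_error = error  # this region is out of quota; try the next one
    if last_error is None:
        raise ValueError("regional_calls is empty")
    raise last_error


def east_us():
    # Simulates a region that has exhausted its TPM quota
    raise QuotaExceeded("429 from East US")


def central_us():
    return "ok"


print(call_with_regional_fallback([("eastus", east_us), ("centralus", central_us)]))
# ('centralus', 'ok')
```

With the real SDK, each callable would wrap a create() call on a client built for that region's endpoint, and rate_limit_error would be openai.RateLimitError.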
When stream=True, chunks don't include usage (prompt_tokens, completion_tokens) by default. If you need token counts, request them with stream_options={"include_usage": True} if your api_version supports it, count tokens locally with tiktoken, or disable streaming. This matters for cost tracking.
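For cost tracking with streaming, one option is to accumulate the text and pick up a usage object if the stream delivers one. The sketch below assumes the request was made with stream_options={"include_usage": True}, which the openai SDK accepts; whether your Azure api_version honors it is an assumption to verify. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the shape of ChatCompletionChunk objects.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed text and capture usage if any chunk carries it.

    Returns (text, usage); usage is None when the stream never reported it.
    """
    parts = []
    usage = None
    for chunk in chunks:
        # A usage-only chunk has an empty choices list, so read usage first.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts), usage


def fake_chunk(text=None, usage=None):
    """Build an object shaped like a ChatCompletionChunk, for demonstration only."""
    choices = [] if text is None else [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


stream = [
    fake_chunk(),  # chunk with no choices and no usage
    fake_chunk("Hello"),
    fake_chunk(", world"),
    fake_chunk(usage=SimpleNamespace(prompt_tokens=9, completion_tokens=3)),
]
text, usage = collect_stream(stream)
print(text)                     # Hello, world
print(usage.completion_tokens)  # 3
```

If usage comes back as None, the stream did not report it, and a local count (for example with tiktoken) is the fallback.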
gpt-4-vision is only available in select regions (East US, West Europe, etc.). Deploying in an unsupported region fails silently. Verify regional availability in Azure docs before deploying vision workloads.