Cheat Sheet intermediate · 8 min read

Azure OpenAI Cheat Sheet: API & Models Reference

version 2024-12

Deploy OpenAI models on Azure infrastructure

AZURE_OPENAI_API_KEY · AZURE_OPENAI_ENDPOINT · AZURE_OPENAI_API_VERSION

install pip install openai azure-identity

core imports

python

from openai import AzureOpenAI
import os

Mental model

OpenAI API wrapper that routes requests to Azure infrastructure instead of OpenAI directly.

Like renting an apartment (OpenAI) versus owning a house in your own neighborhood (Azure): the same appliances, but on your infrastructure and under your compliance controls.

Common Patterns

01 Basic Authentication & Chat

Simple chat completions with Azure deployment

python

from openai import AzureOpenAI
import os

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]
)

response = client.chat.completions.create(
    model="gpt-4o",  # deployment name, not model name
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Explain RAG."}
    ]
)

print(response.choices[0].message.content)

output Retrieval-Augmented Generation (RAG) combines...

Model parameter must match your Azure deployment name, NOT the actual model name. If you deployed 'gpt-4o' as 'my-gpt4o-deployment', use model='my-gpt4o-deployment'.
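One way to see why the deployment name matters is to look at the REST path the request ends up on: the deployment name is part of the URL. The sketch below builds that URL with the standard library only. The resource name, deployment name, and the helper `chat_completions_url` are illustrative assumptions, not part of any SDK.

```python
from urllib.parse import quote, urlencode


def chat_completions_url(endpoint: str, deployment: str, api_version: str) -> str:
    """Build the REST URL a chat completion request is sent to (hypothetical helper)."""
    base = endpoint.rstrip("/")
    # The deployment name, not the model name, appears in the path.
    path = f"/openai/deployments/{quote(deployment, safe='')}/chat/completions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


url = chat_completions_url(
    "https://your-resource.openai.azure.com/",  # placeholder resource
    "my-gpt4o-deployment",                      # placeholder deployment name
    "2024-10-01-preview",
)
print(url)
```

If the deployment segment does not match a deployment on the resource, the service has nothing to route to, which is why the mistake tends to surface as a 404.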

02 Streaming Responses

Real-time token streaming for chat

python

response = client.chat.completions.create(
    model="gpt-4o-deployment",
    messages=[{"role": "user", "content": "Write a haiku"}],
    stream=True,
    temperature=0.7
)

for chunk in response:
    # Azure can send chunks with an empty choices list (e.g. content-filter results)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

output

Golden leaves fall fast
Autumn whispers in the wind
Winter sleeps below

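Streamed responses omit usage by default. To get token counts, pass stream_options={"include_usage": True} (on API versions that support it); the final chunk then carries usage and has an empty choices list, so guard chunk.choices before indexing.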
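A minimal sketch of collecting both text and usage from a single stream, assuming the request was made with `stream_options={"include_usage": True}` on an API version that supports it. The `fake_stream` objects below are hand-built stand-ins for SDK chunk objects so the example runs offline, and `collect_stream` is a hypothetical helper.

```python
from types import SimpleNamespace as NS


def collect_stream(chunks):
    """Join streamed text and return it with the usage object, if one arrived."""
    parts, usage = [], None
    for chunk in chunks:
        # Usage-only chunks (and some Azure filter chunks) have no choices.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


# Stand-ins for streamed chunks, for illustration only.
fake_stream = [
    NS(choices=[NS(delta=NS(content="Golden "))], usage=None),
    NS(choices=[NS(delta=NS(content="leaves"))], usage=None),
    NS(choices=[], usage=NS(prompt_tokens=11, completion_tokens=2, total_tokens=13)),
]

text, usage = collect_stream(fake_stream)
print(text)
print(usage.total_tokens)
```

With a real client, the iterator returned by `client.chat.completions.create(..., stream=True, stream_options={"include_usage": True})` would take the place of `fake_stream`.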

03 Text Embeddings

Generate embeddings for RAG/search

python

response = client.embeddings.create(
    model="text-embedding-3-small",  # your deployment name
    input="The quick brown fox"
)

embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

output

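Embedding dimensions: 1536
First 5 values: [0.0234, -0.156, 0.899, ...]

text-embedding-3-small and text-embedding-3-large have different default output dimensions (1536 vs 3072). Mixing them in one index will break similarity search. Both accept a dimensions parameter to request shorter vectors; whatever you choose, use the same model and dimensions for every vector in an index.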
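The failure mode is easy to catch at query time. Below is a small, dependency-free sketch of cosine similarity that refuses to compare vectors of different lengths. The three-element vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity that fails loudly on mismatched dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = [1.0, 0.0, 1.0]   # toy vectors for illustration
same = [1.0, 0.0, 1.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, same))
print(cosine_similarity(query, orthogonal))
```

Raising on a length mismatch is a deliberate choice here: silently truncating or padding would return a number that looks plausible but means nothing.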

04 Managed Identity (No API Key)

Authenticate via Azure AD without hardcoding keys

python

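import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Callable that fetches and refreshes tokens for the Cognitive Services scope
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)
client = AzureOpenAI(
    api_version="2024-10-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
)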

response = client.chat.completions.create(
    model="gpt-4o-deployment",
    messages=[{"role": "user", "content": "Hello"}]
)

output (nothing is printed; the request succeeds only if the identity holds a role such as Cognitive Services OpenAI User on the resource)

DefaultAzureCredential tries multiple auth methods in order. In local dev, it may pick the wrong credential (e.g., CLI instead of env vars). Explicitly specify the credential type for predictability.

05 Vision (GPT-4 Vision Deployment)

Analyze images with Azure OpenAI

python

import base64
import httpx

image_url = "https://example.com/image.jpg"
image_data = base64.standard_b64encode(httpx.get(image_url).content).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-deployment",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

output The image shows a sunset over mountains with...

Vision requires specific deployment regions on Azure. Not all regions support gpt-4-vision. Check Azure regional availability before deploying.
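The vision example hardcodes `image/jpeg` in the data URL, which is wrong for PNG or WebP input. A minimal sketch of deriving the MIME type from the file name with the standard library follows. `image_data_url` is a hypothetical helper, and the four bytes passed in are just the start of a PNG signature, not a valid image.

```python
import base64
import mimetypes


def image_data_url(filename: str, data: bytes) -> str:
    """Return a data URL whose MIME type matches the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


url = image_data_url("photo.png", b"\x89PNG")
print(url)
```

The returned string can be used as the `url` value inside the `image_url` content part shown above.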

Chat Completions Parameters

client.chat.completions.create()

Parameter	Type	Default	Notes
`model`	str	required	Azure deployment name (NOT 'gpt-4o'). Exact deployment name from Azure portal.
`messages`	list[dict]	required	Array of role/content dicts. Roles: 'system', 'user', 'assistant', 'tool' ('function' is deprecated).
`temperature`	float	1.0	0.0-2.0. Lower = more deterministic, higher = more creative. 0.0 is near-deterministic, but identical output is not guaranteed.
`top_p`	float	1.0	0.0-1.0. Nucleus sampling. Lower = narrower token choices.
`max_tokens`	int	none	Max output tokens. If unset, output is capped by the model's own output limit and the remaining context window.
`stream`	bool	False	True = streaming tokens. False = full response at once.
`frequency_penalty`	float	0.0	-2.0 to 2.0. Higher = discourages repeating tokens.
`presence_penalty`	float	0.0	-2.0 to 2.0. Higher = penalizes tokens that have already appeared, encouraging new topics.

Core API Methods

Method / Property	Description	Returns
`client.chat.completions.create()`	Generate chat completions. Supports streaming, vision, function calling.	ChatCompletion or Iterator[ChatCompletionChunk] if stream=True
`client.embeddings.create()`	Generate text embeddings for semantic search and RAG.	CreateEmbeddingResponse with data[0].embedding (list of floats)
`client.completions.create()`	Legacy text completion (not recommended). Use chat.completions instead.	Completion
`client.images.generate()`	Generate images from text. Requires dall-e-3 deployment.	ImagesResponse with data[0].url

Common Errors & Fixes

01 AuthenticationError: Invalid credentials

Cause: Missing or invalid AZURE_OPENAI_API_KEY or AZURE_OPENAI_ENDPOINT.

Fix:

python

# Verify env vars are set (check presence rather than printing the secret):

import os
print('AZURE_OPENAI_API_KEY' in os.environ)
print(os.environ.get('AZURE_OPENAI_ENDPOINT'))

# Or pass the values explicitly:

client = AzureOpenAI(
    api_key="your-key",
    api_version="2024-10-01-preview",
    azure_endpoint="https://your-resource.openai.azure.com/"
)

02 NotFoundError: Model 'gpt-4o' not found

Cause: Using model name instead of deployment name. Or deployment doesn't exist.

Fix:

python

# Use your Azure deployment name:

# WRONG: model='gpt-4o'
# RIGHT:
client.chat.completions.create(
    model="my-gpt4o-deployment",  # Name from Azure portal
    messages=[...]
)

# To find deployments, check the Azure portal or use the management SDK:
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

03 RateLimitError: 429 Quota exceeded

Cause: Exceeded deployment token-per-minute (TPM) quota or request limits.

Fix:

python

# Implement exponential backoff:

import time

from openai import RateLimitError

for attempt in range(5):
    try:
        response = client.chat.completions.create(...)
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f'Rate limited. Waiting {wait}s')
        time.sleep(wait)

Or increase deployment quota in Azure portal > Quotas > Increase.
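The loop above has two rough edges: if every attempt fails it falls through without a response, and fixed delays make many clients retry in lockstep. A hedged sketch of a reusable variant with capped delays and full jitter follows. `retry_with_backoff`, `FakeRateLimit`, and `flaky` are illustrative names. With the real SDK you would pass `retryable=(RateLimitError,)`.

```python
import random
import time


def retry_with_backoff(call, retryable=(Exception,), attempts=5,
                       base=1.0, cap=30.0, sleep=time.sleep):
    """Call `call()` until it succeeds, sleeping a jittered, capped delay between tries."""
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter


class FakeRateLimit(Exception):
    """Stand-in for a 429 error so the example runs offline."""


calls = {"n": 0}


def flaky():
    # Fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit("429")
    return "ok"


waits = []
result = retry_with_backoff(flaky, retryable=(FakeRateLimit,), sleep=waits.append)
print(result, len(waits))
```

Injecting `sleep` keeps the example fast and testable; in production the default `time.sleep` applies.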

04 BadRequestError: context_length_exceeded

Cause: Input + output tokens exceed model's context window.

Fix:

python

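# Check token count before sending:

import tiktoken

# gpt-4o models use o200k_base; gpt-4 and gpt-3.5-turbo use cl100k_base
enc = tiktoken.get_encoding('o200k_base')
messages_text = 'Explain RAG.'  # concatenated message contents
tokens = enc.encode(messages_text)
print(f'Token count: {len(tokens)}')

# Or use max_tokens to limit output: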

response = client.chat.completions.create(
    model='gpt-4o-deployment',
    messages=[...],
    max_tokens=500  # Limit output
)
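When the conversation history is what overflows the window, a common approach is to keep the system message and drop the oldest turns. The sketch below does that against a token budget. `estimate_tokens` is a deliberately crude assumption (about four characters per token) so the example has no dependencies. Swap in a real tokenizer count for anything that matters.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages, budget):
    """Keep system messages plus the most recent turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, budget=25)
print([m["role"] for m in trimmed])
```

Leave headroom in the budget for the reply: the context window has to hold the input and the tokens the model generates.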

Azure OpenAI vs OpenAI API

Feature	Azure OpenAI	OpenAI API
Cost Model	Pay-per-token, regional pricing	Pay-per-token, global pricing
Compliance	FedRAMP, SOC 2, ISO, HIPAA-eligible	Standard SLAs, no FedRAMP
VPC/Network	VNet integration, private endpoints	Public API only
Quota Control	Per-deployment TPM limits in Azure	Account-level rate limits
Model Selection	Deploy specific model versions	Model aliases or pinned dated snapshots
Authentication	API key or Azure AD managed identity	API key only
Data Residency	Data stays in Azure region	Data stored in OpenAI US infrastructure

Production Gotchas

⚠ API Version Mismatch Breaks Features

Azure ships new API versions frequently and retires older preview versions over time. Function calling, vision, and tool use each require a minimum api_version. Using api_version='2024-02-15-preview' with a feature that requires '2024-10-01-preview' can return an error or leave the unsupported parameter ignored. Pin the exact version your feature needs, and retest whenever Azure retires a version or you change the pin.
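A startup check can catch a stale pin before the first request. The sketch below parses the date out of an api_version string and compares it with a minimum. Both helper names are hypothetical, and treating "later date" as "supports the feature" is a heuristic assumption: a GA version is not guaranteed to include everything from an earlier preview.

```python
from datetime import date


def parse_api_version(version: str):
    """Split a string like '2024-10-01-preview' into (date, is_preview)."""
    is_preview = version.endswith("-preview")
    stamp = version[: -len("-preview")] if is_preview else version
    return date.fromisoformat(stamp), is_preview


def meets_minimum(pinned: str, required: str) -> bool:
    """True if the pinned version is dated on or after the required one."""
    return parse_api_version(pinned)[0] >= parse_api_version(required)[0]


print(meets_minimum("2024-02-15-preview", "2024-10-01-preview"))
print(meets_minimum("2024-12-01-preview", "2024-10-01-preview"))
```

A check like this is only a guard rail; the release notes for each API version remain the authority on which features it carries.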

⚠ Deployment Name ≠ Model Name

This is the #1 confusion. Your Azure deployment might be named 'gpt-4o-prod' but deploy the 'gpt-4o' model. When calling create(), use model='gpt-4o-prod' (deployment name), not model='gpt-4o' (model name). Swapping these causes 404 NotFoundError.

⚠ Token Limits Vary by Region

TPM (tokens-per-minute) quotas are region-specific. East US may have 40K TPM while Central US has 10K. Deploying to a quota-limited region under production load causes 429 RateLimitError. Monitor quota utilization and scale regions accordingly.
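For monitoring quota utilization on the client side, one option is a sliding-window counter of tokens spent in the last minute. This is a minimal sketch under stated assumptions: `TpmBudget` is a hypothetical class, timestamps are passed in explicitly to keep it deterministic, and the service's own accounting may count tokens differently.

```python
from collections import deque


class TpmBudget:
    """Client-side sliding-window tracker for tokens spent per minute."""

    def __init__(self, tpm_limit: int, window: float = 60.0):
        self.tpm_limit = tpm_limit
        self.window = window
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def used(self, now: float) -> int:
        # Drop entries that have aged out of the window, then total the rest.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def try_spend(self, tokens: int, now: float) -> bool:
        """Record the spend and return True if it fits, else return False."""
        if self.used(now) + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        return True


budget = TpmBudget(tpm_limit=10_000)
first = budget.try_spend(6_000, now=0.0)
second = budget.try_spend(5_000, now=10.0)
third = budget.try_spend(5_000, now=61.0)
print(first, second, third)
```

In real use `now` would come from `time.monotonic()`, and a rejected spend is the signal to wait or route the request to a deployment in another region.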

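⚠ Streaming Omits Usage Stats by Default

When stream=True, chunks carry no usage (prompt_tokens, completion_tokens) unless you request it. Pass stream_options={"include_usage": True} on an API version that supports it and read usage from the final chunk, whose choices list is empty. Otherwise, disable streaming for calls where you need exact counts. This matters for cost tracking.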

⚠ Vision Deployments Require Specific Regions

gpt-4-vision is only available in select regions (East US, West Europe, etc.). In an unsupported region the model cannot be deployed, so verify regional availability in the Azure docs before planning vision workloads.

Verified 2026-04 · v2024-12-01 · gpt-4o, gpt-4.1, text-embedding-3-small, text-embedding-3-large
