client.chat.completions.create(): the core method
Why this matters
This is the fundamental building block for every LLM interaction in production. Understanding request structure, response format, and parameter tuning directly impacts latency, cost, and output quality across all downstream applications.
Explanation
What it does: client.chat.completions.create() sends a list of messages to OpenAI's GPT model and returns a single completion response. It's the standard synchronous way to interact with chat models like gpt-4o and gpt-4-turbo.
How it works: You provide a model name, a list of message objects (with roles like 'system', 'user', 'assistant'), and optional parameters like temperature and max_tokens. The SDK constructs an HTTP POST request, sends it to api.openai.com, waits for the response, and returns a ChatCompletion object containing the model's reply in .choices[0].message.content.
When to use it: Use this for single-turn or multi-turn conversations where you can wait for the full response. It's ideal for chatbots, Q&A systems, content generation, and analysis tasks where latency under 5 seconds is acceptable.
Request code
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
response = client.chat.completions.create(
model='gpt-4o',
messages=[
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Explain quantum entanglement in one sentence.'}
],
temperature=0.7,
max_tokens=100
)
print(response.choices[0].message.content) Authentication
Set your API key as an environment variable before running code:
export OPENAI_API_KEY='sk-...'
Or pass it directly to the client:
client = OpenAI(api_key='sk-...')
The SDK reads OPENAI_API_KEY automatically if no api_key is passed to OpenAI().
Response shape
| Field | Description |
|---|---|
id | String identifier for this completion (e.g., 'chatcmpl-8nB...') |
object | Always 'chat.completion' |
created | Unix timestamp when response was generated |
model | Model name that processed the request |
choices | List of completion objects |
choices[0].message.content | The text response from the model |
choices[0].message.role | Always 'assistant' |
choices[0].finish_reason | Why generation stopped: 'stop' (natural), 'length' (hit max_tokens), or 'tool_calls' |
usage.prompt_tokens | Tokens consumed by your input messages |
usage.completion_tokens | Tokens generated in the response |
usage.total_tokens | Sum of prompt and completion tokens |
Field guide
choices[0].message.content The actual text you need to display or process: this is where your answer lives
usage.total_tokens Multiply by the model's per-token cost ($0.03 per 1M input tokens for gpt-4o as of April 2026) to understand what this request cost you
finish_reason If it says 'length', your response was truncated: increase max_tokens or reduce prompt size
Setup trap
Setting os.environ['OPENAI_API_KEY'] after instantiating OpenAI() does NOT work. The SDK reads the environment variable at initialization time. Always set your environment variable or pass api_key to OpenAI() before making any API calls.
Cost
Each call costs based on input and output tokens. gpt-4o costs ~$0.03 per 1M input tokens and ~$0.12 per 1M output tokens (April 2026 pricing). A 1,000 token input + 500 token output costs roughly $0.00004. Test with small max_tokens values first to control spend while developing.
Rate limits
Free trial accounts are limited to 3 requests per minute. Paid accounts start at 3,500 requests per minute. If you hit a 429 status code, wait 30 seconds and retry. For high-volume applications, implement exponential backoff.
Common gotcha
Accessing the response incorrectly. Beginners write response.message.content instead of response.choices[0].message.content. The response is a wrapper object; the actual message is inside the choices list at index 0.
Error recovery
AuthenticationErrorRateLimitErrorAPIConnectionErrorBadRequestErrorExperienced dev note
Cache your system prompt and conversation history efficiently. Every character costs money. Use temperature=0 for deterministic outputs (classification, extraction) and temperature=1.0+ for creative tasks. Store responses and implement request deduplication: if the same user asks the same question twice, return your cached response instead of calling the API again.
Check your understanding
If you increase max_tokens from 100 to 500 but your model keeps finishing with finish_reason='length', what does that tell you about your input, and how would you fix it?
Show answer hint
finish_reason='length' means the model hit max_tokens before reaching a natural stop. The issue isn't your input: it's that your limit is too low. Either increase max_tokens or accept partial responses. If costs are a concern, try a shorter input prompt to reduce token usage.
openai.ChatCompletion.create() or openai.api_key = 'sk-...' from 0.x versions. This course uses 1.x exclusively.