Rate limits: what they are and where to find them
Why this matters
Rate limits prevent your code from hammering the API and protect OpenAI's infrastructure. If you don't know your limits, you'll ship code that silently fails in production when traffic spikes, and your error handling won't distinguish between 'API is down' and 'you're rate limited'.
Explanation
OpenAI enforces rate limits at the organization and project level rather than per individual API key. The two limits you will hit most often are requests per minute (RPM) and tokens per minute (TPM). Your limits depend on your account tier (Free, Pay-As-You-Go, or Enterprise), your spending history, and which model you're calling. Free tier accounts get the lowest limits (often 3 RPM for chat completions); accounts with paid billing history get higher limits.
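When you hit a rate limit, the OpenAI API returns a 429 Too Many Requests HTTP status code, typically with a Retry-After header that tells you how many seconds to wait before retrying. The Python SDK (v1 and later) retries 429 responses automatically, twice by default, with a short exponential backoff; the count is controlled by the max_retries option on the client. If those retries are exhausted, the SDK raises a RateLimitError exception, which is a subclass of APIError, and from that point you must handle it in your code.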
You find your rate limits in two places: (1) the Usage page in your OpenAI dashboard shows current spend and your tier, from which you can infer your limits; (2) every API response includes rate limit headers that tell you exactly how many requests and tokens you have remaining in the current minute window. These headers are not exposed on the parsed response object the SDK returns by default, but you can read them by making the call through with_raw_response, which gives you the HTTP headers alongside the parsed body.
Request code
from openai import OpenAI

# API key is read from the OPENAI_API_KEY environment variable
client = OpenAI()

# Call through with_raw_response so the HTTP headers are available
# alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4.1",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)

# parse() returns the usual ChatCompletion object
response = raw.parse()
print(f"Content: {response.choices[0].message.content}")

# Rate limit info lives in the HTTP response headers
headers = raw.headers
labels = {
    "x-ratelimit-limit-requests": "Requests per minute limit",
    "x-ratelimit-remaining-requests": "Requests remaining",
    "x-ratelimit-limit-tokens": "Tokens per minute limit",
    "x-ratelimit-remaining-tokens": "Tokens remaining",
}
for name, label in labels.items():
    # Headers may be absent on some responses, so check before printing
    if name in headers:
        print(f"{label}: {headers[name]}")
Authentication
Visit https://platform.openai.com/account/api-keys to generate or retrieve your API key. Store it in an environment variable named OPENAI_API_KEY. The Python SDK will automatically read this when you call OpenAI(). Do not commit your key to version control.
Response shape
| Field | Description |
|---|---|
| x-ratelimit-limit-requests | Integer: maximum number of requests allowed per minute |
| x-ratelimit-remaining-requests | Integer: number of requests you can still make this minute |
| x-ratelimit-limit-tokens | Integer: maximum number of tokens allowed per minute |
| x-ratelimit-remaining-tokens | Integer: number of tokens you can still consume this minute |
| Retry-After | Float (only in 429 responses): seconds to wait before retrying |
Field guide
x-ratelimit-remaining-tokens: This is the field that matters most in production. If it drops below the number of tokens in your next request, you'll get rate-limited. Use it to decide when to slow down or back off.
x-ratelimit-limit-tokens: Your TPM ceiling. The value depends on the model and your tier (for example, 40,000 TPM on a low tier). If this number is lower than you expected, your account tier may have changed or you haven't completed billing setup.
Retry-After: Present only in 429 responses. The SDK's built-in retries honor it, but once those are exhausted and RateLimitError is raised, any retry loop of your own should wait at least this long before trying again. Ignoring it causes a retry storm.
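As a rough illustration of using x-ratelimit-remaining-tokens before sending the next request, the sketch below parses the header from a plain mapping and compares it with an estimated token count. The helper names (remaining_tokens, can_send) and the sample header values are hypothetical, and the token estimate is assumed to come from your own counting.

```python
from typing import Mapping, Optional


def remaining_tokens(headers: Mapping[str, str]) -> Optional[int]:
    """Return x-ratelimit-remaining-tokens as an int, or None if absent or malformed."""
    value = headers.get("x-ratelimit-remaining-tokens")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def can_send(headers: Mapping[str, str], estimated_tokens: int) -> bool:
    """True if the estimated tokens fit in the remaining budget.

    When the header is missing or unreadable, default to True and let the
    API make the decision.
    """
    remaining = remaining_tokens(headers)
    if remaining is None:
        return True
    return estimated_tokens <= remaining


# Hypothetical header values, using lowercase keys as in the table above
sample = {
    "x-ratelimit-limit-tokens": "40000",
    "x-ratelimit-remaining-tokens": "37200",
}
print(can_send(sample, 10_000))  # True: 10,000 fits in 37,200
print(can_send(sample, 38_000))  # False: 38,000 exceeds 37,200
```

A plain dict lookup is case-sensitive, so this sketch assumes lowercase header names; an HTTP client's own header object usually handles case for you.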
Setup trap
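The rate limit headers are not exposed on the standard parsed response object, which contains only the response body. To read them, make the call through with_raw_response (for example client.chat.completions.with_raw_response.create(...)), read .headers on the result, and call .parse() to get the usual object. Many developers never realize the headers are available and instead try to track their usage manually without real data, which leads to incorrect rate limit logic. Always check your SDK version's documentation for the current way to access response headers.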
Cost
Hitting rate limits itself costs nothing, but the retry behavior you implement can cost money if done wrong. A retry loop without backoff can send hundreds of requests in seconds, and every retry that succeeds is billed like any other request. Use exponential backoff, and set a maximum retry count (never more than 3-4) and a maximum total backoff time (never more than 60 seconds) to prevent runaway costs.
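To make the two caps concrete, here is a minimal sketch that computes a backoff schedule bounded by both a retry count and a total wait time. The function name backoff_schedule and its default values are illustrative assumptions, not part of any SDK; real implementations often add random jitter, which is left out here to keep the output deterministic.

```python
from typing import List


def backoff_schedule(
    base: float = 1.0,
    factor: float = 2.0,
    max_retries: int = 4,
    max_total: float = 60.0,
) -> List[float]:
    """Return the delays (in seconds) to wait before each retry.

    Stops when either the retry count or the total wait budget is reached.
    """
    delays: List[float] = []
    total = 0.0
    for attempt in range(max_retries):
        delay = base * factor ** attempt
        if total + delay > max_total:
            break  # the next wait would exceed the total backoff budget
        delays.append(delay)
        total += delay
    return delays


print(backoff_schedule())                # [1.0, 2.0, 4.0, 8.0]
print(backoff_schedule(max_retries=10))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the defaults, four retries wait 15 seconds in total. Raising the retry count to 10 still stops after five waits, because a sixth wait of 32 seconds would push the total past 60.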
Rate limits
Free tier and Pay-As-You-Go accounts with low spending history hit rate limits frequently, especially with gpt-4 and gpt-4-turbo models (which have lower TPM limits). If you're testing with concurrent requests or batch processing, you will hit these limits. This is by design: OpenAI throttles low-trust accounts.
Common gotcha
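The most common mistake: developers assume the SDK will keep retrying on rate limits until the request succeeds. It doesn't. By default it retries only twice (max_retries=2) and then raises a RateLimitError exception; if you don't catch it, the exception propagates and your entire request chain fails. Wrap API calls in try/except, and either raise max_retries on the client or implement your own exponential backoff for sustained throttling.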
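The try/except-plus-backoff pattern can be sketched as a small generic wrapper. Everything named here (call_with_backoff, FakeRateLimitError, flaky) is hypothetical; in real code you would presumably pass openai.RateLimitError as the exception to retry on and wrap the actual API call in fn. The sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # loop always returns or raises


class FakeRateLimitError(Exception):
    """Stand-in for a rate limit exception so the sketch runs offline."""


attempts = {"count": 0}


def flaky() -> str:
    # Fails twice, then succeeds, to exercise the retry path
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429")
    return "ok"


slept = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, (FakeRateLimitError,), sleep=slept.append)
print(result, slept)  # ok [1.0, 2.0]
```

Any exception type not listed in retry_on propagates immediately, which keeps authentication or validation errors from being retried pointlessly.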
Error recovery
RateLimitError, AuthenticationError, InternalServerError
Experienced dev note
Rate limit headers are per-model and per-account-tier, but they reset on a sliding 60-second window, not at a fixed time. A common production bug: developers log their TPM remaining and assume it's monotonic (always decreasing), then panic when it jumps back up. It jumps because the oldest request from 60 seconds ago just fell out of the window. Also, if you're using the async client (AsyncOpenAI), rate limit headers work the same way: don't make a separate synchronous call just to check limits.
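The jump described above can be reproduced with a toy sliding-window counter. This is only a simplified model for intuition, not a description of how OpenAI accounts for usage internally; the class name SlidingWindow, the timestamps, and the token counts are all made up for the example.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindow:
    """Toy model: tokens count against the limit for window_seconds after use."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def record(self, now: float, tokens: int) -> None:
        self.events.append((now, tokens))

    def remaining(self, now: float) -> int:
        # Drop usage that is older than the window before summing
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        return self.limit - sum(tokens for _, tokens in self.events)


window = SlidingWindow(limit=40_000)
window.record(0.0, 300)     # an older request
window.record(50.0, 500)
window.record(55.0, 2_000)

print(window.remaining(55.0))  # 37200: all three requests still count
print(window.remaining(61.0))  # 37500: the 300-token request aged out
```

No new request was made between the two readings, yet the remaining value rose by 300, which is the non-monotonic behavior the note warns about.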
Check your understanding
You have a Free tier account with a 40,000 TPM limit. Your first request uses 500 tokens. Your second request uses 2,000 tokens. The response header says x-ratelimit-remaining-tokens is 37,200. What explains the gap between that value and the one you expected, and can you safely make a third 10,000-token request immediately?
Answer hint
Calculate what happened: 40,000 (limit) - 500 (first) - 2,000 (second) = 37,500 expected remaining, but you got 37,200. That's 300 tokens fewer, which means about 300 tokens of other usage are still counted in the sliding window, for example a request made earlier in the last 60 seconds. (Usage falling out of the window would push the number up, not down.) A 10,000-token request fits within the remaining 37,200, so it is safe now, but it leaves only 27,200 tokens until older requests age out of the window. Whether to send it immediately depends on your request rate.