API Beginner easy · 4 min

Rate limits: what they are and where to find them

What you will learn

OpenAI enforces requests-per-minute and tokens-per-minute limits on your API usage: learn where to view them and what happens when you hit one.

Why this matters

Rate limits prevent your code from hammering the API and protect OpenAI's infrastructure. If you don't know your limits, you'll ship code that silently fails in production when traffic spikes, and your error handling won't distinguish between 'API is down' and 'you're rate limited'.

Skip if: You don't need to manually check rate limits if you're prototyping in a notebook with a single request every few seconds. You also don't need this if you're using a batch processing endpoint, which has different quotas. However, once you move to production with concurrent requests, this becomes critical.

Explanation

OpenAI enforces two main types of limits: requests per minute (RPM) and tokens per minute (TPM). Limits apply at the organization and project level rather than to an individual API key, and they depend on your usage tier (Free, then Tier 1 through Tier 5 as your paid spend grows) and on which model you're calling. Free tier accounts get the lowest limits (often 3 RPM for chat completions); accounts with paid billing history get higher limits.

When you hit a rate limit, the OpenAI API returns a 429 Too Many Requests HTTP status code, typically with a Retry-After header that tells you how many seconds to wait before retrying. The Python SDK (1.0+) retries 429 responses automatically, twice by default with a short exponential backoff; you can change this with the max_retries client option. Once those retries are exhausted, the SDK raises a RateLimitError exception, which is a subclass of APIError, and you must handle that in your code.

You find your rate limits in two places: (1) the Limits page in your OpenAI dashboard settings lists your usage tier and per-model limits (the Usage page shows current spend); (2) API responses include rate limit headers that report how many requests and tokens you have remaining in the current window. The SDK's parsed response object does not expose these headers, but you can read them by making the call through the .with_raw_response accessor.

Request code

python

from openai import OpenAI

# The API key is read from the OPENAI_API_KEY environment variable
client = OpenAI()

# Call through .with_raw_response to get the HTTP response (with headers)
# instead of only the parsed completion object
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4.1",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ]
)

# .parse() returns the usual ChatCompletion object
response = raw.parse()
print(f"Content: {response.choices[0].message.content}")

# Rate limit headers live on the raw HTTP response.
# .get() returns None when a header is absent.
headers = raw.headers
print(f"Requests per minute limit: {headers.get('x-ratelimit-limit-requests')}")
print(f"Requests remaining: {headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens per minute limit: {headers.get('x-ratelimit-limit-tokens')}")
print(f"Tokens remaining: {headers.get('x-ratelimit-remaining-tokens')}")

Authentication

Visit https://platform.openai.com/account/api-keys to generate or retrieve your API key. Store it in an environment variable named OPENAI_API_KEY. The Python SDK will automatically read this when you call OpenAI(). Do not commit your key to version control.

Response shape

Field	Description
`x-ratelimit-limit-requests`	Integer: maximum number of requests allowed per minute
`x-ratelimit-remaining-requests`	Integer: number of requests you can still make this minute
`x-ratelimit-limit-tokens`	Integer: maximum number of tokens allowed per minute
`x-ratelimit-remaining-tokens`	Integer: number of tokens you can still consume this minute
`Retry-After`	Float (only in 429 responses): seconds to wait before retrying

Field guide

x-ratelimit-remaining-tokens

This is the field that matters most in production. If this drops below the number of tokens in your next request, you'll get rate-limited. Use it to implement backoff logic.
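
A minimal sketch of that check, assuming the headers have already been copied into a plain dict with lowercase keys. The helper names remaining_tokens and fits_in_budget are hypothetical, and estimated_tokens has to come from your own token counting.

```python
from typing import Mapping, Optional


def remaining_tokens(headers: Mapping[str, str]) -> Optional[int]:
    """Return x-ratelimit-remaining-tokens as an int, or None if absent or malformed."""
    value = headers.get("x-ratelimit-remaining-tokens")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def fits_in_budget(headers: Mapping[str, str], estimated_tokens: int) -> bool:
    """True if the next request's estimated tokens fit in the remaining budget.

    When the header is missing we cannot tell, so this sketch returns True and
    leaves a possible 429 to the caller's error handling.
    """
    remaining = remaining_tokens(headers)
    if remaining is None:
        return True
    return estimated_tokens <= remaining


example_headers = {"x-ratelimit-remaining-tokens": "37200"}
print(fits_in_budget(example_headers, 10_000))  # True: 10,000 <= 37,200
print(fits_in_budget(example_headers, 40_000))  # False: 40,000 > 37,200
```

Returning True when the header is missing is a design choice for this sketch; a stricter caller could treat "unknown" as a reason to wait instead.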

x-ratelimit-limit-tokens

Your TPM ceiling. If you're on Free tier, this is usually 40,000 TPM for gpt-4, but varies by model. If this number is lower than you expected, your account tier may have changed or you haven't completed billing setup.

Retry-After

Sent with 429 responses. The Python SDK honors this value during its built-in retries; if you disable those (max_retries=0) or write your own retry loop, catch RateLimitError and sleep for this duration before retrying. Ignoring it causes a retry storm.

Setup trap

The rate limit headers are not available on the standard parsed response object: you have to make the call through .with_raw_response and read them from the raw HTTP response. Many developers never realize the headers are available and instead try to track usage by hand without real data, which leads to incorrect rate limit logic. Always check the documentation for your SDK version for the supported way to access response headers.

Cost

Hitting a rate limit costs nothing in itself, but the retry behavior you implement can cost money if done wrong. A retry loop without backoff can send hundreds of requests in seconds. Use exponential backoff, and set a maximum retry count (3-4 at most) and a maximum total backoff time (60 seconds at most) to prevent runaway costs.

Rate limits

Free tier and Pay-As-You-Go accounts with low spending history hit rate limits frequently, especially with gpt-4 and gpt-4-turbo models (which have lower TPM limits). If you're testing with concurrent requests or batch processing, you will hit these limits. This is by design: OpenAI throttles low-trust accounts.

Common gotcha

The most common mistake: developers assume the SDK will keep retrying rate-limited calls until they succeed. It doesn't. By default it retries twice with a short backoff, then raises RateLimitError, and if you don't catch it, the exception propagates and your entire request chain fails. Wrap API calls in try/except and, for sustained load, raise max_retries or add your own exponential backoff.

Error recovery

RateLimitError

Caused by exceeding requests per minute or tokens per minute. Extract the Retry-After value from the error response (if present) or use exponential backoff (2^attempt seconds, capped at 60). Retry the exact same request after waiting.
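
The recovery rule above can be sketched as follows. To stay runnable offline, this uses a hypothetical RateLimited exception as a stand-in for the SDK's RateLimitError; the names backoff_delay and call_with_backoff, and the injectable sleep parameter, are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the SDK's RateLimitError."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  cap: float = 60.0) -> float:
    """Prefer the server's Retry-After; otherwise 2^attempt seconds. Cap at `cap`."""
    delay = retry_after if retry_after is not None else float(2 ** attempt)
    return min(delay, cap)


def call_with_backoff(fn: Callable[[], T], max_retries: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on RateLimited up to max_retries times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt, err.retry_after))
    raise AssertionError("unreachable")


# Demo: fail twice, then succeed; record the waits instead of sleeping.
waits = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return "ok"


print(call_with_backoff(flaky, sleep=waits.append), waits)  # ok [1.0, 2.0]
```

With the real SDK you would catch RateLimitError in place of the stand-in and read the header from the exception's attached HTTP response, if one is present.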

AuthenticationError

Your OPENAI_API_KEY is missing, invalid, or has been revoked. Check that the environment variable is set before creating the client. If the key is correct, regenerate it in the dashboard: you may have rotated it elsewhere.

InternalServerError

A server-side failure (HTTP 5xx) on OpenAI's end, not a rate limit. Wait 30 seconds and retry. This is rare (< 1% of requests).
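
One way to keep these three cases apart in your own error handling is to branch on the HTTP status code. This is a rough sketch: the function name and the action labels are hypothetical, and it assumes the usual mapping of 429 to RateLimitError, 401 to AuthenticationError, and 5xx to InternalServerError.

```python
def recovery_action(status_code: int) -> str:
    """Map an HTTP status code to a coarse recovery action (hypothetical labels)."""
    if status_code == 429:
        return "backoff_and_retry"   # rate limited: wait, then retry
    if status_code == 401:
        return "fix_credentials"     # bad or missing key: retrying will not help
    if 500 <= status_code < 600:
        return "wait_and_retry"      # server-side failure
    return "do_not_retry"            # anything else needs a code or request fix


for code in (429, 401, 503, 400):
    print(code, recovery_action(code))
```

Keeping the decision in one small function makes it easy to log "rate limited" and "API is down" as separate events.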

Experienced dev note

Rate limit headers are per-model and per-account-tier, but they reset on a sliding 60-second window, not at a fixed time. A common production bug: developers log their TPM remaining and assume it's monotonic (always decreasing), then panic when it jumps back up. It jumps because the oldest request from 60 seconds ago just fell out of the window. Also, if you're using the async client (AsyncOpenAI), rate limit headers work the same way: don't make a separate synchronous call just to check limits.
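
To see why the remaining value is not monotonic, here is a small local model of a sliding 60-second window. It is an illustration of the behavior described above under that sliding-window assumption, not a description of OpenAI's server-side accounting; the class name SlidingWindowBudget is hypothetical.

```python
from collections import deque


class SlidingWindowBudget:
    """Local model of a tokens-per-minute budget over a sliding window."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, tokens), oldest first

    def record(self, now: float, tokens: int) -> None:
        """Register token usage at time `now` (seconds)."""
        self.events.append((now, tokens))

    def remaining(self, now: float) -> int:
        """Drop usage older than the window, then subtract what is left."""
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        return self.limit - sum(tokens for _, tokens in self.events)


budget = SlidingWindowBudget(limit=40_000)
budget.record(now=0.0, tokens=10_000)
budget.record(now=30.0, tokens=5_000)
print(budget.remaining(now=31.0))  # 25000
print(budget.remaining(now=61.0))  # 35000: the t=0 request aged out
```

The jump from 25,000 to 35,000 happens without any new request: the oldest usage simply left the window.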

Check your understanding

You have a Free tier account with a 40,000 TPM limit. Your first request uses 500 tokens. Your second request uses 2,000 tokens. The response header says x-ratelimit-remaining-tokens is 37,200. What explains the gap between that value and what you expected, and can you safely make a third 10,000-token request immediately?

Show answer hint

Calculate what happened: 40,000 (limit) - 500 (first) - 2,000 (second) = 37,500 expected remaining, but you got 37,200. That's 300 tokens fewer, meaning roughly 300 extra tokens are being counted in the current window, for example from an earlier request made in the last 60 seconds that hasn't aged out yet. (A request falling out of the sliding window would push the number up, not down.) A 10,000-token request fits within the remaining 37,200, so it is safe now, but it leaves only 27,200 (about 68% of your limit) until earlier usage ages out. Whether that is enough depends on your request rate.

VERSION In OpenAI Python SDK 1.0+, HTTP headers are exposed through the documented .with_raw_response accessor rather than on the parsed response object. Private attributes such as _headers are implementation details and should not be relied on. As of April 2026, confirm the accessor against the README for your installed SDK version.
