Debug Fix beginner · 3 min read

Together AI rate limits

Quick answer
Together AI enforces per-second and per-minute rate limits to control API usage and prevent overload; exceeding them raises a RateLimitError. To handle this, wrap your client.chat.completions.create() calls in exponential backoff retry logic so failed requests are retried automatically after increasing delays.
ERROR TYPE api_error
⚡ QUICK FIX
Add exponential backoff retry logic around your API call to handle RateLimitError automatically.

Why this happens

Together AI applies rate limits, restricting the number of API requests per second or per minute, to ensure fair usage and system stability. When your application sends requests too quickly or exceeds its quota, the API responds with a RateLimitError. The error typically looks like this:

openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.

Example of code that triggers the error, with no retry handling:

python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
output
openai.RateLimitError: You have exceeded your current quota, please check your plan and billing details.

The fix

Implement exponential backoff retry logic: catch RateLimitError and retry the request after a delay that doubles on each attempt. This avoids failing on the first 429 response and respects Together AI's rate limits.

Example with retry logic:

python
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
)

max_retries = 5
retry_delay = 1  # initial delay in seconds

for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
            messages=[{"role": "user", "content": "Hello"}]
        )
        print(response.choices[0].message.content)
        break  # success, exit the loop
    except RateLimitError:
        print(f"Rate limit hit, retrying in {retry_delay}s...")
        time.sleep(retry_delay)
        retry_delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
else:
    print("Failed after multiple retries due to rate limits.")
output
Hello
# or
Rate limit hit, retrying in 1s...
Hello
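The inline loop above can be factored into a reusable decorator so every API call site shares the same retry behavior. Below is a stdlib-only sketch: the retry_with_backoff name and its parameters are illustrative, not part of any SDK, and it adds the full jitter discussed in the prevention tips.

```python
import functools
import random
import time


def retry_with_backoff(exceptions, max_retries=5, base_delay=1.0, cap=30.0):
    """Retry the wrapped function on the given exception type(s),
    sleeping for an exponentially growing, jittered delay between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise  # out of retries: surface the error to the caller
                    # full jitter: pick a random delay in [0, capped exponential]
                    delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
                    time.sleep(delay)
        return wrapper
    return decorator
```

With the client from the example above, you would decorate a small wrapper function, e.g. apply @retry_with_backoff(RateLimitError) to a function that calls client.chat.completions.create().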

Preventing it in production

  • Use exponential backoff with jitter to avoid synchronized retries.
  • Monitor your API usage and upgrade your Together AI plan if needed.
  • Implement client-side rate limiting to throttle requests before hitting the API.
  • Cache frequent responses to reduce unnecessary calls.
  • Log and alert on repeated RateLimitError to proactively manage usage.
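The client-side throttling bullet can be sketched with a simple token bucket that blocks before each request. This is a minimal stdlib-only illustration, not a Together AI API; the TokenBucket name and its rate/capacity parameters are assumptions you would tune to your plan's limits.

```python
import threading
import time


class TokenBucket:
    """Client-side rate limiter sketch: allow up to `rate` requests per
    second, with bursts up to `capacity`. Not production-hardened."""

    def __init__(self, rate, capacity=None):
        self.rate = rate                      # tokens refilled per second
        self.capacity = capacity or rate      # maximum burst size
        self.tokens = self.capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # refill tokens based on elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock, then re-check
```

Calling bucket.acquire() immediately before each client.chat.completions.create() call throttles your application before the API has to reject anything.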

Key Takeaways

  • Together AI enforces rate limits and raises RateLimitError when they are exceeded.
  • Use exponential backoff retry logic to handle rate limits gracefully and avoid crashes.
  • Monitor usage and implement client-side throttling to prevent hitting limits in production.
Verified 2026-04 · meta-llama/Llama-3.3-70B-Instruct-Turbo