HTTP 429 Too Many Requests (Qwen2.5-Coder free tier rate limit)
Stack trace
HTTP 429 Too Many Requests
Error: {"error": {"message": "Rate limit exceeded. Free tier allows 30 requests/minute and 100K tokens/day for code generation tasks.", "code": "rate_limit_exceeded", "retry_after": 60}}
Response status: 429
Why it happens
The Qwen2.5-Coder free tier enforces two quotas, as stated in the error message above: 30 requests/minute and 100K tokens/day for code generation tasks. Exceeding either one (common when batch-processing multiple files, running iterative code generation, or testing in a loop) makes the API reject requests with HTTP 429. Code tasks such as file analysis, refactoring, and test generation typically send whole source files as input, so they consume tokens faster than general chat and reach the free-tier limits sooner.
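To make the interaction of the two quotas concrete, here is a minimal arithmetic sketch. The two constants are the figures quoted in the error message; the function names and the per-request token sizes are hypothetical and only illustrate the calculation.

```python
REQUESTS_PER_MINUTE = 30   # free-tier figure quoted in the error message
TOKENS_PER_DAY = 100_000   # free-tier figure quoted in the error message


def requests_until_daily_quota(tokens_per_request: int) -> int:
    """Number of whole requests that fit inside the daily token quota."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return TOKENS_PER_DAY // tokens_per_request


def minutes_until_daily_quota(tokens_per_request: int) -> float:
    """Minutes needed to exhaust the daily quota when sending at the per-minute cap."""
    return requests_until_daily_quota(tokens_per_request) / REQUESTS_PER_MINUTE
```

For example, at a hypothetical 1,500 tokens per request the daily quota covers 66 requests, which a client sending at the full 30 requests/minute would use up in a little over two minutes. For a batch job the daily token limit is therefore often the binding constraint, not the per-minute limit.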
Detection
Monitor the HTTP response status in your client; 429 is the explicit signal. Before going to production, run a representative workload and measure tokens per task to estimate daily quota burn. If the API returns rate-limit headers such as 'x-ratelimit-remaining' and 'x-ratelimit-reset', log them so you can warn before a 429 occurs.
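The header check described above could look roughly like the sketch below. It assumes the API actually returns 'x-ratelimit-remaining' and 'x-ratelimit-reset' and that the remaining count is a plain integer; the function names, the logger name, and the threshold of 5 are hypothetical.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger("qwen.ratelimit")  # hypothetical logger name


def _header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Case-insensitive header lookup that also works on a plain dict."""
    for key, value in headers.items():
        if key.lower() == name:
            return value
    return None


def remaining_requests(headers: Mapping[str, str]) -> Optional[int]:
    """Parse 'x-ratelimit-remaining'; return None if absent or not an integer."""
    raw = _header(headers, "x-ratelimit-remaining")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def should_slow_down(headers: Mapping[str, str], threshold: int = 5) -> bool:
    """Warn and return True when the remaining request budget is at or below threshold."""
    remaining = remaining_requests(headers)
    if remaining is None:
        return False  # header missing or unparseable: nothing to act on
    if remaining <= threshold:
        reset = _header(headers, "x-ratelimit-reset")
        logger.warning("Rate limit nearly exhausted: remaining=%s reset=%s", remaining, reset)
        return True
    return False
```

A caller would pass `response.headers` after each request and pause or reduce concurrency when the function returns True.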
Causes & fixes
Cause: Sending more than 30 requests/minute to the Qwen2.5-Coder free API
Fix: Implement exponential backoff: catch the 429, wait at least the number of seconds given in the Retry-After header (60 in the response above), then retry. With the tenacity library: @retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(3)). These waits start at 4 seconds, so also sleep for the Retry-After value before retrying; otherwise the retries land inside the same rate-limit window.
Cause: Exceeding the 100K tokens/day quota on the free tier for code generation tasks
Fix: Upgrade to the paid tier (removes the daily token limit), OR batch tasks to stay under 100K tokens/day (estimate: ~50-100 small file analyses at roughly 1-2K tokens each), OR split the work across multiple days and cache results to avoid re-analyzing identical code
Cause: Making rapid sequential code analysis calls without delays (testing loop, batch processing)
Fix: Add time.sleep(2) between requests, which keeps throughput at or below 30 requests/minute; use a slightly longer delay for headroom. Alternatively, use queue-based processing with a 60+ second cooldown after every 30 requests.
Cause: Using the free tier for a production code generation service (expected high volume)
Fix: Migrate to the paid tier immediately. The free tier is for testing/development only. The paid tier offers 100+ requests/minute and no daily token limit for reliable production use.
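The pacing fixes above can also be expressed as a small client-side limiter that blocks until a request slot is free, so bursts below the cap are not slowed down. This is a minimal single-threaded sketch, not a production implementation; the class name `SlidingWindowLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls requests within any window of `period` seconds."""

    def __init__(
        self,
        max_calls: int = 30,        # free-tier requests/minute from the error message
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of requests still inside the window

    def _evict(self, now: float) -> None:
        """Drop timestamps that have aged out of the window."""
        while self._sent and now - self._sent[0] >= self.period:
            self._sent.popleft()

    def acquire(self) -> float:
        """Block until a slot is free, record the request, and return seconds waited."""
        now = self._clock()
        self._evict(now)
        waited = 0.0
        while len(self._sent) >= self.max_calls:
            # Wait until the oldest recorded request leaves the window.
            delay = self._sent[0] + self.period - now
            self._sleep(delay)
            waited += delay
            now = self._clock()
            self._evict(now)
        self._sent.append(now)
        return waited
```

Calling `limiter.acquire()` immediately before each `requests.post(...)` keeps the client under the cap. A retry on 429 is still worth keeping, since the server's window may not line up exactly with the client's.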
Code: broken vs fixed
import requests
import os

api_key = os.environ.get('QWEN_API_KEY')
model = 'qwen2.5-coder-32b'
code_to_analyze = '''
def fibonacci(n):
    if n <= 1: return n
    return fibonacci(n-1) + fibonacci(n-2)  # O(2^n), inefficient
'''

# This will hit the rate limit on the free tier after ~30 requests without backoff
for i in range(50):
    response = requests.post(
        'https://api.qwen.alibaba.com/api/v1/chat/completions',
        headers={'Authorization': f'Bearer {api_key}'},
        json={
            'model': model,
            'messages': [
                {'role': 'user', 'content': f'Optimize this code:\n{code_to_analyze}'}
            ],
            'temperature': 0.7
        }
    )
    # BAD: no rate limit handling; the batch is abandoned at the first 429
    print(f'Request {i}: {response.status_code}')
    if response.status_code != 200:
        print(f'Error: {response.text}')
        break  # Stops immediately on 429

import requests
import os
import time
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

api_key = os.environ.get('QWEN_API_KEY')
model = 'qwen2.5-coder-32b'
code_to_analyze = '''
def fibonacci(n):
    if n <= 1: return n
    return fibonacci(n-1) + fibonacci(n-2)  # O(2^n), inefficient
'''


class RateLimitError(Exception):
    """Raised on HTTP 429 so that only rate-limit failures are retried."""


# FIXED: retry 429 responses with exponential backoff
@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(3),
    reraise=True  # surface the last RateLimitError instead of tenacity's RetryError
)
def call_qwen_with_backoff(prompt):
    response = requests.post(
        'https://api.qwen.alibaba.com/api/v1/chat/completions',
        headers={'Authorization': f'Bearer {api_key}'},
        json={
            'model': model,
            'messages': [
                {'role': 'user', 'content': prompt}
            ],
            'temperature': 0.7
        },
        timeout=30
    )
    if response.status_code == 429:
        # Honor Retry-After (assumed to be given in seconds) before tenacity adds its own backoff
        retry_after = int(response.headers.get('Retry-After', 60))
        time.sleep(retry_after)
        raise RateLimitError(f'Rate limited; waited {retry_after}s before retrying')
    response.raise_for_status()  # other HTTP errors are raised immediately, not retried
    return response.json()


# Process with built-in backoff; a single 429 no longer ends the batch
for i in range(50):
    try:
        result = call_qwen_with_backoff(f'Optimize this code:\n{code_to_analyze}')
        print(f'Request {i}: Success: {result["choices"][0]["message"]["content"][:50]}...')
        time.sleep(2)  # Additional: space requests 2+ seconds apart
    except Exception as e:
        print(f'Request {i}: Failed after retries: {e}')
        break
Workaround
If you cannot upgrade tier immediately: (1) Cache analysis results by code hash to avoid re-analyzing identical files. (2) Batch requests manually with a conservative cap, for example at most 25 files/day on the free tier. (3) Use a synchronous queue with an enforced 3-second delay between requests (about 20 requests/minute). (4) For testing, run Qwen2.5-Coder locally via Ollama (ollama pull qwen2.5-coder) to avoid any API rate limits during development. (5) Switch temporarily to another model such as Claude 3.5 Haiku or GPT-4o-mini during the free tier cooldown; check each provider's current limits for code tasks first.
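Workaround (1) can be sketched as an in-memory cache keyed by a SHA-256 hash of the source text, so identical files trigger only one API call. The names `content_key` and `AnalysisCache` are hypothetical, and `analyze` stands in for whatever function actually calls the API.

```python
import hashlib
from typing import Callable, Dict


def content_key(code: str) -> str:
    """Stable cache key: SHA-256 hex digest of the exact source text."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


class AnalysisCache:
    """Call `analyze` at most once per distinct piece of code."""

    def __init__(self, analyze: Callable[[str], str]) -> None:
        self._analyze = analyze              # hypothetical: the function that calls the API
        self._results: Dict[str, str] = {}
        self.misses = 0                      # number of real calls made so far

    def get(self, code: str) -> str:
        key = content_key(code)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._analyze(code)
        return self._results[key]
```

Because the key is derived from the exact text, any edit to a file (including whitespace) counts as new content. To keep results across runs, the dictionary could be written to disk or replaced by an external store.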
Prevention
Architect for rate limits from day one: (1) Implement retry handling at the HTTP client level (for example, tenacity wrapped around requests or httpx calls). (2) Cache code analysis results in Redis/DynamoDB keyed by content hash. (3) Use request queuing (Celery, RQ) with configurable per-minute throughput to stay under 30 req/min. (4) Monitor token burn rate: log tokens used after each request and stop when the daily total approaches 80K, which leaves a 20% margin under the 100K quota. (5) For production: upgrade to the paid tier immediately (Qwen API paid tier: ~$0.002/1K input tokens, 100+ requests/minute, no daily token limit). (6) Use Qwen2.5-Coder 7B via local Ollama for local inference (no API calls, no rate limits).
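Point (4) could be implemented along the lines of the sketch below, which counts tokens per calendar day and refuses new work once the 80K soft limit is reached. The class name `DailyTokenBudget` is hypothetical, and resetting on the local calendar date is an assumption; the API's quota may reset at a different time.

```python
import datetime as dt
from typing import Callable


class DailyTokenBudget:
    """Track tokens used per day and stop before the daily quota is reached."""

    def __init__(
        self,
        stop_at: int = 80_000,  # soft limit from the text; the quoted quota is 100K/day
        today: Callable[[], dt.date] = dt.date.today,
    ) -> None:
        self.stop_at = stop_at
        self._today = today
        self._day = today()
        self.used = 0

    def _roll_over(self) -> None:
        """Reset the counter when the date changes (assumed reset boundary)."""
        current = self._today()
        if current != self._day:
            self._day = current
            self.used = 0

    def record(self, tokens: int) -> None:
        """Add the tokens consumed by one completed request."""
        self._roll_over()
        self.used += tokens

    def can_send(self) -> bool:
        """True while the running total is still below the soft limit."""
        self._roll_over()
        return self.used < self.stop_at
```

After each successful call, the token count would come from the response body; if the API follows the OpenAI-style response shape (an assumption, suggested by the `choices[0].message.content` access in the code above), that is `result["usage"]["total_tokens"]`. Check `can_send()` before submitting the next request.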