How to handle concurrent LLM requests in FastAPI
Why this happens
FastAPI supports asynchronous request handling, but calling an LLM API synchronously inside an async route handler blocks the event loop. This hurts concurrency and response times: a synchronous SDK call or blocking HTTP client inside an async def endpoint forces the server to process requests sequentially rather than concurrently.
The symptoms are slow responses or timeouts under load, and you may see warnings about the event loop being blocked.
```python
from fastapi import FastAPI
from openai import OpenAI
import os

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.get("/generate")
async def generate():
    # Synchronous call inside an async endpoint blocks the event loop
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return {"text": response.choices[0].message.content}
```

The fix
Make your FastAPI route handlers async def and call the LLM through an async-capable SDK or HTTP client. The official OpenAI Python SDK ships an AsyncOpenAI client whose methods can be awaited, which lets FastAPI handle multiple LLM requests concurrently without blocking.
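If a provider's SDK is synchronous-only, one workaround is to offload the blocking call to a worker thread with `asyncio.to_thread` so the event loop stays free. A minimal sketch (`sync_sdk_call` is a hypothetical stand-in for a blocking SDK call):

```python
import asyncio

def sync_sdk_call(prompt: str) -> str:
    # Hypothetical stand-in for a blocking, synchronous SDK call
    return f"response to: {prompt}"

async def generate(prompt: str) -> str:
    # Run the blocking call in a worker thread, keeping the event loop free
    return await asyncio.to_thread(sync_sdk_call, prompt)
```

This keeps the endpoint awaitable even though the underlying call is synchronous, at the cost of one thread per in-flight call.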
The example below shows an async endpoint using the AsyncOpenAI client to await chat.completions.create. This improves throughput and reduces latency under concurrent load.
```python
from fastapi import FastAPI
from openai import AsyncOpenAI
import os

app = FastAPI()
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

@app.get("/generate")
async def generate():
    # Awaiting the async client lets other requests run while this one waits
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return {"text": response.choices[0].message.content}
```

Example response:

```json
{"text": "Hello! How can I assist you today?"}
```

Preventing it in production
To ensure robust concurrency in production, implement exponential backoff and retry logic for rate limits or transient errors. Use connection pooling and limit the number of concurrent requests to the LLM API to avoid hitting provider limits.
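A minimal sketch of both ideas, using a shared `asyncio.Semaphore` to cap in-flight calls and exponential backoff with jitter on failures (`MAX_CONCURRENT` and `with_backoff` are illustrative names, not part of any SDK):

```python
import asyncio
import random

MAX_CONCURRENT = 8  # cap on simultaneous LLM calls; tune to provider limits
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def with_backoff(coro_fn, retries=5, base_delay=0.5):
    """Run coro_fn() under the semaphore, retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            async with semaphore:  # limit concurrent calls
                return await coro_fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            # exponential backoff: base, 2x, 4x, ... plus a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)
```

In an endpoint this would wrap the SDK call, e.g. `await with_backoff(lambda: client.chat.completions.create(...))`, assuming an async client as shown above. In practice you would catch the provider's specific rate-limit exception rather than bare `Exception`.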
Consider using task queues or background workers for heavy or batch LLM calls. Monitor latency and error rates to adjust concurrency settings dynamically.
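For in-process batching, a queue-and-workers pattern can be sketched with `asyncio.Queue`; a production system would more likely use a dedicated task queue such as Celery or arq, and the `f"done: ..."` line stands in for the real async LLM call:

```python
import asyncio

async def worker(queue: asyncio.Queue, results: list) -> None:
    """Pull prompts off the queue and process them until cancelled."""
    while True:
        prompt = await queue.get()
        # Stand-in for the real async LLM call
        results.append(f"done: {prompt}")
        queue.task_done()

async def run_batch(prompts, n_workers: int = 2):
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # n_workers bounds how many prompts are processed concurrently
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for p in prompts:
        queue.put_nowait(p)
    await queue.join()  # wait until every queued prompt is processed
    for w in workers:
        w.cancel()
    return results
```

The worker count gives a second, independent knob for limiting concurrency against the provider, separate from the web server's request handling.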
Key Takeaways
- Always use async def endpoints in FastAPI for concurrent LLM calls.
- Use an async client (e.g. AsyncOpenAI) and await its methods to avoid blocking the event loop.
- Implement retries and backoff to handle rate limits gracefully.