How to · Intermediate · 3 min read

How to optimize FastAPI LLM endpoint latency

Quick answer
To optimize latency in a FastAPI endpoint serving large language models, use asynchronous API calls with async functions, enable HTTP connection reuse with persistent sessions, and implement request batching or caching. These techniques reduce wait times and improve throughput when calling LLM APIs like OpenAI or Anthropic.

PREREQUISITES

  • Python 3.8+
  • FastAPI
  • HTTPX or aiohttp for async HTTP requests
  • OpenAI or Anthropic API key
  • pip install fastapi uvicorn httpx

Setup

Install FastAPI and an async HTTP client like httpx to make non-blocking calls to the LLM API. Set your API key as an environment variable for secure access.

bash
pip install fastapi uvicorn httpx

Step by step

Use async def in your FastAPI route and httpx.AsyncClient to call the LLM API asynchronously. Reuse the HTTP client across requests to enable connection pooling and reduce TCP handshake overhead.

python
import os
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

# Reuse a single AsyncClient across requests for connection pooling
client = httpx.AsyncClient()

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # runs until shutdown, then close the pooled client
    await client.aclose()

# lifespan replaces the deprecated @app.on_event("shutdown") hook
app = FastAPI(lifespan=lifespan)

@app.post("/generate")
async def generate(prompt: str):
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    json_data = {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}]
    }
    response = await client.post(
        "https://api.openai.com/v1/chat/completions",
        headers=headers,
        json=json_data,
        timeout=10.0
    )
    response.raise_for_status()
    data = response.json()
    return {"response": data["choices"][0]["message"]["content"]}

# Run with: uvicorn filename:app --reload

Common variations

  • Use asyncio.gather to batch multiple LLM requests concurrently.
  • Implement caching with Redis or in-memory to avoid repeated calls for the same prompt.
  • Switch to other async HTTP clients like aiohttp if preferred.
  • Use different models like claude-3-5-haiku-20241022 with Anthropic SDK asynchronously.

python
import os

from anthropic import AsyncAnthropic
from fastapi import FastAPI

app = FastAPI()
# Use AsyncAnthropic (not Anthropic) so messages.create can be awaited
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@app.post("/generate_claude")
async def generate_claude(prompt: str):
    message = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}]
    )
    # message.content is a list of content blocks; return the first text block
    return {"response": message.content[0].text}
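The batching variation with asyncio.gather can be sketched without a live API. Here `call_llm` is a hypothetical stand-in for the real provider call (the `client.post` to the chat completions endpoint in the main example):

```python
import asyncio

# Hypothetical stand-in for a real LLM API call.
async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response to {prompt}"

async def generate_batch(prompts: list[str]) -> list[str]:
    # Fan out all calls concurrently: total wall time tracks the
    # slowest single call rather than the sum of all calls.
    return await asyncio.gather(*(call_llm(p) for p in prompts))

results = asyncio.run(generate_batch(["a", "b", "c"]))
```

Because `gather` preserves input order, the results line up with the prompts even though the calls complete in any order.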

Troubleshooting

  • If you see TimeoutError, increase the HTTP client timeout or check network connectivity.
  • For ConnectionError, ensure persistent client reuse and avoid creating a new client per request.
  • High latency may be due to synchronous blocking calls; verify all LLM calls are async.
  • Use logging to measure request durations and identify bottlenecks.

Key Takeaways

  • Use asynchronous HTTP clients like httpx.AsyncClient to avoid blocking the FastAPI event loop.
  • Reuse HTTP client instances to enable connection pooling and reduce latency.
  • Batch multiple LLM requests concurrently with asyncio.gather for throughput gains.
  • Implement caching to prevent redundant LLM API calls for repeated prompts.
  • Monitor and adjust HTTP timeouts and handle exceptions to maintain endpoint reliability.
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022