How to run LLM inference as a background task in FastAPI
Quick answer
Use FastAPI's `BackgroundTasks` to run LLM inference after the response is sent, without blocking the request. Integrate the OpenAI SDK inside a background function and hand that function to `background_tasks.add_task()` in your endpoint.

Prerequisites

- Python 3.8+
- FastAPI
- Uvicorn
- OpenAI API key
- `pip install fastapi uvicorn "openai>=1.0"`
Setup
Install the required packages and set your OpenAI API key as an environment variable (e.g. `export OPENAI_API_KEY="your-key"` on macOS/Linux).

- Install FastAPI, Uvicorn, and the OpenAI SDK (quote the version specifier so the shell does not treat `>` as a redirect):

```bash
pip install fastapi uvicorn "openai>=1.0"
```

Step by step
This example shows how to run an OpenAI LLM inference as a background task in FastAPI using `BackgroundTasks`. The endpoint returns immediately, and the inference runs after the HTTP response is sent; the async OpenAI client keeps the event loop free while the task awaits the API call.
```python
import os

from fastapi import BackgroundTasks, FastAPI
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_llm_inference(prompt: str):
    # Use the async client here: a sync client inside an async
    # background task would block the event loop during the API call.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Here you could save the result to a database or log it
    print("LLM response:", response.choices[0].message.content)

@app.post("/generate")
async def generate_text(prompt: str, background_tasks: BackgroundTasks):
    # The task runs after the response below has been sent
    background_tasks.add_task(run_llm_inference, prompt)
    return {"message": "Inference started in background"}
```
To run:

```bash
uvicorn filename:app --reload
```

Output:

```json
{"message": "Inference started in background"}
```
Common variations

- Use a synchronous function declared with plain `def` if your LLM client is sync; FastAPI runs sync background tasks in a thread pool, as the Claude example below does.
- Switch to other models like `claude-3-5-sonnet-20241022` by changing the client and the model parameter, as shown below.
- Use FastAPI's `BackgroundTasks` for simple background jobs, or Celery for distributed task queues (see the sketch after the Claude example).
```python
import os

from anthropic import Anthropic
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def run_claude_inference(prompt: str):
    # A plain `def` is fine here: FastAPI runs sync background tasks
    # in a thread pool, so the sync client never blocks the event loop.
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content is a list of content blocks; take the first text block
    print("Claude response:", message.content[0].text)

@app.post("/generate-claude")
async def generate_claude(prompt: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(run_claude_inference, prompt)
    return {"message": "Claude inference started in background"}
```

Output:
{"message": "Claude inference started in background"} Troubleshooting
Troubleshooting

- If background tasks do not run to completion, avoid `--reload` outside development: the reloader restarts the server process on file changes, which can kill in-flight background tasks.
- Check that your API key environment variable is set correctly to avoid authentication errors.
- Use logging instead of print statements for production to capture background task outputs.
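For example, the background function from the first example could be reworked to use logging and to catch errors (a sketch; the logger name `llm_tasks` is arbitrary):

```python
import logging
import os

from openai import AsyncOpenAI

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_tasks")

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_llm_inference(prompt: str):
    try:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        logger.info("LLM response: %s", response.choices[0].message.content)
    except Exception:
        # Exceptions in background tasks never surface in an HTTP
        # response, so log them explicitly or they vanish silently.
        logger.exception("LLM inference failed for prompt %r", prompt)
```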
Key takeaways

- Use FastAPI's `BackgroundTasks` to run LLM inference asynchronously without blocking requests.
- Always read API keys from environment variables for security and flexibility.
- For heavy or distributed workloads, consider task queues like Celery instead of `BackgroundTasks`.
- Switch models easily by changing the client and the model parameter in your background function.