How to run LLM inference as a background task in FastAPI
Quick answer
Use FastAPI's `BackgroundTasks` to run LLM inference after the response is sent, without blocking the request. Integrate the OpenAI SDK inside a background function and hand that function to `background_tasks.add_task()` in your endpoint.

Prerequisites

- Python 3.8+
- FastAPI
- Uvicorn
- OpenAI API key
- `pip install fastapi uvicorn "openai>=1.0"`
Setup
Install the required packages and set your OpenAI API key as an environment variable (e.g. `export OPENAI_API_KEY="your-key"` on macOS/Linux).

- Install FastAPI, Uvicorn, and the OpenAI SDK (quote the version specifier so the shell does not treat `>` as a redirect):

```bash
pip install fastapi uvicorn "openai>=1.0"
```

Step by step
This example shows how to run an OpenAI LLM inference as a background task in FastAPI using `BackgroundTasks`. The endpoint returns immediately, and the inference runs after the HTTP response is sent; the async OpenAI client keeps the event loop free while the task awaits the API call.
```python
import os

from fastapi import BackgroundTasks, FastAPI
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_llm_inference(prompt: str):
    # Use the async client here: a sync client inside an async
    # background task would block the event loop during the API call.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Here you could save the result to a database or log it
    print("LLM response:", response.choices[0].message.content)

@app.post("/generate")
async def generate_text(prompt: str, background_tasks: BackgroundTasks):
    # The task runs after the response below has been sent
    background_tasks.add_task(run_llm_inference, prompt)
    return {"message": "Inference started in background"}
```
To run:

```bash
uvicorn filename:app --reload
```

Output:

```json
{"message": "Inference started in background"}
```
Common variations

- Use a synchronous function declared with plain `def` if your LLM client is sync; FastAPI runs sync background tasks in a thread pool, as the Claude example below does.
- Switch to other models like `claude-3-5-sonnet-20241022` by changing the client and the model parameter, as shown below.
- Use FastAPI's `BackgroundTasks` for simple background jobs, or Celery for distributed task queues (see the sketch after the Claude example).
```python
import os

from anthropic import Anthropic
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def run_claude_inference(prompt: str):
    # A plain `def` is fine here: FastAPI runs sync background tasks
    # in a thread pool, so the sync client never blocks the event loop.
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content is a list of content blocks; take the first text block
    print("Claude response:", message.content[0].text)

@app.post("/generate-claude")
async def generate_claude(prompt: str, background_tasks: BackgroundTasks):
    background_tasks.add_task(run_claude_inference, prompt)
    return {"message": "Claude inference started in background"}
```

Output:
{"message": "Claude inference started in background"} Troubleshooting
Troubleshooting

- If background tasks do not run to completion, avoid `--reload` outside development: the reloader restarts the server process on file changes, which can kill in-flight background tasks.
- Check that your API key environment variable is set correctly to avoid authentication errors.
- Use logging instead of print statements for production to capture background task outputs.
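For example, the background function from the first example could be reworked to use logging and to catch errors (a sketch; the logger name `llm_tasks` is arbitrary):

```python
import logging
import os

from openai import AsyncOpenAI

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_tasks")

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_llm_inference(prompt: str):
    try:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        logger.info("LLM response: %s", response.choices[0].message.content)
    except Exception:
        # Exceptions in background tasks never surface in an HTTP
        # response, so log them explicitly or they vanish silently.
        logger.exception("LLM inference failed for prompt %r", prompt)
```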
Key takeaways

- Use FastAPI's `BackgroundTasks` to run LLM inference asynchronously without blocking requests.
- Always read API keys from environment variables for security and flexibility.
- For heavy or distributed workloads, consider task queues like Celery instead of `BackgroundTasks`.
- Switch models easily by changing the client and the model parameter in your background function.