How to batch LLM requests in Python
Quick answer
Batch LLM requests in Python by sending one chat.completions.create call per prompt and running those calls concurrently, either with the async OpenAI client or a thread pool. For large offline workloads, the OpenAI Batch API accepts a file of requests and processes them at a discounted rate. Note that packing several prompts into the messages list of a single call does not batch them: messages represents one conversation, and the API returns a single completion for it.
Prerequisites
- Python 3.8+
- An OpenAI API key
- pip install openai>=1.0
Setup
Install the official openai Python package (v1+) and set your API key as an environment variable for secure authentication.
pip install openai>=1.0
output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
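Before making any calls, it can help to fail fast when the key is missing. This small helper is an illustrative addition (the function name is my own, not part of the SDK):

```python
import os

def check_api_key() -> bool:
    # Fail fast with a clear message if OPENAI_API_KEY is not set,
    # instead of getting an authentication error mid-batch.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        print("OPENAI_API_KEY is not set; export it before running the examples.")
        return False
    return True
```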
Step by step
Loop over your prompts and send one chat.completions.create call per prompt. Each call takes a single-message conversation and returns one completion; collecting the results in loop order keeps each response matched to its prompt.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Prompts to batch; each one becomes its own request
prompts = [
    "Translate 'Hello' to French.",
    "Summarize the benefits of AI.",
    "Write a haiku about spring.",
]

# Send one call per prompt; messages is a single conversation,
# so each request carries exactly one user message
responses = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    responses.append(response)

# Extract and print each completion in prompt order
for i, response in enumerate(responses, start=1):
    print(f"Response {i}:", response.choices[0].message.content)
output:
Response 1: Bonjour
Response 2: AI improves efficiency, automates tasks, and enables new insights.
Response 3: Blossoms gently fall, Spring whispers in soft breezes, Nature's poem blooms.
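If you prefer to stay synchronous, per-prompt calls can also run in parallel on a thread pool. The sketch below uses a stand-in complete() function (an assumption for illustration; in practice it would wrap client.chat.completions.create) so the orchestration pattern is clear and self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def complete(prompt: str) -> str:
    # Stand-in for a real completion call; swap in a call to
    # client.chat.completions.create when using the API.
    return f"answer to: {prompt}"

def batch_complete(prompts, max_workers=4):
    # executor.map runs complete() concurrently across threads
    # but still yields results in the order of the input prompts.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(complete, prompts))
```

Because executor.map preserves input order, results line up with prompts without any extra bookkeeping.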
Common variations
- Async requests: Use the AsyncOpenAI client with asyncio.gather to run requests concurrently for better throughput.
- Streaming: Batched streaming is not supported; stream each request individually.
- Different models: Change the model parameter to any supported OpenAI model, such as gpt-4o-mini or gpt-4o. Models from other providers (for example Anthropic's claude-3-5-sonnet-20241022) require that provider's own SDK.
- SDK alternatives: Anthropic and other providers support similar batching; Anthropic's Message Batches API likewise accepts many independent requests in one submission.
import asyncio
import os
from openai import AsyncOpenAI

async def batch_async():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompts = [
        "Explain recursion.",
        "What is RAG in AI?",
    ]
    # Create one coroutine per prompt and run them concurrently;
    # asyncio.gather returns results in the order of the inputs
    tasks = [
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    for i, response in enumerate(responses, start=1):
        print(f"Async response {i}:", response.choices[0].message.content)

asyncio.run(batch_async())
output:
Async response 1: Recursion is a programming technique where a function calls itself to solve smaller instances of a problem.
Async response 2: RAG stands for Retrieval-Augmented Generation, combining retrieval of documents with LLM generation.
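For large jobs that do not need immediate results, the OpenAI Batch API takes a JSONL file with one request per line. The sketch below builds that file; the upload and submission calls are shown as comments since they require a live API key (the field names follow the documented batch input format, but treat the exact shapes as something to verify against current docs):

```python
import json

def batch_request_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    # One JSONL line per request; custom_id lets you match
    # results back to prompts when the batch completes.
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

prompts = ["Explain recursion.", "What is RAG in AI?"]
lines = [batch_request_line(f"req-{i}", p) for i, p in enumerate(prompts, start=1)]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))

# Submitting the file then looks roughly like this (requires an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
#   batch = client.batches.create(
#       input_file_id=batch_file.id,
#       endpoint="/v1/chat/completions",
#       completion_window="24h",
#   )
```

The trade-off is latency for cost: batch jobs can take hours to complete but are billed at a discount compared to synchronous calls.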
Troubleshooting
- If you receive a 400 Bad Request, ensure your messages parameter is a list of valid message objects with role and content keys.
- If you send too many concurrent requests, you may hit rate limits (429 errors) or per-request token limits; reduce concurrency or split the batch into smaller chunks.
- Check that your API key environment variable is set correctly to avoid authentication errors.
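One way to stay under rate limits is to cap how many requests are in flight with an asyncio.Semaphore. This sketch uses a stand-in complete() coroutine (an illustrative assumption; in practice it would await a call on an AsyncOpenAI client):

```python
import asyncio

async def complete(prompt: str) -> str:
    # Stand-in for an async completion call.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def bounded_batch(prompts, limit: int = 2):
    # The semaphore caps concurrent in-flight requests, which helps
    # avoid 429 rate-limit errors on large batches.
    sem = asyncio.Semaphore(limit)

    async def bounded(prompt):
        async with sem:
            return await complete(prompt)

    # gather still returns results in input order.
    return await asyncio.gather(*(bounded(p) for p in prompts))

results = asyncio.run(bounded_batch(["a", "b", "c", "d"]))
print(results)
```

For production use you would typically pair this with retries and exponential backoff on 429 responses.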
Key Takeaways
- Each chat.completions.create call handles one conversation and returns one completion; batch by sending one call per prompt.
- Use AsyncOpenAI with asyncio.gather for concurrent processing to improve throughput.
- For large offline workloads, the OpenAI Batch API processes a file of requests at a discounted rate.
- Respect token and request size limits by controlling batch size.
- Always validate message format and environment variables to avoid common errors.