How to batch requests in vLLM
Quick answer

Use the vllm.LLM class to batch multiple prompts by passing a list of strings to the generate() method along with SamplingParams. This runs inference on all prompts in a single batch efficiently. For example, llm.generate(["prompt1", "prompt2"], SamplingParams()) returns a list of outputs, one per prompt.

Prerequisites

- Python 3.8+
- pip install vllm
- Basic knowledge of Python async (optional)
Setup
Install the vllm package via pip and ensure you have Python 3.8 or newer.
Run the following command to install:
pip install vllm

Step by step
Use the LLM class from vllm to batch multiple prompts. Pass a list of prompt strings to generate() along with SamplingParams. The method returns a list of outputs corresponding to each prompt.
```python
from vllm import LLM, SamplingParams

# Initialize the LLM with a model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# List of prompts to batch
prompts = [
    "Write a poem about spring.",
    "Explain quantum computing in simple terms.",
    "Generate a Python function to reverse a string.",
]

# Generate outputs for all prompts in a single batch
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=100))

# Print each output
for i, output in enumerate(outputs):
    print(f"Output for prompt {i + 1}:")
    print(output.outputs[0].text)
    print("---")
```

Output
Output for prompt 1:
A gentle breeze, the flowers bloom,
Springtime dances, life resumes.
---
Output for prompt 2:
Quantum computing uses quantum bits, or qubits, which can be in multiple states at once, enabling powerful parallel computations.
---
Output for prompt 3:
def reverse_string(s):
    return s[::-1]
---

Common variations
You can batch requests asynchronously by offloading generate() to an executor when integrating with async frameworks. You can also change the model by passing a different model name when initializing LLM, and adjust SamplingParams to control temperature, max tokens, and other generation settings.
```python
import asyncio
from vllm import LLM, SamplingParams

async def async_batch():
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    prompts = ["Hello world", "What is AI?"]
    # Run the blocking generate() call in the default executor so it
    # does not block the event loop
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(
        None, lambda: llm.generate(prompts, SamplingParams(max_tokens=50))
    )
    for output in outputs:
        print(output.outputs[0].text)

# To run async_batch(), use: asyncio.run(async_batch())
```

Output
Hello world generated text... AI is the simulation of human intelligence by machines...
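The executor pattern above is not specific to vLLM: any blocking call can be offloaded the same way. The sketch below demonstrates the pattern in a self-contained form, using a hypothetical stand-in function fake_generate instead of llm.generate() so it runs without a GPU or model download:

```python
import asyncio

# Hypothetical stand-in for a blocking call such as llm.generate()
def fake_generate(prompts):
    return [p.upper() for p in prompts]

async def async_batch_demo():
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread-pool executor
    return await loop.run_in_executor(None, fake_generate, ["hello", "world"])

print(asyncio.run(async_batch_demo()))  # → ['HELLO', 'WORLD']
```

Swapping fake_generate for a lambda that calls llm.generate() gives the vLLM version shown above.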
Troubleshooting
- If you get a ModuleNotFoundError, ensure vllm is installed with pip install vllm.
- If the model fails to load, verify the model name is correct and the model files are accessible.
- For memory errors, reduce the batch size or max_tokens in SamplingParams.
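One way to reduce batch size without changing the rest of your code is to split the prompt list into smaller chunks and call generate() once per chunk. A minimal sketch of the chunking part; the chunked helper is not part of vLLM, just plain Python:

```python
def chunked(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage with vLLM (assumes llm and SamplingParams from the examples above):
# for batch in chunked(prompts, 8):
#     outputs.extend(llm.generate(batch, SamplingParams(max_tokens=100)))

print(list(chunked(["a", "b", "c", "d", "e"], 2)))  # → [['a', 'b'], ['c', 'd'], ['e']]
```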
Key Takeaways
- Batch multiple prompts by passing a list of strings to LLM.generate() with SamplingParams.
- Use vllm to efficiently run inference on batches, reducing overhead and latency.
- Adjust SamplingParams to control generation behavior per batch.
- Async batching can be done by running generate() in an async executor.
- Troubleshoot by checking installation, model names, and resource limits.