How to batch requests in vLLM
Quick answer

Use the vllm.LLM class to batch multiple prompts by passing a list of strings to the generate() method along with SamplingParams. This runs inference on all prompts in a single batch efficiently. For example, llm.generate(["prompt1", "prompt2"], SamplingParams()) returns a list of outputs, one per prompt.

Prerequisites

- Python 3.8+
- pip install vllm
- Basic knowledge of Python async (optional)
Setup
Install the vllm package via pip and ensure you have Python 3.8 or newer.
Run the following command to install:
pip install vllm

Step by step
Use the LLM class from vllm to batch multiple prompts. Pass a list of prompt strings to generate() along with SamplingParams. The method returns a list of outputs corresponding to each prompt.
```python
from vllm import LLM, SamplingParams

# Initialize the LLM with a model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# List of prompts to batch
prompts = [
    "Write a poem about spring.",
    "Explain quantum computing in simple terms.",
    "Generate a Python function to reverse a string.",
]

# Generate outputs for all prompts in a single batch
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=100))

# Print each output
for i, output in enumerate(outputs):
    print(f"Output for prompt {i + 1}:")
    print(output.outputs[0].text)
    print("---")
```

Output
Output for prompt 1:
A gentle breeze, the flowers bloom,
Springtime dances, life resumes.
---
Output for prompt 2:
Quantum computing uses quantum bits, or qubits, which can be in multiple states at once, enabling powerful parallel computations.
---
Output for prompt 3:
def reverse_string(s):
    return s[::-1]
---

Common variations
You can batch requests asynchronously by offloading generate() to an executor when integrating with async frameworks. You can also change the model by passing a different model name when initializing LLM, and adjust SamplingParams to control temperature, max tokens, and other generation settings.
```python
import asyncio
from vllm import LLM, SamplingParams

async def async_batch():
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    prompts = ["Hello world", "What is AI?"]
    # Run the blocking generate() call in the default executor so it
    # does not block the event loop
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(
        None, lambda: llm.generate(prompts, SamplingParams(max_tokens=50))
    )
    for output in outputs:
        print(output.outputs[0].text)

# To run async_batch(), use: asyncio.run(async_batch())
```

Output
Hello world generated text... AI is the simulation of human intelligence by machines...
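The executor pattern above is not specific to vLLM: any blocking call can be offloaded the same way. The sketch below demonstrates the pattern in a self-contained form, using a hypothetical stand-in function fake_generate instead of llm.generate() so it runs without a GPU or model download:

```python
import asyncio

# Hypothetical stand-in for a blocking call such as llm.generate()
def fake_generate(prompts):
    return [p.upper() for p in prompts]

async def async_batch_demo():
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread-pool executor
    return await loop.run_in_executor(None, fake_generate, ["hello", "world"])

print(asyncio.run(async_batch_demo()))  # → ['HELLO', 'WORLD']
```

Swapping fake_generate for a lambda that calls llm.generate() gives the vLLM version shown above.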
Troubleshooting
- If you get a ModuleNotFoundError, ensure vllm is installed with pip install vllm.
- If the model fails to load, verify the model name is correct and the model files are accessible.
- For memory errors, reduce the batch size or max_tokens in SamplingParams.
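One way to reduce batch size without changing the rest of your code is to split the prompt list into smaller chunks and call generate() once per chunk. A minimal sketch of the chunking part; the chunked helper is not part of vLLM, just plain Python:

```python
def chunked(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage with vLLM (assumes llm and SamplingParams from the examples above):
# for batch in chunked(prompts, 8):
#     outputs.extend(llm.generate(batch, SamplingParams(max_tokens=100)))

print(list(chunked(["a", "b", "c", "d", "e"], 2)))  # → [['a', 'b'], ['c', 'd'], ['e']]
```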
Key Takeaways
- Batch multiple prompts by passing a list of strings to LLM.generate() with SamplingParams.
- Use vllm to efficiently run inference on batches, reducing overhead and latency.
- Adjust SamplingParams to control generation behavior per batch.
- Async batching can be done by running generate() in an async executor.
- Troubleshoot by checking installation, model names, and resource limits.