How-to · Beginner · 3 min read

How to batch requests in vLLM

Quick answer
Pass a list of prompt strings to the generate() method of the vllm.LLM class, along with a SamplingParams object; vLLM schedules all of the prompts together, so the whole batch runs in a single call. For example, llm.generate(["prompt1", "prompt2"], SamplingParams()) returns a list of outputs, one per prompt, in the same order as the input.

PREREQUISITES

  • Python 3.8+
  • pip install vllm
  • Optional: basic familiarity with Python asyncio (only needed for the async variation)

Setup

Install the vllm package via pip and ensure you have Python 3.8 or newer.

Run the following command to install:

bash
pip install vllm

Step by step

Use the LLM class from vllm to batch multiple prompts. Pass a list of prompt strings to generate() along with SamplingParams. The method returns a list of RequestOutput objects, one per prompt, in the same order as the input list.

python
from vllm import LLM, SamplingParams

# Initialize the LLM with a model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# List of prompts to batch
prompts = [
    "Write a poem about spring.",
    "Explain quantum computing in simple terms.",
    "Generate a Python function to reverse a string."
]

# Generate outputs for all prompts in a single batch
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=100))

# Print each output
for i, output in enumerate(outputs):
    print(f"Output for prompt {i+1}:")
    print(output.outputs[0].text)
    print("---")
output
Output for prompt 1:
A gentle breeze, the flowers bloom,
Springtime dances, life resumes.
---
Output for prompt 2:
Quantum computing uses quantum bits, or qubits, which can be in multiple states at once, enabling powerful parallel computations.
---
Output for prompt 3:
def reverse_string(s):
    return s[::-1]
---
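
Each item in the returned list corresponds to the prompt at the same index, and also carries the original prompt in its .prompt attribute. The sketch below mimics that result shape with SimpleNamespace stand-ins (no GPU or model needed) to show one way to collect prompt-to-completion pairs; the collect_texts helper is ours for illustration, not part of vLLM.

```python
from types import SimpleNamespace

def collect_texts(outputs):
    """Map each request's prompt to the text of its first completion."""
    return {out.prompt: out.outputs[0].text for out in outputs}

# Stand-ins shaped like vLLM's RequestOutput / CompletionOutput objects.
fake_outputs = [
    SimpleNamespace(prompt="Write a poem about spring.",
                    outputs=[SimpleNamespace(text="A gentle breeze...")]),
    SimpleNamespace(prompt="What is AI?",
                    outputs=[SimpleNamespace(text="AI is...")]),
]

print(collect_texts(fake_outputs))
```

With real vLLM results, you would pass the list returned by llm.generate() straight into the same helper.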

Common variations

vLLM's generate() call is synchronous, but you can use it from async code by offloading it to a thread pool, as shown below; for fully asynchronous serving, vLLM also provides an AsyncLLMEngine. You can also change the model by specifying a different model name when initializing LLM, and adjust SamplingParams to control temperature, max tokens, and other generation parameters.

python
import asyncio
from vllm import LLM, SamplingParams

async def async_batch():
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    prompts = ["Hello world", "What is AI?"]
    # Offload the blocking generate() call to a worker thread so the event loop stays free
    loop = asyncio.get_running_loop()
    outputs = await loop.run_in_executor(None, lambda: llm.generate(prompts, SamplingParams(max_tokens=50)))
    for output in outputs:
        print(output.outputs[0].text)

# To run async_batch(), use: asyncio.run(async_batch())
output
Hello world generated text...
AI is the simulation of human intelligence by machines...
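
On Python 3.9+, asyncio.to_thread is a shorter alternative to run_in_executor for offloading a blocking call. The sketch below uses a stand-in blocking function in place of llm.generate so it runs without a model; substitute your real llm.generate call in its place.

```python
import asyncio
import time

def blocking_generate(prompts):
    """Stand-in for llm.generate: blocks briefly, returns one string per prompt."""
    time.sleep(0.1)  # simulate inference latency
    return [f"completion for: {p}" for p in prompts]

async def async_batch(prompts):
    # to_thread runs the blocking call in a worker thread,
    # keeping the event loop free for other coroutines.
    return await asyncio.to_thread(blocking_generate, prompts)

results = asyncio.run(async_batch(["Hello world", "What is AI?"]))
print(results)
```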

Troubleshooting

  • If you get a ModuleNotFoundError, ensure vllm is installed with pip install vllm.
  • If the model fails to load, verify the model name is correct and the model files are accessible.
  • For memory errors, reduce batch size or max tokens in SamplingParams.

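If a full batch does not fit in GPU memory, one option is to split the prompt list into smaller mini-batches and call generate() once per chunk. The chunked helper below is a plain-Python sketch (the name is ours, not vLLM's); the llm.generate call is shown in a comment because it needs a loaded model, with a stand-in so the sketch runs end to end.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

prompts = [f"prompt {i}" for i in range(10)]

all_outputs = []
for batch in chunked(prompts, 4):
    # With a loaded model you would run:
    # all_outputs.extend(llm.generate(batch, sampling_params))
    all_outputs.extend(batch)  # stand-in so the sketch runs without a model

print(len(all_outputs))
```

Outputs still arrive in prompt order because each chunk preserves its order and the chunks are processed sequentially.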
Key Takeaways

  • Batch multiple prompts by passing a list of strings to LLM.generate() with SamplingParams.
  • Use vllm to efficiently run inference on batches, reducing per-request overhead and increasing throughput.
  • Adjust SamplingParams to control generation behavior per batch.
  • Async batching can be done by running generate() in an async executor.
  • Troubleshoot by checking installation, model names, and resource limits.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct