How-to · Intermediate · 3 min read

How to profile vLLM performance

Quick answer
To profile vLLM performance, time llm.generate() calls with SamplingParams to measure batch latency, then derive throughput as prompts (or tokens) per second. Python's time module covers basic timing; profilers such as cProfile capture more detail during batch generation.

PREREQUISITES

  • Python 3.8+
  • pip install vllm
  • Basic knowledge of Python timing and profiling

Setup

Install the vllm package and make sure you are running Python 3.8 or higher. No API key is required for local inference.

bash
pip install vllm

Step by step

This example shows how to measure latency and throughput of vLLM inference by timing batch generation calls.

python
from vllm import LLM, SamplingParams
import time

# Initialize the LLM with a local model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Prepare prompts
prompts = ["Hello, how are you?", "What is the capital of France?", "Explain quantum computing."]

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

# Measure start time (perf_counter is preferred over time.time for intervals)
start_time = time.perf_counter()

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Measure end time
end_time = time.perf_counter()

# Calculate latency and throughput
latency = end_time - start_time
throughput = len(prompts) / latency

# Print results
for i, output in enumerate(outputs):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {output.outputs[0].text.strip()}\n")

print(f"Total latency: {latency:.3f} seconds")
print(f"Throughput: {throughput:.2f} prompts/second")
output
Prompt 1: Hello, how are you?
Response: I'm doing well, thank you! How can I assist you today?

Prompt 2: What is the capital of France?
Response: The capital of France is Paris.

Prompt 3: Explain quantum computing.
Response: Quantum computing is a type of computation that uses quantum-mechanical phenomena such as superposition and entanglement to perform operations on data.

Total latency: 2.345 seconds
Throughput: 1.28 prompts/second
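
Prompt-level throughput can understate performance when responses vary in length, so token-level throughput is often more informative. A minimal sketch, assuming each vLLM RequestOutput exposes outputs[0].token_ids (as current vLLM versions do); the stand-in objects below only mimic that shape for illustration:

```python
from types import SimpleNamespace

def tokens_per_second(outputs, latency):
    """Sum generated tokens across all outputs and divide by elapsed time."""
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return total_tokens / latency

# Stand-in objects shaped like vLLM's RequestOutput (50 and 30 tokens)
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(token_ids=[1] * 50)]),
    SimpleNamespace(outputs=[SimpleNamespace(token_ids=[1] * 30)]),
]
print(f"{tokens_per_second(fake_outputs, 2.0):.1f} tokens/second")  # 40.0
```

With real vLLM results, pass the list returned by llm.generate() and the measured latency directly.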

Common variations

You can profile vLLM asynchronously with AsyncLLMEngine, or compare models by changing the model parameter. For server-based usage, start a vLLM server and time requests against its OpenAI-compatible API.

python
# CLI: Start vLLM server
# vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Python: Query running server with timing
from openai import OpenAI
import time

# A local vLLM server ignores the API key unless started with --api-key
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

messages = [{"role": "user", "content": "Explain AI."}]

start = time.perf_counter()
response = client.chat.completions.create(model="meta-llama/Llama-3.1-8B-Instruct", messages=messages)
end = time.perf_counter()

print("Response:", response.choices[0].message.content)
print(f"Latency: {end - start:.3f} seconds")
output
Response: AI, or artificial intelligence, refers to the simulation of human intelligence in machines...
Latency: 0.456 seconds
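
A single request says little about tail behavior. One common approach is to time repeated calls and report percentile latencies. A sketch using only the standard library; the lambda workload below is a hypothetical stand-in for the client.chat.completions.create call above:

```python
import statistics
import time

def measure_latencies(request_fn, n=20):
    """Time n calls to request_fn and return per-call latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        request_fn()
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Return (p50, p95) latency in seconds."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
    return statistics.median(latencies), cuts[94]  # index 94 = 95th percentile

# Dummy CPU-bound workload standing in for a real API request
lats = measure_latencies(lambda: sum(range(10_000)), n=20)
p50, p95 = summarize(lats)
print(f"p50: {p50 * 1000:.2f} ms, p95: {p95 * 1000:.2f} ms")
```

Swap the lambda for a closure that issues the real chat completion request to profile the running server.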

Troubleshooting

If you see unusually high latency, check your hardware utilization and ensure your GPU drivers are up to date. For memory errors, reduce max_tokens or batch size. Use Python profilers like cProfile for deeper analysis.
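
cProfile shows where time goes inside a call. A minimal sketch profiling an arbitrary function; the workload here is a stand-in, so substitute your llm.generate(prompts, sampling_params) call:

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for llm.generate(prompts, sampling_params)
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the 5 most time-consuming functions by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

For GPU-bound inference, pair this with system-level monitoring (e.g. nvidia-smi), since cProfile only accounts for Python-side time.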

Key Takeaways

  • Use Python's time module to measure vLLM inference latency and throughput.
  • Batch multiple prompts to get accurate throughput metrics.
  • Start a vLLM server for scalable profiling via OpenAI-compatible API calls.
  • Adjust SamplingParams and batch size to optimize performance and memory usage.
  • Use system monitoring and Python profilers for detailed bottleneck analysis.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct