Throughput comparison: vLLM vs transformers
Why this matters
As a developer, you need to know whether your inference setup can handle production load. A model running at 50 tokens/sec in a plain transformers loop might hit 500 tokens/sec in aggregate on vLLM once requests are batched, with the same weights and only a small change to the calling code. That's the difference between a responsive chatbot and a frustrating one.
Explanation
What it is: vLLM is a serving engine that batches multiple requests together on a GPU and manages their attention memory with PagedAttention, while transformers loads the model and runs it directly, typically in a simple loop. Both run the same Llama model; vLLM just schedules the work more efficiently.
How it works mechanically: When 10 requests arrive, a naive transformers loop processes them one by one, so each forward pass serves a single sequence and most of the GPU's compute sits unused. vLLM splits the KV cache (the attention memory) into fixed-size blocks, packs multiple requests into one GPU batch, and fills slots freed by finished requests with new ones, like filling empty seats on a bus instead of driving one passenger at a time. This reduces memory fragmentation and can raise GPU utilization substantially (for example, from roughly 30% to 80% or more, depending on model and workload).
When to use it: Use vLLM if you have concurrent users, API endpoints, or batch jobs. Use transformers if you're prototyping locally, running inference once, or have hard latency requirements for single requests (vLLM adds a small but non-zero overhead per request).
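To make the block-based bookkeeping concrete, here is a toy sketch in plain Python. It is not vLLM's implementation: the 16-token block size, the 2048-token contiguous reservation, and the function names are assumptions chosen for illustration. It compares how many KV-cache token slots are reserved when every request pre-allocates a maximum-length region versus when slots are handed out in fixed-size blocks as needed.

```python
import math

BLOCK_SIZE = 16      # tokens per KV-cache block (illustrative assumption)
MAX_SEQ_LEN = 2048   # contiguous reservation per request (illustrative assumption)


def contiguous_slots(seq_lens):
    """Slots reserved when every request pre-allocates MAX_SEQ_LEN tokens."""
    return len(seq_lens) * MAX_SEQ_LEN


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Slots reserved when each request takes only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved, used):
    """Fraction of reserved slots that hold no token."""
    return (reserved - used) / reserved


seq_lens = [37, 120, 512, 9]   # current lengths of four in-flight requests
used = sum(seq_lens)

contig = contiguous_slots(seq_lens)
paged = paged_slots(seq_lens)

print(f"contiguous: {contig} slots, {waste_fraction(contig, used):.0%} wasted")
print(f"paged:      {paged} slots, {waste_fraction(paged, used):.0%} wasted")
```

In this toy setup the paged scheme wastes at most one partially filled block per request, which is why more requests fit in the same memory and can share a batch.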
Analogy
Transformers is like a coffee shop serving one customer at a time: safe, simple, slow. vLLM is a fast-casual place with an ordering queue: it takes several orders at once, makes them in parallel, and starts the next order as soon as a station frees up. Same espresso quality, several times the throughput.
Code
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.cuda.empty_cache()
model_id = 'meta-llama/Llama-3.2-1B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto'
)

prompt = 'Explain quantum computing in one sentence.'
prompts = [prompt] * 4

print('=== transformers (sequential) ===')
total_tokens = 0
start = time.perf_counter()
for p in prompts:
    inputs = tokenizer(p, return_tensors='pt').to(model.device)
    with torch.no_grad():
        # Pass input_ids and attention_mask together.
        outputs = model.generate(
            **inputs,
            max_new_tokens=30,
            pad_token_id=tokenizer.eos_token_id
        )
    # Count tokens actually generated; generation can stop early at EOS.
    total_tokens += outputs.shape[1] - inputs['input_ids'].shape[1]
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f'Generated: {decoded[:60]}...')
transformers_time = time.perf_counter() - start
print(f'Total time (transformers): {transformers_time:.2f}s')
print(f'Throughput: {total_tokens / transformers_time:.1f} tokens/sec\n')

print('=== vLLM (batched, if installed) ===')
try:
    from vllm import LLM, SamplingParams

    # Free the transformers copy first so vLLM can reserve GPU memory.
    del model
    torch.cuda.empty_cache()

    llm = LLM(
        model=model_id,
        dtype='float16',
        tensor_parallel_size=1
    )
    sampling_params = SamplingParams(max_tokens=30)
    start = time.perf_counter()
    outputs_vllm = llm.generate(prompts, sampling_params)
    vllm_time = time.perf_counter() - start
    for output in outputs_vllm:
        print(f'Generated: {output.outputs[0].text[:60]}...')
    # Count tokens actually generated by vLLM as well.
    vllm_tokens = sum(len(o.outputs[0].token_ids) for o in outputs_vllm)
    print(f'Total time (vLLM): {vllm_time:.2f}s')
    print(f'Throughput: {vllm_tokens / vllm_time:.1f} tokens/sec')
    print(f'Speedup: {transformers_time / vllm_time:.1f}x')
except ImportError:
    print('vLLM not installed. Install with: pip install vllm')
    print('Expected speedup on GPU: 2–10x depending on batch size and model.')

Output:
=== transformers (sequential) ===
Generated: Explain quantum computing in one sentence. Quantum comput...
Generated: Explain quantum computing in one sentence. Quantum comput...
Generated: Explain quantum computing in one sentence. Quantum comput...
Generated: Explain quantum computing in one sentence. Quantum comput...
Total time (transformers): 3.42s
Throughput: 35.1 tokens/sec

=== vLLM (batched, if installed) ===
vLLM not installed. Install with: pip install vllm
Expected speedup on GPU: 2–10x depending on batch size and model.
What just happened?
The code loaded Llama 3.2 1B in float16 and ran 4 identical inference requests. With transformers, it executed them one at a time, so the GPU only ever worked on a single sequence. We measured wall-clock time and divided the generated tokens (up to 4 requests × 30 tokens = 120) by the runtime to get tokens/sec. In the run shown above vLLM was not installed, so only the fallback message printed. When it is installed, the vLLM branch submits all 4 prompts in one call and batches them on the GPU, sharing each forward pass across requests; the script's own estimate is a 2–10x speedup depending on batch size and model.
Common gotcha
Developers measure transformers throughput on a single request and vLLM throughput on a batch of 4, then claim vLLM is always faster. That comparison is unfair. On a single request the gap largely disappears, and transformers can come out ahead once you count vLLM's startup and scheduling overhead; vLLM's advantage appears under concurrent load. Also, vLLM is built for GPUs: on CPU, don't expect the same gains, and the overhead can make it slower than transformers.
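One way to avoid the unfair comparison is to run every backend through the same harness on the same workloads. The sketch below is a minimal illustration, not part of either library: `measure_throughput`, `fake_backend`, and the adapter convention (a function that takes a list of prompts and returns one list of token IDs per prompt) are hypothetical names introduced here. The fake backend stands in for a real model so the sketch runs without a GPU.

```python
import time


def measure_throughput(generate_fn, prompts):
    """Time generate_fn(prompts) and return (tokens_per_sec, total_tokens).

    generate_fn is a hypothetical adapter: it takes a list of prompts and
    returns one list of generated token IDs per prompt.
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(tokens) for tokens in outputs)
    return total_tokens / elapsed, total_tokens


def fake_backend(prompts):
    """Stand-in backend so the sketch runs anywhere: 30 dummy tokens each."""
    time.sleep(0.01)
    return [[0] * 30 for _ in prompts]


# Same workloads for every backend: one single-request run, one concurrent run.
workloads = {"single": ["hello"], "batch_of_4": ["hello"] * 4}
results = {}
for name, prompts in workloads.items():
    tps, total = measure_throughput(fake_backend, prompts)
    results[name] = total
    print(f"{name}: {total} tokens, {tps:.0f} tokens/sec")
```

To compare real backends, you would wrap the transformers loop and the vLLM call in adapters with this shape and report both workloads for each, so a single-request number is never set against a batched one.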
Error recovery
ImportError (vLLM)
OutOfMemoryError during model load
TokenizerFastException
Experienced dev note
vLLM's PagedAttention is not magic: it only helps when requests overlap. If you're building a single-user chatbot, use transformers for simplicity. But if you're building an API that handles 5+ concurrent users, vLLM quickly becomes the practical choice; on the same GPU it can mean sustaining several times the request rate. Also, vLLM's batching happens automatically inside the engine, so you don't restructure your inference logic around batches: you hand requests to the runtime and it schedules them. That's why it wins in production.
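The claim that batching only helps when requests overlap can be checked with a toy step counter. This is a simplified model, not vLLM's scheduler: it assumes one forward pass advances every running request by one token and costs about the same regardless of batch size, which is only roughly true while decoding is memory-bandwidth bound. The function names and the `max_batch` parameter are illustrative.

```python
def sequential_steps(output_lens):
    """One request at a time: every generated token costs one forward pass."""
    return sum(output_lens)


def continuous_batching_steps(output_lens, max_batch=4):
    """Toy scheduler: each forward pass advances up to max_batch requests by
    one token; a finished request's slot is refilled from the waiting queue."""
    waiting = list(output_lens)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots with waiting requests before the next forward pass.
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


for lens in ([30], [30, 30, 30, 30]):
    print(
        f"{len(lens)} request(s): sequential={sequential_steps(lens)} passes, "
        f"batched={continuous_batching_steps(lens)} passes"
    )
```

With a single request both schedules need the same number of passes, so there is nothing to gain; with four overlapping requests the batched schedule needs a quarter as many under these assumptions.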
Check your understanding
Why does vLLM's throughput advantage disappear when you benchmark a single request, and how do transformers and vLLM differ in handling concurrent requests? What is the GPU doing while later requests wait their turn in a transformers loop?
Answer hint
A correct answer recognizes that (1) vLLM's batching only helps when requests arrive together or queue up, (2) a single request has no batch advantage, and (3) in a sequential transformers loop, the GPU works on one sequence at a time and is underutilized while the other requests wait, whereas vLLM keeps the GPU fed continuously with a full batch.