How to use vLLM pipeline parallelism
Quick answer
Use vLLM's LLM class with the pipeline_parallel_size parameter to enable pipeline parallelism across multiple GPUs. This splits the model into sequential stages, one per GPU, letting models too large for a single GPU run and improving throughput.

Prerequisites
- Python 3.8+
- pip install vllm
- Multiple GPUs available for pipeline parallelism
Setup
Install the vllm package via pip and ensure you have multiple GPUs available for pipeline parallelism. No API keys are required as vllm runs locally.
pip install vllm

Step by step
This example demonstrates how to enable pipeline parallelism by setting pipeline_parallel_size when creating the LLM instance. The model is split into that many stages across GPUs.
from vllm import LLM, SamplingParams
# Create LLM instance with 2 pipeline stages (requires 2 GPUs)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", pipeline_parallel_size=2)
# Generate text with sampling parameters
outputs = llm.generate([
"Write a short poem about AI pipeline parallelism."
], SamplingParams(temperature=0.7, max_tokens=100))
print(outputs[0].outputs[0].text)

Output
In GPUs' dance, the stages align,
Parallel paths where models shine.
Speed and scale in harmony,
Pipeline power sets AI free.
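Conceptually, each pipeline stage owns a contiguous block of the model's layers. The sketch below is illustrative only, not vLLM's internal code (vLLM handles the partitioning itself when pipeline_parallel_size is set):

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Partition layer indices into contiguous, near-equal stages.

    Illustrative sketch -- not vLLM's actual partitioning logic.
    """
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # spread remainder over early stages
        stages.append(range(start, start + size))
        start += size
    return stages

# A 32-layer model split across 2 GPUs:
print(split_layers(32, 2))  # [range(0, 16), range(16, 32)]
```

With 2 stages, GPU 0 runs layers 0-15 and GPU 1 runs layers 16-31; activations flow from one stage to the next while both GPUs stay busy on different requests.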
Common variations
- Adjust pipeline stages: change pipeline_parallel_size to match your GPU count for optimal utilization.
- Use different models: replace model with any compatible local or Hugging Face model.
- Async generation: the offline LLM class is synchronous and has no generate_async() method; for asynchronous request handling, use vLLM's AsyncLLMEngine or the OpenAI-compatible server.
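For online serving, the same setting is available as a CLI flag on vLLM's OpenAI-compatible server (flag name as in recent vLLM releases; confirm with vllm serve --help on your version):

```shell
# Serve the model with 2 pipeline stages across 2 GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct --pipeline-parallel-size 2
```

This can also be combined with --tensor-parallel-size to shard each stage across additional GPUs.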
Troubleshooting
- If you see errors about GPU availability, verify your system has enough GPUs and CUDA is configured correctly.
- Pipeline parallelism requires model partitioning support; ensure your model is compatible.
- For memory errors, reduce batch size or number of pipeline stages.
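A small sanity check before constructing the LLM can surface GPU problems early. The helper below is hypothetical (not part of vLLM); in practice you would pass it torch.cuda.device_count() as the available count:

```python
def clamp_pipeline_stages(requested: int, available_gpus: int) -> int:
    """Clamp a requested pipeline_parallel_size to the GPUs actually present.

    Hypothetical helper -- fails fast with a clear message instead of
    erroring later inside engine startup.
    """
    if available_gpus < 1:
        raise RuntimeError("No GPUs detected; check your CUDA installation.")
    if requested > available_gpus:
        return available_gpus  # fall back to what the hardware supports
    return requested

# e.g. with torch: clamp_pipeline_stages(4, torch.cuda.device_count())
print(clamp_pipeline_stages(4, 2))  # 2
```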
Key Takeaways
- Enable pipeline parallelism in vLLM by setting pipeline_parallel_size to split the model across GPUs.
- Pipeline parallelism improves throughput and lets models that do not fit on one GPU run across several; it generally does not reduce single-request latency (tensor parallelism is better suited for that).
- Adjust the number of pipeline stages to match your hardware for best performance.
- Ensure CUDA and GPU drivers are properly installed to avoid runtime errors.
- vLLM runs locally and requires no API keys, making it ideal for on-premise inference.