How to use vLLM pipeline parallelism
Quick answer
Use vLLM's LLM class with the pipeline_parallel_size parameter to enable pipeline parallelism across multiple GPUs. This splits the model into sequential stages, one per GPU, letting models too large for a single GPU run and improving throughput.

Prerequisites
- Python 3.8+
- pip install vllm
- Multiple GPUs available for pipeline parallelism
Setup
Install the vllm package via pip and ensure you have multiple GPUs available for pipeline parallelism. No API keys are required as vllm runs locally.
pip install vllm

Step by step
This example demonstrates how to enable pipeline parallelism by setting pipeline_parallel_size when creating the LLM instance. The model is split into that many stages across GPUs.
from vllm import LLM, SamplingParams
# Create LLM instance with 2 pipeline stages (requires 2 GPUs)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", pipeline_parallel_size=2)
# Generate text with sampling parameters
outputs = llm.generate([
"Write a short poem about AI pipeline parallelism."
], SamplingParams(temperature=0.7, max_tokens=100))
print(outputs[0].outputs[0].text)

Output
In GPUs' dance, the stages align,
Parallel paths where models shine.
Speed and scale in harmony,
Pipeline power sets AI free.
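Conceptually, each pipeline stage owns a contiguous block of the model's layers. The sketch below is illustrative only, not vLLM's internal code (vLLM handles the partitioning itself when pipeline_parallel_size is set):

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Partition layer indices into contiguous, near-equal stages.

    Illustrative sketch -- not vLLM's actual partitioning logic.
    """
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # spread remainder over early stages
        stages.append(range(start, start + size))
        start += size
    return stages

# A 32-layer model split across 2 GPUs:
print(split_layers(32, 2))  # [range(0, 16), range(16, 32)]
```

With 2 stages, GPU 0 runs layers 0-15 and GPU 1 runs layers 16-31; activations flow from one stage to the next while both GPUs stay busy on different requests.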
Common variations
- Adjust pipeline stages: change pipeline_parallel_size to match your GPU count for optimal utilization.
- Use different models: replace model with any compatible local or Hugging Face model.
- Async generation: the offline LLM class is synchronous and has no generate_async() method; for asynchronous request handling, use vLLM's AsyncLLMEngine or the OpenAI-compatible server.
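For online serving, the same setting is available as a CLI flag on vLLM's OpenAI-compatible server (flag name as in recent vLLM releases; confirm with vllm serve --help on your version):

```shell
# Serve the model with 2 pipeline stages across 2 GPUs
vllm serve meta-llama/Llama-3.1-8B-Instruct --pipeline-parallel-size 2
```

This can also be combined with --tensor-parallel-size to shard each stage across additional GPUs.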
Troubleshooting
- If you see errors about GPU availability, verify your system has enough GPUs and CUDA is configured correctly.
- Pipeline parallelism requires model partitioning support; ensure your model is compatible.
- For memory errors, reduce batch size or number of pipeline stages.
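A small sanity check before constructing the LLM can surface GPU problems early. The helper below is hypothetical (not part of vLLM); in practice you would pass it torch.cuda.device_count() as the available count:

```python
def clamp_pipeline_stages(requested: int, available_gpus: int) -> int:
    """Clamp a requested pipeline_parallel_size to the GPUs actually present.

    Hypothetical helper -- fails fast with a clear message instead of
    erroring later inside engine startup.
    """
    if available_gpus < 1:
        raise RuntimeError("No GPUs detected; check your CUDA installation.")
    if requested > available_gpus:
        return available_gpus  # fall back to what the hardware supports
    return requested

# e.g. with torch: clamp_pipeline_stages(4, torch.cuda.device_count())
print(clamp_pipeline_stages(4, 2))  # 2
```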
Key Takeaways
- Enable pipeline parallelism in vLLM by setting pipeline_parallel_size to split the model across GPUs.
- Pipeline parallelism improves throughput and lets models that do not fit on one GPU run across several; it generally does not reduce single-request latency (tensor parallelism is better suited for that).
- Adjust the number of pipeline stages to match your hardware for best performance.
- Ensure CUDA and GPU drivers are properly installed to avoid runtime errors.
- vLLM runs locally and requires no API keys, making it ideal for on-premise inference.