How to deploy vLLM with Docker
Quick answer
Deploy vLLM with Docker by running the official vllm/vllm-openai Docker image or building your own Dockerfile that installs vllm. Use the vllm serve CLI command inside the container to start the model server, then query it over HTTP via its OpenAI-compatible API endpoints.
Prerequisites
- Docker installed (Docker Engine 20.10+)
- Python 3.8+ (for custom builds)
- Basic Docker CLI knowledge
Setup
Install Docker on your machine from the official site. Pull the official vllm/vllm-openai image from Docker Hub, or build your own image with vllm installed. Set environment variables as needed for model paths and ports.
docker pull vllm/vllm-openai:latest

Output:
Using default tag: latest
latest: Pulling from vllm/vllm-openai
Digest: sha256:...
Status: Downloaded newer image for vllm/vllm-openai:latest
Step by step
Run the vLLM server inside Docker with a command that exposes the HTTP API port and specifies the model to serve. The official image expects a GPU (pass --gpus all, with the NVIDIA Container Toolkit installed), and gated models such as Llama require a Hugging Face token passed via --env HUGGING_FACE_HUB_TOKEN. Then query the server using the OpenAI-compatible API endpoint.
docker run --rm --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

Output:
Starting vLLM server on 0.0.0.0:8000
Loaded model meta-llama/Llama-3.1-8B-Instruct
Waiting for requests...
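Model loading can take minutes, so it helps to wait for the server's /health endpoint (exposed by vLLM's OpenAI-compatible server) before sending requests. A minimal readiness poll; the URL and timeout values are examples:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False


# Example: after `docker run`, wait_for_server("http://localhost:8000/health")
# returns True once the model has finished loading.
```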
Querying the server
Use the OpenAI Python SDK with base_url pointing to your local vLLM server to send chat completions requests.
from openai import OpenAI

# vLLM does not check API keys unless the server is started with --api-key,
# so any placeholder string works here.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from vLLM Docker!"}],
)
print(response.choices[0].message.content)

Output:
Hello from vLLM Docker! How can I assist you today?
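For streaming responses, pass stream=True and concatenate the content deltas from each chunk. A sketch of the client-side accumulation; the helper name is illustrative, and chunks follow the OpenAI SDK's chat-completion-chunk shape:

```python
def join_stream_text(chunks) -> str:
    """Concatenate the content deltas from a stream of chat completion chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # role-only or empty deltas carry no text
            parts.append(delta.content)
    return "".join(parts)


def demo():
    # Live usage against the server started above (not run here).
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Stream a short greeting."}],
        stream=True,
    )
    print(join_stream_text(stream))
```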
Common variations
- Build a custom Dockerfile to include additional dependencies or models.
- Run vllm with GPU support by using the NVIDIA Container Toolkit and a CUDA-enabled image.
- Serve different models by changing the model name in the vllm serve command.
- Stream responses by configuring the client SDK call (stream=True) accordingly.
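The GPU variation can be sketched as a small helper that assembles the docker run argv; the image and flags follow the vLLM Docker docs, while the helper name and default model are illustrative:

```python
def build_vllm_run_cmd(model: str, port: int = 8000, gpus: str = "all") -> list:
    """Assemble a docker run argv for serving `model` with GPU access."""
    return [
        "docker", "run", "--rm",
        "--gpus", gpus,          # requires the NVIDIA Container Toolkit
        "--ipc=host",            # recommended by vLLM docs for PyTorch shared memory
        "-p", f"{port}:{port}",
        "vllm/vllm-openai:latest",
        "--model", model,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


# Example: subprocess.run(build_vllm_run_cmd("meta-llama/Llama-3.1-8B-Instruct"))
# would launch the server on port 8000 with all GPUs visible.
```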
An example custom Dockerfile:

# Note: a CPU-only base image; use an NVIDIA CUDA base image for GPU serving.
FROM python:3.10-slim
RUN pip install vllm
CMD ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct", "--host", "0.0.0.0", "--port", "8000"]

Troubleshooting
- If the server does not start, check Docker logs for missing dependencies or model download errors.
- For GPU usage, ensure NVIDIA drivers and the NVIDIA Container Toolkit (formerly nvidia-docker2) are installed and the container is run with --gpus all.
- If requests time out, verify port mappings and firewall settings.
- Use docker logs [container_id] to debug server startup issues.
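Common failure modes leave recognizable lines in the container logs, so a small scanner over docker logs output can speed up triage. The patterns and hints below are illustrative examples, not an exhaustive list:

```python
# Illustrative log patterns mapped to likely causes (examples, not exhaustive).
KNOWN_ERRORS = {
    "CUDA out of memory": "GPU too small for this model; try a smaller model or lower --max-model-len",
    "401 Client Error": "Hugging Face auth failed; pass HUGGING_FACE_HUB_TOKEN for gated models",
    "could not select device driver": "NVIDIA Container Toolkit missing; --gpus cannot be honored",
}


def diagnose(log_text: str) -> list:
    """Return likely causes for any known error patterns found in `log_text`."""
    return [hint for pattern, hint in KNOWN_ERRORS.items() if pattern in log_text]


# Example: feed it the output of `docker logs <container_id>`, e.g.
#   logs = subprocess.run(["docker", "logs", cid], capture_output=True, text=True)
#   hints = diagnose(logs.stdout + logs.stderr)
```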
Key Takeaways
- Use the official vLLM Docker image or build your own for flexible deployment.
- Expose port 8000 and run the vLLM server with the desired model inside the container.
- Query the running vLLM server via OpenAI-compatible API endpoints using the OpenAI SDK.
- Enable GPU support by running the container with NVIDIA Docker and CUDA base images.
- Check Docker logs and port mappings to troubleshoot common deployment issues.