How to deploy vLLM with Docker
Quick answer
Deploy vLLM with Docker by running the official vllm/vllm-openai Docker image or building your own Dockerfile that installs vllm. Use the vllm serve CLI command inside the container to start the model server, then query it over HTTP via its OpenAI-compatible API endpoints.
Prerequisites
- Docker installed (Docker Engine 20.10+)
- Python 3.8+ (for custom builds)
- Basic Docker CLI knowledge
Setup
Install Docker on your machine from the official site. Pull the official vllm/vllm-openai image from Docker Hub, or build your own image with vllm installed. Set environment variables as needed for model paths and ports.
docker pull vllm/vllm-openai:latest

Output:
Using default tag: latest
latest: Pulling from vllm/vllm-openai
Digest: sha256:...
Status: Downloaded newer image for vllm/vllm-openai:latest
Step by step
Run the vLLM server inside Docker with a command that exposes the HTTP API port and specifies the model to serve. The official image expects a GPU (pass --gpus all, with the NVIDIA Container Toolkit installed), and gated models such as Llama require a Hugging Face token passed via --env HUGGING_FACE_HUB_TOKEN. Then query the server using the OpenAI-compatible API endpoint.
docker run --rm --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

Output:
Starting vLLM server on 0.0.0.0:8000
Loaded model meta-llama/Llama-3.1-8B-Instruct
Waiting for requests...
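Model loading can take minutes, so it helps to wait for the server's /health endpoint (exposed by vLLM's OpenAI-compatible server) before sending requests. A minimal readiness poll; the URL and timeout values are examples:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after the interval
        time.sleep(interval)
    return False


# Example: after `docker run`, wait_for_server("http://localhost:8000/health")
# returns True once the model has finished loading.
```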
Querying the server
Use the OpenAI Python SDK with base_url pointing to your local vLLM server to send chat completions requests.
from openai import OpenAI

# vLLM does not check API keys unless the server is started with --api-key,
# so any placeholder string works here.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from vLLM Docker!"}],
)
print(response.choices[0].message.content)

Output:
Hello from vLLM Docker! How can I assist you today?
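For streaming responses, pass stream=True and concatenate the content deltas from each chunk. A sketch of the client-side accumulation; the helper name is illustrative, and chunks follow the OpenAI SDK's chat-completion-chunk shape:

```python
def join_stream_text(chunks) -> str:
    """Concatenate the content deltas from a stream of chat completion chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # role-only or empty deltas carry no text
            parts.append(delta.content)
    return "".join(parts)


def demo():
    # Live usage against the server started above (not run here).
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Stream a short greeting."}],
        stream=True,
    )
    print(join_stream_text(stream))
```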
Common variations
- Build a custom Dockerfile to include additional dependencies or models.
- Run vllm with GPU support by using the NVIDIA Container Toolkit and a CUDA-enabled image.
- Serve different models by changing the model name in the vllm serve command.
- Stream responses by configuring the client SDK call (stream=True) accordingly.
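The GPU variation can be sketched as a small helper that assembles the docker run argv; the image and flags follow the vLLM Docker docs, while the helper name and default model are illustrative:

```python
def build_vllm_run_cmd(model: str, port: int = 8000, gpus: str = "all") -> list:
    """Assemble a docker run argv for serving `model` with GPU access."""
    return [
        "docker", "run", "--rm",
        "--gpus", gpus,          # requires the NVIDIA Container Toolkit
        "--ipc=host",            # recommended by vLLM docs for PyTorch shared memory
        "-p", f"{port}:{port}",
        "vllm/vllm-openai:latest",
        "--model", model,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


# Example: subprocess.run(build_vllm_run_cmd("meta-llama/Llama-3.1-8B-Instruct"))
# would launch the server on port 8000 with all GPUs visible.
```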
An example custom Dockerfile:

# Note: a CPU-only base image; use an NVIDIA CUDA base image for GPU serving.
FROM python:3.10-slim
RUN pip install vllm
CMD ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct", "--host", "0.0.0.0", "--port", "8000"]

Troubleshooting
- If the server does not start, check Docker logs for missing dependencies or model download errors.
- For GPU usage, ensure NVIDIA drivers and the NVIDIA Container Toolkit (formerly nvidia-docker2) are installed and the container is run with --gpus all.
- If requests time out, verify port mappings and firewall settings.
- Use docker logs [container_id] to debug server startup issues.
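Common failure modes leave recognizable lines in the container logs, so a small scanner over docker logs output can speed up triage. The patterns and hints below are illustrative examples, not an exhaustive list:

```python
# Illustrative log patterns mapped to likely causes (examples, not exhaustive).
KNOWN_ERRORS = {
    "CUDA out of memory": "GPU too small for this model; try a smaller model or lower --max-model-len",
    "401 Client Error": "Hugging Face auth failed; pass HUGGING_FACE_HUB_TOKEN for gated models",
    "could not select device driver": "NVIDIA Container Toolkit missing; --gpus cannot be honored",
}


def diagnose(log_text: str) -> list:
    """Return likely causes for any known error patterns found in `log_text`."""
    return [hint for pattern, hint in KNOWN_ERRORS.items() if pattern in log_text]


# Example: feed it the output of `docker logs <container_id>`, e.g.
#   logs = subprocess.run(["docker", "logs", cid], capture_output=True, text=True)
#   hints = diagnose(logs.stdout + logs.stderr)
```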
Key Takeaways
- Use the official vLLM Docker image or build your own for flexible deployment.
- Expose port 8000 and run the vLLM server with the desired model inside the container.
- Query the running vLLM server via OpenAI-compatible API endpoints using the OpenAI SDK.
- Enable GPU support by running the container with NVIDIA Docker and CUDA base images.
- Check Docker logs and port mappings to troubleshoot common deployment issues.