How to deploy vLLM on AWS
Quick answer
Deploy vLLM on AWS by launching an EC2 instance, installing Docker, and running the vLLM server container. Then query the server using the openai Python SDK with base_url pointing to your instance's endpoint.
Prerequisites
- Python 3.8+
- AWS account with EC2 permissions
- Docker installed on the EC2 instance
- pip install openai>=1.0
- No OpenAI API key is required: the vLLM server accepts any placeholder key unless you start it with --api-key
Setup AWS EC2 instance
Launch an AWS EC2 instance with a GPU (e.g., g4dn.xlarge) for efficient vLLM inference. Use an Ubuntu 22.04 AMI and configure security groups to allow inbound TCP on port 8000 for HTTP API access.
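Before moving on, it is worth confirming that the security-group rule actually lets traffic through. One way is a short standard-library check run from your workstation (the port_open helper below is a hypothetical utility, not part of vLLM or AWS):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Once the server is running, replace with your instance's public IP:
# print(port_open("YOUR_EC2_PUBLIC_IP", 8000))
```

If this returns False after the server is up, the usual culprit is a missing inbound rule for TCP 8000 in the security group.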
SSH into the instance and install Docker:
sudo apt update && sudo apt install -y docker.io
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER
newgrp docker
Step by step deployment
Pull and run the official vLLM Docker image to serve the model locally on port 8000:
docker pull vllm/vllm-openai:latest
docker run -d --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> \
  vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct
Note: meta-llama models are gated on Hugging Face, so the container needs a token for an account with access to the model.
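Once the container is up, the OpenAI-compatible server exposes a /v1/models endpoint that works as a simple health check. A minimal standard-library sketch (the list_models helper is hypothetical; it assumes the server is reachable at the given base URL):

```python
import json
from urllib.request import urlopen

def list_models(base_url, timeout=5.0):
    """Return the model IDs served at base_url (e.g. http://HOST:8000)."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]

# print(list_models("http://YOUR_EC2_PUBLIC_IP:8000"))
```

If this lists your model, the server is ready to accept chat completions.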
Query vLLM server with Python
Use the openai Python SDK to send prompts to your running vLLM server by pointing base_url at your EC2 instance. Replace YOUR_EC2_PUBLIC_IP with your instance's public IP address.
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key,
# so any placeholder string works here.
client = OpenAI(api_key="EMPTY", base_url="http://YOUR_EC2_PUBLIC_IP:8000/v1")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Write a Python function to reverse a string."}]
)
print(response.choices[0].message.content)
Example output:
def reverse_string(s):
    return s[::-1]
Common variations
- Use different instance types (e.g., p3.2xlarge) for larger models.
- Run vLLM with custom configuration for batching and latency tuning.
- Use HTTPS with a reverse proxy such as Nginx for secure access.
- Run the vLLM server asynchronously and query it with async Python clients.
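For the HTTPS reverse-proxy variation, a minimal Nginx server block might look like the sketch below. The hostname and certificate paths are placeholders you must supply, and the upstream assumes vLLM is listening on localhost:8000:

```nginx
server {
    listen 443 ssl;
    server_name vllm.example.com;  # placeholder hostname

    ssl_certificate     /etc/ssl/certs/vllm.example.com.pem;   # placeholder
    ssl_certificate_key /etc/ssl/private/vllm.example.com.key; # placeholder

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Disable buffering so streamed completions reach the client promptly
        proxy_buffering off;
    }
}
```

With this in place, the security group only needs to expose 443, and port 8000 can stay closed to the internet.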
Troubleshooting
- If you cannot connect to port 8000, check your EC2 security group inbound rules.
- If Docker container fails to start, verify GPU drivers and Docker NVIDIA runtime are installed.
- For slow responses, increase instance GPU memory or tune vLLM batching parameters.
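As a sketch of that tuning, the following server flags are the usual starting points (these are real vLLM options, but the values shown are illustrative and should be adjusted for your GPU and workload):

```shell
# Illustrative values; tune for your GPU and workload.
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128 \
  --max-model-len 8192
```

Lowering --max-model-len frees KV-cache memory, while --max-num-seqs caps how many requests are batched concurrently.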
Key Takeaways
- Use GPU-enabled EC2 instances for efficient vLLM inference.
- Run vLLM in Docker on EC2 and expose port 8000 for API access.
- Query the vLLM server with the OpenAI Python SDK using the base_url parameter.
- Configure security groups and GPU drivers properly to avoid connectivity and runtime issues.