How to deploy vLLM on AWS
Quick answer
Deploy vLLM on AWS by launching an EC2 instance, installing Docker, and running the vLLM server container. Then query the server using the openai Python SDK with base_url pointing to your instance's endpoint.
Prerequisites
- Python 3.8+
- AWS account with EC2 permissions
- Docker installed on the EC2 instance
- pip install openai>=1.0
- No OpenAI API key is required: the vLLM server accepts any placeholder key unless you start it with --api-key
Setup AWS EC2 instance
Launch an AWS EC2 instance with a GPU (e.g., g4dn.xlarge) for efficient vLLM inference. Use an Ubuntu 22.04 AMI and configure security groups to allow inbound TCP on port 8000 for HTTP API access.
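Before moving on, it is worth confirming that the security-group rule actually lets traffic through. One way is a short standard-library check run from your workstation (the port_open helper below is a hypothetical utility, not part of vLLM or AWS):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Once the server is running, replace with your instance's public IP:
# print(port_open("YOUR_EC2_PUBLIC_IP", 8000))
```

If this returns False after the server is up, the usual culprit is a missing inbound rule for TCP 8000 in the security group.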
SSH into the instance and install Docker:
sudo apt update && sudo apt install -y docker.io
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER
newgrp docker
Step by step deployment
Pull and run the official vLLM Docker image to serve the model locally on port 8000:
docker pull vllm/vllm-openai:latest
docker run -d --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> \
  vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct
Note: meta-llama models are gated on Hugging Face, so the container needs a token for an account with access to the model.
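Once the container is up, the OpenAI-compatible server exposes a /v1/models endpoint that works as a simple health check. A minimal standard-library sketch (the list_models helper is hypothetical; it assumes the server is reachable at the given base URL):

```python
import json
from urllib.request import urlopen

def list_models(base_url, timeout=5.0):
    """Return the model IDs served at base_url (e.g. http://HOST:8000)."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]

# print(list_models("http://YOUR_EC2_PUBLIC_IP:8000"))
```

If this lists your model, the server is ready to accept chat completions.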
Query vLLM server with Python
Use the openai Python SDK to send prompts to your running vLLM server by pointing base_url at your EC2 instance. Replace YOUR_EC2_PUBLIC_IP with your instance's public IP address.
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key,
# so any placeholder string works here.
client = OpenAI(api_key="EMPTY", base_url="http://YOUR_EC2_PUBLIC_IP:8000/v1")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Write a Python function to reverse a string."}]
)
print(response.choices[0].message.content)
Example output:
def reverse_string(s):
    return s[::-1]
Common variations
- Use different instance types (e.g., p3.2xlarge) for larger models.
- Run vLLM with custom configuration for batching and latency tuning.
- Use HTTPS with a reverse proxy such as Nginx for secure access.
- Run the vLLM server asynchronously and query it with async Python clients.
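For the HTTPS reverse-proxy variation, a minimal Nginx server block might look like the sketch below. The hostname and certificate paths are placeholders you must supply, and the upstream assumes vLLM is listening on localhost:8000:

```nginx
server {
    listen 443 ssl;
    server_name vllm.example.com;  # placeholder hostname

    ssl_certificate     /etc/ssl/certs/vllm.example.com.pem;   # placeholder
    ssl_certificate_key /etc/ssl/private/vllm.example.com.key; # placeholder

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Disable buffering so streamed completions reach the client promptly
        proxy_buffering off;
    }
}
```

With this in place, the security group only needs to expose 443, and port 8000 can stay closed to the internet.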
Troubleshooting
- If you cannot connect to port 8000, check your EC2 security group inbound rules.
- If Docker container fails to start, verify GPU drivers and Docker NVIDIA runtime are installed.
- For slow responses, increase instance GPU memory or tune vLLM batching parameters.
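As a sketch of that tuning, the following server flags are the usual starting points (these are real vLLM options, but the values shown are illustrative and should be adjusted for your GPU and workload):

```shell
# Illustrative values; tune for your GPU and workload.
docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128 \
  --max-model-len 8192
```

Lowering --max-model-len frees KV-cache memory, while --max-num-seqs caps how many requests are batched concurrently.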
Key Takeaways
- Use GPU-enabled EC2 instances for efficient vLLM inference.
- Run vLLM in Docker on EC2 and expose port 8000 for API access.
- Query the vLLM server with the OpenAI Python SDK using the base_url parameter.
- Configure security groups and GPU drivers properly to avoid connectivity and runtime issues.