How to deploy vLLM on GCP
Quick answer
To deploy vLLM on GCP, provision a GPU-enabled VM instance, install vLLM and its dependencies, then start the vLLM server via the CLI. Query the running server using the OpenAI SDK with the server's endpoint as base_url.

Prerequisites
- Python 3.8+
- Google Cloud account with billing enabled
- gcloud CLI installed and configured
- pip install vllm openai
- GPU-enabled GCP VM instance (e.g., NVIDIA Tesla T4 or A100)
Set up a GCP VM instance
Create a GPU-enabled VM instance on GCP using the Google Cloud Console or gcloud CLI. Choose a Linux OS (Ubuntu 22.04 recommended) and attach a compatible NVIDIA GPU (e.g., Tesla T4 or A100). Ensure you enable the NVIDIA GPU driver installation and allow SSH access.
gcloud compute instances create vllm-instance \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--maintenance-policy=TERMINATE \
--restart-on-failure \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata=startup-script='#! /bin/bash
sudo apt-get update
sudo apt-get install -y build-essential python3-pip
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-driver-525
sudo reboot'

Install and run vLLM server
SSH into your VM, install vLLM and dependencies, then start the vLLM server with your chosen model. The server listens on port 8000 by default.
ssh USERNAME@vllm-instance-ip
# Update and install dependencies
sudo apt-get update && sudo apt-get install -y python3-pip
# Install vLLM
pip install vllm
# Start vLLM server with a model (e.g., meta-llama/Llama-3.1-8B-Instruct)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Output:
Serving model meta-llama/Llama-3.1-8B-Instruct on port 8000
Waiting for requests...
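Before querying remotely, you can sanity-check the server from the VM itself. vLLM exposes an OpenAI-compatible REST API, so the /v1/models endpoint lists the loaded model and /v1/chat/completions accepts requests directly via curl:

```shell
# List the model(s) the server is currently serving
curl http://localhost:8000/v1/models

# Send a minimal chat completion request directly with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "ping"}]
      }'
```

If the first command returns a JSON list containing your model ID, the server is up and ready to receive requests on port 8000.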
Query vLLM server from Python
Use the OpenAI Python SDK to send chat completions requests to your running vLLM server by specifying the base_url parameter pointing to your VM's IP and port.
import os
from openai import OpenAI

# vLLM does not require an API key by default; "EMPTY" is a common placeholder
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://vllm-instance-ip:8000/v1",  # replace with your VM's IP
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from GCP vLLM!"}],
)
print(response.choices[0].message.content)

Output:
Hello from GCP vLLM! How can I assist you today?
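The same client can also stream tokens as they are generated, which is useful for interactive applications. This is a minimal sketch assuming the server and model from above; stream=True is standard in the OpenAI SDK and supported by vLLM's OpenAI-compatible server:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://vllm-instance-ip:8000/v1",  # replace with your VM's IP
)

# stream=True yields incremental chunks instead of one final response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g., the final one) carry no content
        print(delta, end="", flush=True)
print()
```

Each chunk carries only the newly generated text in choices[0].delta.content, so printing the deltas in order reconstructs the full reply.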
Common variations and tips
- Use different models by changing the model name in the vllm serve command and the Python client.
- For production, consider a managed Kubernetes cluster with GPU nodes and containerizing vLLM.
- Enable HTTPS and authentication for secure access.
- Use vllm CLI flags to customize batch size, max tokens, and concurrency.
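As an illustration of the tuning and auth flags mentioned above, a vllm serve invocation might look like the following. --gpu-memory-utilization, --max-model-len, --max-num-seqs, and --api-key are real vLLM options, but the values shown here are examples to adjust for your GPU and workload:

```shell
# --gpu-memory-utilization: fraction of GPU memory vLLM may claim
# --max-model-len: cap the context length to reduce KV-cache memory
# --max-num-seqs: limit the number of concurrently batched sequences
# --api-key: require this bearer token on incoming requests
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --api-key my-secret-token
```

When --api-key is set, clients must send that value as their API key (e.g., api_key="my-secret-token" in the OpenAI SDK) or requests are rejected.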
Key Takeaways
- Provision a GPU-enabled VM on GCP with NVIDIA drivers for optimal vLLM performance.
- Run the vLLM server via the CLI and query it remotely using the OpenAI SDK with base_url set to your server endpoint.
- Customize deployment by selecting models, tuning server parameters, and securing access for production use.