
How to deploy vLLM on GCP

Quick answer
To deploy vLLM on GCP, provision a GPU-enabled VM instance, install vLLM and dependencies, then start the vLLM server via CLI. Query the running server using the OpenAI SDK with the server's endpoint as base_url.

Prerequisites

  • Python 3.8+
  • Google Cloud account with billing enabled
  • gcloud CLI installed and configured
  • pip install vllm openai
  • GPU-enabled GCP VM instance (e.g., NVIDIA Tesla T4 or A100)

Setup GCP VM instance

Create a GPU-enabled VM instance on GCP using the Google Cloud Console or the gcloud CLI. Choose a Linux image (Ubuntu 22.04 recommended) and attach a compatible NVIDIA GPU such as a Tesla T4 or A100; make sure your project has GPU quota in the chosen zone. The startup script below installs the NVIDIA driver, and SSH access is enabled by default on Compute Engine.

bash
gcloud compute instances create vllm-instance \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --metadata=startup-script='#! /bin/bash
sudo apt-get update
sudo apt-get install -y build-essential python3-pip
# Install the NVIDIA driver from the Ubuntu repositories, then reboot so it loads
sudo apt-get install -y nvidia-driver-525
sudo reboot'
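The Python client later in this guide connects to port 8000 from outside the VM, and GCP firewalls block inbound traffic on that port by default, so you will also need a firewall rule. A minimal sketch (the rule name is a placeholder, and the wide-open source range is only for quick testing):

```shell
# Allow inbound TCP traffic on vLLM's default port 8000.
# --source-ranges=0.0.0.0/0 opens the port to the whole internet;
# replace it with your own IP range for anything beyond a quick test.
gcloud compute firewall-rules create allow-vllm-8000 \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:8000 \
  --source-ranges=0.0.0.0/0
```

For production, prefer keeping the port closed and fronting the VM with a load balancer or an SSH tunnel instead.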

Install and run vLLM server

SSH into your VM, install vLLM and dependencies, then start the vLLM server with your chosen model. The server listens on port 8000 by default.

bash
# SSH into the instance (gcloud handles key setup automatically)
gcloud compute ssh vllm-instance --zone=us-central1-a

# Update and install dependencies
sudo apt-get update && sudo apt-get install -y python3-pip

# Install vLLM
pip install vllm

# Llama models are gated on Hugging Face: accept the license and run
# `huggingface-cli login` (or set HF_TOKEN) before the first download.

# Start the vLLM server with a model (e.g., meta-llama/Llama-3.1-8B-Instruct)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
output
INFO:     Started server process [12345]
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
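Before wiring up a client, you can sanity-check the server from the VM itself. vLLM exposes the OpenAI-compatible /v1/models endpoint, which lists the loaded model:

```shell
# Returns a JSON object whose "data" array contains the served model ID
curl http://localhost:8000/v1/models
```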

Query vLLM server from Python

Use the OpenAI Python SDK to send chat completion requests to your running vLLM server, setting the base_url parameter to your VM's external IP and port.

python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # vLLM ignores the key unless the server was started with --api-key
    base_url="http://vllm-instance-ip:8000/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from GCP vLLM!"}]
)

print(response.choices[0].message.content)
output
Hello from GCP vLLM! How can I assist you today?
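Under the hood, the SDK simply POSTs JSON to the /v1/chat/completions endpoint, so any HTTP client works. A minimal sketch that builds the same request by hand (the build_chat_payload helper is illustrative, not part of vLLM or the OpenAI SDK):

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for an OpenAI-compatible chat completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_payload(
    "meta-llama/Llama-3.1-8B-Instruct", "Hello from GCP vLLM!"
)

# POST to the running server (replace vllm-instance-ip with your VM's address)
req = urllib.request.Request(
    "http://vllm-instance-ip:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment against a live server
```

This is also a handy way to confirm what the SDK sends when debugging connectivity issues.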

Common variations and tips

  • Use different models by changing the model name in the vllm serve command and the Python client.
  • For production, consider using a managed Kubernetes cluster with GPU nodes and containerize vLLM.
  • Enable HTTPS and authentication for secure access.
  • Use vllm serve flags such as --max-model-len, --max-num-seqs, and --gpu-memory-utilization to tune context length, concurrency, and GPU memory use.
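As one concrete example of the last two points, the flags below are real vllm serve options; the values are illustrative starting points, not recommendations, and YOUR_SECRET_KEY is a placeholder:

```shell
# --max-model-len: cap the context length to fit GPU memory
# --max-num-seqs: limit concurrent sequences per batch
# --gpu-memory-utilization: fraction of GPU memory vLLM may claim
# --api-key: require this bearer token on every request
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.9 \
  --api-key YOUR_SECRET_KEY
```

If you set --api-key, pass the same value as api_key in the OpenAI client instead of a dummy string.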

Key Takeaways

  • Provision a GPU-enabled VM on GCP with NVIDIA drivers for optimal vLLM performance.
  • Run the vLLM server via CLI and query it remotely using the OpenAI SDK with base_url set to your server endpoint.
  • Customize deployment by selecting models, tuning server parameters, and securing access for production use.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct