How to deploy vLLM on GCP
Quick answer
To deploy vLLM on GCP, provision a GPU-enabled VM instance, install vLLM and its dependencies, then start the vLLM server via the CLI. Query the running server using the OpenAI SDK with the server's endpoint as base_url.

Prerequisites
- Python 3.8+
- Google Cloud account with billing enabled
- gcloud CLI installed and configured
- pip install vllm openai
- GPU-enabled GCP VM instance (e.g., NVIDIA Tesla T4 or A100)
Set up a GCP VM instance
Create a GPU-enabled VM instance on GCP using the Google Cloud Console or gcloud CLI. Choose a Linux OS (Ubuntu 22.04 recommended) and attach a compatible NVIDIA GPU (e.g., Tesla T4 or A100). Ensure you enable the NVIDIA GPU driver installation and allow SSH access.
gcloud compute instances create vllm-instance \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--maintenance-policy=TERMINATE \
--restart-on-failure \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata=startup-script='#! /bin/bash
sudo apt-get update
sudo apt-get install -y build-essential python3-pip
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-driver-525
sudo reboot'

Install and run vLLM server
SSH into your VM, install vLLM and dependencies, then start the vLLM server with your chosen model. The server listens on port 8000 by default.
ssh USERNAME@vllm-instance-ip
# Update and install dependencies
sudo apt-get update && sudo apt-get install -y python3-pip
# Install vLLM
pip install vllm
# Start vLLM server with a model (e.g., meta-llama/Llama-3.1-8B-Instruct)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Output:
Serving model meta-llama/Llama-3.1-8B-Instruct on port 8000
Waiting for requests...
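Before querying remotely, you can sanity-check the server from the VM itself. vLLM exposes an OpenAI-compatible REST API, so the /v1/models endpoint lists the loaded model and /v1/chat/completions accepts requests directly via curl:

```shell
# List the model(s) the server is currently serving
curl http://localhost:8000/v1/models

# Send a minimal chat completion request directly with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "ping"}]
      }'
```

If the first command returns a JSON list containing your model ID, the server is up and ready to receive requests on port 8000.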
Query vLLM server from Python
Use the OpenAI Python SDK to send chat completions requests to your running vLLM server by specifying the base_url parameter pointing to your VM's IP and port.
import os
from openai import OpenAI

# vLLM does not require an API key by default; "EMPTY" is a common placeholder
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://vllm-instance-ip:8000/v1",  # replace with your VM's IP
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from GCP vLLM!"}],
)
print(response.choices[0].message.content)

Output:
Hello from GCP vLLM! How can I assist you today?
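The same client can also stream tokens as they are generated, which is useful for interactive applications. This is a minimal sketch assuming the server and model from above; stream=True is standard in the OpenAI SDK and supported by vLLM's OpenAI-compatible server:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://vllm-instance-ip:8000/v1",  # replace with your VM's IP
)

# stream=True yields incremental chunks instead of one final response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g., the final one) carry no content
        print(delta, end="", flush=True)
print()
```

Each chunk carries only the newly generated text in choices[0].delta.content, so printing the deltas in order reconstructs the full reply.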
Common variations and tips
- Use different models by changing the model name in the vllm serve command and the Python client.
- For production, consider a managed Kubernetes cluster with GPU nodes and containerizing vLLM.
- Enable HTTPS and authentication for secure access.
- Use vllm CLI flags to customize batch size, max tokens, and concurrency.
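As an illustration of the tuning and auth flags mentioned above, a vllm serve invocation might look like the following. --gpu-memory-utilization, --max-model-len, --max-num-seqs, and --api-key are real vLLM options, but the values shown here are examples to adjust for your GPU and workload:

```shell
# --gpu-memory-utilization: fraction of GPU memory vLLM may claim
# --max-model-len: cap the context length to reduce KV-cache memory
# --max-num-seqs: limit the number of concurrently batched sequences
# --api-key: require this bearer token on incoming requests
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --api-key my-secret-token
```

When --api-key is set, clients must send that value as their API key (e.g., api_key="my-secret-token" in the OpenAI SDK) or requests are rejected.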
Key Takeaways
- Provision a GPU-enabled VM on GCP with NVIDIA drivers for optimal vLLM performance.
- Run the vLLM server via the CLI and query it remotely using the OpenAI SDK with base_url set to your server endpoint.
- Customize deployment by selecting models, tuning server parameters, and securing access for production use.