PersistentVolume for model weights
Why this matters
Without a PersistentVolume, every vLLM pod restart re-downloads the model weights (roughly 8–70 GB, depending on the model) from Hugging Face, which costs bandwidth and adds a 5–15 minute startup delay. In production with auto-scaling, that cost multiplies across replicas. A properly configured PV removes the download from the startup path: pods start in the time it takes to load weights from disk, and both startup time and storage cost become predictable.
Explanation
A Kubernetes PersistentVolume (PV) is a cluster-wide storage resource decoupled from pod lifecycle. When you pair it with a PersistentVolumeClaim (PVC), any vLLM pod that the volume's access mode allows can mount the same filesystem and access pre-downloaded models. vLLM fetches models through the Hugging Face Hub client, which caches them in ~/.cache/huggingface/hub by default (/root/.cache/huggingface/hub when the container runs as root). By mounting a PV at this path, you ensure all replicas share the same model cache. The first pod to start downloads the model once; subsequent pods read from disk. This pattern scales from 2 replicas to 100 without multiplying bandwidth or startup time, provided the storage backend supports that many concurrent readers. The storage class (e.g., fast-ssd, ebs-gp3) determines speed: NVMe-backed PVs can reduce model load time noticeably (on the order of 20–40% faster inference startup than network storage, depending on model size and hardware).
Configuration
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# The gp3 iops/throughput parameters are read by the EBS CSI driver.
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: default
spec:
  accessModes:
    # ReadWriteOnce: mountable from a single node at a time (see Common gotcha).
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.19.0
          # The image does not read MODEL_NAME by itself; pass it to `vllm serve`.
          # Kubernetes expands $(MODEL_NAME) from the env list below.
          command: ["vllm", "serve"]
          args: ["$(MODEL_NAME)"]
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_NAME
              value: "meta-llama/Llama-3.1-8B-Instruct"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          volumeMounts:
            # Mount the shared cache where the Hugging Face client looks by default.
            - name: model-cache
              mountPath: /root/.cache/huggingface
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
            limits:
              memory: "24Gi"
              cpu: "8"
              # Assumes the NVIDIA device plugin is installed on the cluster.
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
Why this order?
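Apply the resources in dependency order: the StorageClass before the PVC that references it, and the PVC before the Deployment that references it under volumes. Kubernetes will eventually reconcile resources applied out of order, but a PVC stays Pending until its StorageClass exists, and pods stay Pending until the PVC exists. The initialDelaySeconds: 120 on the liveness probe delays the first health check by two minutes. That is enough for pods that find the model already cached, but it is shorter than a 5–15 minute cold download, so the first pod can be restarted mid-download. A startupProbe with a generous failure budget is the safer way to cover that case.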
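To make the probe timing concrete, here is a minimal sketch of a startupProbe that tolerates a cold download, assuming the same /health endpoint and port as the configuration above. The thresholds are illustrative assumptions, not tuned values.

```yaml
# Fragment for the vllm container spec; thresholds are illustrative assumptions.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 120   # 120 x 10s = up to 20 minutes before the pod is restarted
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10       # only starts running after the startup probe succeeds
```

Because the liveness probe is held back until the startup probe passes, a long first download no longer competes with a tight liveness budget, and pods with a warm cache still become live as soon as /health responds.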
Wrong vs Right
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.19.0
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          # Problem: ephemeral storage, deleted together with the pod.
          emptyDir: {}
Wrong: this uses emptyDir, so each pod gets a fresh ephemeral volume. Every pod replacement (rollout, eviction, reschedule) and every new replica downloads the full model again. For a 70 GB model and 3 replicas, that is 210 GB in total, even if the pods start one after another.
Right: use the PersistentVolumeClaim shown in the Configuration section. When all 3 replicas can mount the same underlying storage (see Common gotcha for the access-mode requirement), the first pod downloads once and pods 2–3 read from the cache. Total download: 70 GB instead of 210 GB.
Tool vitals
Apply: kubectl apply -f pv-claim.yaml
Files: pv-persistent-volume.yaml, pvc-claim.yaml, deployment.yaml
Verify: kubectl get pv,pvc && kubectl exec <pod-name> -- ls -lh /root/.cache/huggingface/hub
Integration notes
Use a StatefulSet (not a Deployment) if you need one PVC per replica; its volumeClaimTemplates field creates a PVC for each pod automatically. For multi-region setups, consider snapshot-based replication or external storage (S3) with model-caching sidecars. The PV setup is independent of how traffic reaches the pods: vLLM's OpenAI-compatible API can sit behind a Kubernetes Service of type LoadBalancer without any storage-related changes.
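The StatefulSet variant can be sketched as follows. This is a minimal outline that reuses the image, labels, and fast-ssd storage class from the configuration above; the serviceName is an assumption and requires a matching headless Service.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-server
spec:
  serviceName: vllm-server   # assumption: a headless Service with this name exists
  replicas: 3
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.19.0
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
  volumeClaimTemplates:
    # One PVC per pod: model-cache-vllm-server-0, -1, -2
    - metadata:
        name: model-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
```

Each replica downloads the model once into its own volume and keeps it across restarts. The trade-off is storage cost that grows linearly with the replica count.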
Migration path
If storage becomes a bottleneck, there are several ways to move off cloud block storage (EBS, GCE PD): (1) export the model cache over NFS from a high-memory node, (2) use Ceph/Rook for distributed storage, (3) move to a hosted inference platform (e.g., Replicate, Modal) that handles model caching for you. At very large scale (100+ replicas), consider keeping the weights in object storage (S3) and populating a node-local NVMe cache with an init container or sidecar.
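The object-storage option could look roughly like the fragment below: an init container syncs weights from S3 into a node-local directory before vLLM starts. The bucket name, prefix, host path, and volume name are all hypothetical, and the sketch assumes the pod already has AWS credentials (for example through an IAM role).

```yaml
# Pod spec fragment; bucket, prefix, host path, and volume name are hypothetical.
initContainers:
  - name: fetch-weights
    image: amazon/aws-cli
    # The image's entrypoint is `aws`, so args start at the subcommand.
    # `s3 sync` copies only missing or changed files, so a warm node is fast.
    args: ["s3", "sync", "s3://example-model-bucket/llama-8b/", "/models/"]
    volumeMounts:
      - name: local-model-cache
        mountPath: /models
volumes:
  - name: local-model-cache
    hostPath:
      path: /mnt/nvme/model-cache   # assumption: local NVMe is mounted here
      type: DirectoryOrCreate
```

The vLLM container would mount the same volume and be pointed at the local directory instead of a Hub model ID.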
Cost model
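EBS volumes are billed per provisioned GB-month, prorated for the time the volume exists. AWS EBS gp3 lists at about $0.08/GB-month in us-east-1 (check current pricing for your region), and the 3,000 IOPS and 125 MB/s configured above are the included baseline; extra IOPS cost about $0.005 per provisioned IOPS-month above 3,000. So 100 GiB at baseline performance is roughly 100 × $0.08 ≈ $8/month. Savings: the volume avoids re-downloading the model on every pod replacement. With daily rolling updates of a 70 GB model, that is 70 GB × 30 = 2.1 TB of avoided downloads per replica per month. Inbound transfer into AWS is generally not billed, so the direct bandwidth saving is small. The PV pays for itself in startup time and operational stability.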
Common gotcha
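A PersistentVolumeClaim with access mode ReadWriteOnce can be mounted by only ONE node at a time; several pods can share it only if they all run on that node. If you have 3 replicas and one is scheduled on a different node, that pod never becomes ready: it sits in ContainerCreating with a Multi-Attach error (or stays Pending) because the volume is already attached elsewhere. Solutions: use ReadWriteMany on storage that supports it (NFS, EFS; EBS does not) if you need true multi-node access, OR keep ReadWriteOnce and use pod affinity to ensure all replicas run on the same node, OR use separate PVCs per replica with a StatefulSet and accept duplicated storage. Note that ReadWriteOncePod (stable since Kubernetes 1.29) is stricter, not looser: it restricts the volume to a single pod cluster-wide, so it cannot be shared between replicas.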
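Two of these options can be sketched in YAML. The first document is a ReadWriteMany claim; the storage class name efs-sc is hypothetical and stands for any class backed by a shared filesystem. The second part is a pod-template fragment that keeps ReadWriteOnce and co-locates replicas using the app: vllm-server label from the configuration above.

```yaml
# Option A: shared filesystem, mountable from many nodes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc   # hypothetical; must be backed by NFS/EFS-style storage
  resources:
    requests:
      storage: 100Gi
---
# Option B: fragment for spec.template.spec of the Deployment (not a full manifest).
# Forces every replica onto the node that already runs a vllm-server pod.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: vllm-server
        topologyKey: kubernetes.io/hostname
```

Option B only helps if one node has enough GPUs for all replicas. Otherwise the extra pods stay Pending for lack of resources rather than for lack of storage.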
Team adoption
Day 1: Document that the model cache is shared storage, not ephemeral. Enforce a StorageClass naming convention (e.g., fast-ssd for model serving, standard for logs).
Week 1: Add PVC size limits to prevent runaway storage (a ResourceQuota in the namespace).
Month 1: Automate snapshot-to-backup of the PV on model updates.
Failure mode to test: delete the PVC while pods are running. Kubernetes keeps an in-use PVC in the Terminating state until the pods release it, so verify what happens on the next restart: pods should come back on a new PVC rather than crash-loop, for example via init containers that check cache health.
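A namespace quota for the Week 1 step might look like the sketch below. The quota name and all limits are illustrative assumptions; the last key scopes a limit to the fast-ssd storage class.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: model-storage-quota   # hypothetical name
  namespace: default
spec:
  hard:
    # Maximum number of PVCs in the namespace
    persistentvolumeclaims: "5"
    # Total storage requested across all PVCs in the namespace
    requests.storage: 500Gi
    # Total storage requested from the fast-ssd class specifically
    fast-ssd.storageclass.storage.k8s.io/requests.storage: 300Gi
```

With this in place, a PVC that would push the namespace past any of the limits is rejected at creation time.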
Experienced dev note
Set allowVolumeExpansion: true in the StorageClass even if you don't think you'll need it. Growth from 8B to 70B models is inevitable, and without this flag, expanding the PVC fails. Also set reclaimPolicy: Retain on the StorageClass (or persistentVolumeReclaimPolicy: Retain on an existing PersistentVolume) so that deleting the PVC, for example during a namespace cleanup, doesn't wipe 100 GB of model weights by accident. Deleting the Deployment alone never deletes the PVC. Experienced teams also snapshot the PV before upgrades: a single corrupted file in the shared Hugging Face cache can break every pod that reads it.
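As a rough sketch of both safeguards, the manifests below define a StorageClass that retains volumes and a pre-upgrade snapshot of the claim. The resource names and the snapshot class are hypothetical, and the VolumeSnapshot assumes the CSI snapshot CRDs and controller are installed in the cluster.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain        # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain          # keep the underlying volume when the PVC is deleted
allowVolumeExpansion: true
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vllm-model-cache-pre-upgrade   # hypothetical name
  namespace: default
spec:
  volumeSnapshotClassName: ebs-snapshots   # hypothetical; must exist in the cluster
  source:
    persistentVolumeClaimName: vllm-model-cache
```

A retained volume is not reused automatically: after the PVC is deleted, the PV moves to Released and has to be reclaimed or rebound by hand.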
Check your understanding
If you have 3 vLLM pod replicas using the PersistentVolumeClaim config above, and the first pod downloads a 30GB model in 5 minutes, approximately how long does it take for the second and third replicas to be inference-ready after they start? Why?
Answer hint
The key is that ReadWriteOnce allows the volume to be mounted by only one node at a time. If all 3 replicas are scheduled on the same node (via affinity), they share the cache: replicas 2–3 skip the 5-minute download and become ready as soon as the weights are loaded from disk, which is typically much faster than downloading. If pods spread across nodes without affinity, replicas on other nodes cannot attach the volume and never become ready, so the rollout stalls. The real answer depends on your pod affinity and storage access mode.