Tool Advanced hard · 8 min config

PersistentVolume for model weights

What you will learn

Configure Kubernetes PersistentVolumes to serve pre-downloaded vLLM model weights across pod restarts and replicas without re-downloading.

Why this matters

Without a PersistentVolume, every new vLLM pod re-downloads 8–70GB of model weights from Hugging Face, which costs bandwidth and adds a 5–15 minute startup delay. In production with auto-scaling, this multiplies across replicas. A properly configured PV removes the download step from pod startup, so startup is bounded by loading weights from disk into GPU memory, and storage cost becomes predictable.

Skip if: ephemeral storage is enough when (1) the model fits in the container image (rare, since models are huge), (2) you are running single-replica local dev (use <code>docker run -v</code> instead), or (3) you are OK with downloading 50GB every time a pod is recreated. For batch inference jobs that don't need persistence between runs, consider init containers instead of PVs.

Explanation

A Kubernetes PersistentVolume (PV) is a cluster-wide storage resource decoupled from the pod lifecycle. When you pair it with a PersistentVolumeClaim (PVC), any vLLM pod that the volume's access mode allows can mount the same filesystem and read pre-downloaded models. By default vLLM caches models in the Hugging Face hub cache at ~/.cache/huggingface/hub; mounting a PV over /root/.cache/huggingface makes all replicas share that cache. The first pod to start downloads the model once; subsequent pods read from disk. This pattern scales from 2 replicas to 100 without multiplying bandwidth or download time, as long as the storage backend supports that many readers. The storage class (e.g., fast-ssd, ebs-gp3) determines speed: NVMe-backed PVs can significantly reduce model load time (roughly 20–40% faster inference startup than network storage, depending on model size and hardware).

Configuration

yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com  # AWS EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce  # attaches to one node at a time; all replicas must land on that node (see Common gotcha)
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.19.0
        ports:
        - containerPort: 8000
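        # The image does not read MODEL_NAME on its own; pass it to the server explicitly.
        # Assumes the image entrypoint accepts --model; Kubernetes expands $(MODEL_NAME) from env.
        args: ["--model", "$(MODEL_NAME)"]
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-3.1-8B-Instruct"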
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        resources:
          requests:
            memory: "16Gi"
            cpu: "4"
          limits:
            memory: "24Gi"
            cpu: "8"
            nvidia.com/gpu: 1  # vLLM needs a GPU; assumes the NVIDIA device plugin is installed
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: vllm-model-cache

Why this order?

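The StorageClass must exist before a PVC references it, and the PVC must exist before the Deployment references it in its volumes section. The initialDelaySeconds: 120 on the liveness probe delays the first health check by two minutes. That is enough for pods that find the model already cached, but a first download that takes 5–15 minutes will outlast it, and the kubelet will restart the container mid-download; raise the delay or use a startupProbe for the initial rollout. Once the cache is warm, later pods skip the download, and a misconfigured pod fails its health checks sooner.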
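One way to give the first download enough time without loosening the liveness probe is a startup probe. The fragment below is a minimal sketch for the vllm container in the Deployment above; the 15-minute budget is an assumption taken from the 5–15 minute download range mentioned earlier, so tune it to your model size and bandwidth.

```yaml
# Fragment for the "vllm" container spec (sketch; thresholds are assumptions).
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 90    # 90 checks x 10s = up to 15 minutes before the container is restarted
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10       # liveness checks begin only after the startup probe succeeds
```

With a startup probe in place, initialDelaySeconds on the liveness probe is no longer needed, because the kubelet holds off liveness checks until startup has succeeded.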

Wrong vs Right

Wrong way

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.19.0
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        emptyDir: {}

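This uses <code>emptyDir</code>, which gives each pod its own ephemeral volume. The volume survives container restarts inside the same pod but is deleted when the pod is removed, so every rescheduled pod and every new replica downloads the full model again. With a 70GB model and 3 replicas, you download 210GB in total even if the pods start one after another.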

Right way

Use the PersistentVolumeClaim shown in the Configuration section above. The PVC makes all 3 replicas mount the same underlying storage, provided the access mode and scheduling allow it (see Common gotcha). The first pod downloads once; pods 2–3 read from the cache. Total download: 70GB instead of 210GB.

Tool vitals

Primary command

bash

kubectl apply -f pvc-claim.yaml

Config files: pv-persistent-volume.yaml, pvc-claim.yaml, deployment.yaml

Verify

bash

kubectl get pv,pvc && kubectl exec <pod-name> -- ls -lh /root/.cache/huggingface/hub

Integration notes

Pair this with a StatefulSet (not Deployment) if you need one PVC per replica. Use volumeClaimTemplates in StatefulSet to auto-create PVCs. For multi-region setups, consider snapshot-based replication or external storage (S3) with model caching sidecars. With vLLM's OpenAI-compatible API, this PV setup feeds into load balancers (Kubernetes Service type LoadBalancer) without modification.
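The one-PVC-per-replica variant could look roughly like the sketch below. It reuses the names from the Configuration section; the headless Service named vllm-server is an assumption and must be created separately.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-server
spec:
  serviceName: vllm-server        # assumes a headless Service with this name exists
  replicas: 3
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.19.0
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
  volumeClaimTemplates:           # one PVC is created per replica from this template
  - metadata:
      name: model-cache
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
```

Each replica gets its own claim (model-cache-vllm-server-0, model-cache-vllm-server-1, and so on), so every replica downloads the model once and storage cost scales with the replica count.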

Migration path

If storage becomes a bottleneck, migrate from cloud block storage (EBS, GCE PD) to storage built for shared model serving: (1) export the model cache over NFS from a high-memory node, (2) use Ceph/Rook for distributed storage, or (3) move to hosted inference platforms (Replicate, Modal) that handle weight caching transparently. For massive scale (100+ replicas), consider object storage (S3) as the source of truth, with a node-local NVMe cache that an init container or sidecar fills at startup.
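As a rough illustration of the object-storage option, the pod-spec fragment below copies weights from S3 into a pod-local volume before vLLM starts. The bucket path is hypothetical, and the sketch assumes the pod already has S3 read credentials (for example through an IAM role bound to its service account) and that the image entrypoint accepts --model with a local path.

```yaml
# Fragment for spec.template.spec of a vLLM pod (sketch under the assumptions above).
initContainers:
- name: fetch-weights
  image: amazon/aws-cli            # the image entrypoint is the aws CLI
  args: ["s3", "sync", "s3://example-model-bucket/llama", "/models"]  # hypothetical bucket
  volumeMounts:
  - name: local-models
    mountPath: /models
containers:
- name: vllm
  image: vllm/vllm-openai:v0.19.0
  args: ["--model", "/models"]     # serve from the local copy instead of the hub cache
  volumeMounts:
  - name: local-models
    mountPath: /models
volumes:
- name: local-models
  emptyDir: {}                     # re-synced for every new pod; a local PV on NVMe would persist across pods
```

The trade-off is that every new pod pays the sync time, but the copy comes from storage in your own account rather than from Hugging Face, and no volume is shared between nodes.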

Cost model

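PersistentVolume storage is billed for provisioned capacity, whether or not it is used. AWS EBS gp3 list price (us-east-1; check current pricing for your region): about $0.08/GB-month, with a baseline of 3000 IOPS and 125 MB/s included, and about $0.005 per provisioned IOPS-month above 3000. The 100Gi volume above, at the 3000 IOPS baseline, therefore costs about $8/month. Savings: the PV avoids re-downloading the model on every pod recreation. With a 70GB model and daily rolling updates, that is 70GB × 30 = 2.1TB of avoided downloads per replica per month. Inbound transfer is usually free, so the gain is mostly startup time rather than bandwidth cost. The PV pays for itself in ops stability alone.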

Common gotcha

A PersistentVolumeClaim with accessModes: ReadWriteOnce can be mounted by only ONE node at a time; several pods can share it only if they all run on that node. If you have 3 replicas and one is scheduled onto a different node, that pod never starts (typically stuck in ContainerCreating with a Multi-Attach error) because the volume is already attached to another node. Solution: use ReadWriteMany (NFS, EFS) if you need true multi-node access, OR keep ReadWriteOnce and add pod affinity so all replicas run on the same node, OR use separate PVCs per replica with a StatefulSet and accept duplicated storage. Do not reach for ReadWriteOncePod (stable since Kubernetes 1.29) here: it restricts the volume to a single pod, which is the opposite of sharing.
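The two sharing options can be sketched as follows. The first fragment co-locates all replicas on one node so a ReadWriteOnce volume works; the second is a ReadWriteMany claim, where efs-sc is a hypothetical StorageClass backed by a shared filesystem such as NFS or EFS.

```yaml
# Option A: fragment for spec.template.spec of the Deployment (keep ReadWriteOnce).
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: vllm-server
      topologyKey: kubernetes.io/hostname   # every replica must share a node with the others
---
# Option B: a claim that multiple nodes can mount at once.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc                  # hypothetical; must point at an RWX-capable backend
  resources:
    requests:
      storage: 100Gi
```

Option A only works if a single node has enough GPUs for every replica, and it removes node-level redundancy. Option B keeps replicas spread out but usually reads weights more slowly than local block storage.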

Team adoption

Day 1: Document that the model cache is shared storage, not ephemeral. Enforce a StorageClass naming convention (e.g., fast-ssd for model serving, standard for logs). Week 1: Add PVC size limits to prevent runaway storage (a ResourceQuota on the Namespace). Month 1: Automate snapshot backups of the PV on model updates. Failure mode to test: delete the PVC while pods are running. Kubernetes holds the PVC in Terminating until no pod uses it, so verify that running pods keep serving and that replacement pods start cleanly on a new PVC, for example via init containers that check cache health.
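The Week 1 storage limit could be expressed as a ResourceQuota along these lines. The quota name and the numbers are placeholders, not recommendations.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: model-storage-quota       # hypothetical name
  namespace: default
spec:
  hard:
    persistentvolumeclaims: "5"   # max number of PVCs in the namespace
    requests.storage: 500Gi       # total requested storage across all PVCs
    fast-ssd.storageclass.storage.k8s.io/requests.storage: 300Gi  # cap for the fast-ssd class only
```

Once the quota is active, a PVC that would push the namespace past any of these limits is rejected at creation time.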

Experienced dev note

Set allowVolumeExpansion: true in the StorageClass even if you don't think you'll need it. Growth from 8B to 70B models is inevitable, and without this flag, expanding the PVC fails. Also set reclaimPolicy: Retain on the StorageClass (or persistentVolumeReclaimPolicy: Retain on an existing PersistentVolume) so that deleting the PVC by accident doesn't wipe 100GB of model weights; deleting the Deployment alone leaves the PVC and its data in place. Experienced teams also snapshot the PV before upgrades: one corruption of the Hugging Face cache index and all pods fail silently.
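A StorageClass carrying both safeguards might look like the sketch below. It assumes the AWS EBS CSI driver is installed; volumeBindingMode is an optional addition that delays provisioning until a pod is scheduled, so the volume is created in that pod's availability zone.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com              # assumes the AWS EBS CSI driver
parameters:
  type: gp3
reclaimPolicy: Retain                     # keep the PV and its data when the PVC is deleted
allowVolumeExpansion: true                # allow growing the PVC later
volumeBindingMode: WaitForFirstConsumer   # optional: provision in the zone of the first pod
```

The reclaim policy here applies only to volumes provisioned after the class is created. A retained PV is left in the Released state after its PVC is deleted and has to be cleaned up or re-bound by hand.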

Check your understanding

If you have 3 vLLM pod replicas using the PersistentVolumeClaim config above, and the first pod downloads a 30GB model in 5 minutes, approximately how long does it take for the second and third replicas to be inference-ready after they start? Why?

Show answer hint

The key is that ReadWriteOnce allows the volume to be mounted by only one node at a time. If all 3 replicas are scheduled on the same node (via affinity), they share the cache: replicas 2–3 skip the 5-minute download, so their startup is bounded by reading the weights from disk and loading them onto the GPU rather than by the network. If pods spread across nodes without affinity, replicas on other nodes cannot attach the volume that is already attached to replica 1's node, so they never start and the rollout stalls. The real answer depends on your pod affinity and storage access mode.
