Tool Intermediate medium · 8 min config

Auto-scaling inference endpoints: when and how

What you will learn

Configure Kubernetes Horizontal Pod Autoscaler (HPA) to automatically add or remove inference replicas based on CPU, memory, or custom metrics.

Why this matters

Without autoscaling, you either overprovision (wasting money on idle replicas) or underprovision (requests time out during traffic spikes). HPA bridges this gap by scaling on actual demand, which is critical for production inference workloads where traffic patterns are unpredictable.

Skip if: traffic is stable and predictable (rare), your team is small and has no observability infrastructure, or you're in a development/staging environment with minimal load. Fixed replicas are enough in those cases. On serverless platforms (AWS Lambda, Google Cloud Functions), rely on the platform's built-in scaling instead of managing HPA yourself. On a single-node K8s cluster, HPA adds little, because new replicas compete for the same node's resources.

Explanation

Kubernetes HPA automatically scales the number of Pod replicas up or down by monitoring metrics. The most common metric is CPU utilization (e.g., scale up when average CPU exceeds 70%), but you can also scale on memory, request latency, or custom metrics from your inference framework. HPA makes scaling decisions every 15 seconds by default, comparing current metric values against target thresholds. When scaling up, new Pod replicas are created; when scaling down, replicas are gracefully terminated. This is especially important for inference endpoints because prediction workloads are bursty: a batch inference job or a sudden traffic spike can cause CPU to spike, and HPA responds by adding capacity. The key difference from manual scaling is that HPA is reactive (responds to demand) and automatic (no human intervention), which is essential for production systems that must handle variable load.
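The comparison of current metric values against targets can be sketched in a few lines. This is a simplified model of the documented HPA formula (desired = ceil(current replicas × current value / target value)), assuming the default tolerance of 0.1; the function names are illustrative, not part of any Kubernetes client library.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count suggested by a single metric (simplified HPA formula)."""
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def clamp(replicas, min_replicas, max_replicas):
    """Apply the minReplicas / maxReplicas bounds."""
    return max(min_replicas, min(max_replicas, replicas))


# 4 replicas averaging 90% CPU against a 70% target, bounded to 2..10.
suggested = desired_replicas(4, 90, 70)
print(clamp(suggested, 2, 10))
```

With these inputs the ratio is about 1.29, so the sketch suggests 6 replicas. At 72% average CPU the ratio falls inside the tolerance band and the count stays at 4.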

Configuration

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-endpoint-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max

Why this order?

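YAML key order is not significant to Kubernetes; the manifest is laid out in the order that is easiest to reason about. scaleTargetRef identifies which Deployment to scale. minReplicas/maxReplicas set absolute bounds that clamp every scaling decision. Each metric is evaluated independently: HPA computes a desired replica count per metric and uses the largest, so any one metric above its target can trigger a scale-up, while a scale-down happens only when every metric allows it. The behavior section (scaleDown/scaleUp) then limits how fast the replica count may change and how long HPA waits before shrinking, which prevents flapping (rapid up/down oscillations).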
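As a rough sketch of how the pieces of the manifest combine on scale-up: the largest per-metric proposal wins, and the scaleUp policies cap how far one step can go. The model below is deliberately simplified (it ignores the periodSeconds history the real controller tracks), and the function names are hypothetical.

```python
import math


def scale_up_limit(current_replicas, policies, select_policy="Max"):
    """Highest replica count the scaleUp policies allow in one step.

    Simplification: ignores how many Pods were already added within
    each policy's periodSeconds.
    """
    allowed = []
    for policy in policies:
        if policy["type"] == "Percent":
            extra = math.ceil(current_replicas * policy["value"] / 100)
        else:  # "Pods"
            extra = policy["value"]
        allowed.append(current_replicas + extra)
    return max(allowed) if select_policy == "Max" else min(allowed)


def next_replicas(current_replicas, per_metric_proposals, policies, max_replicas):
    """Largest proposal across metrics, capped by policy and maxReplicas."""
    wanted = max(per_metric_proposals)
    limit = scale_up_limit(current_replicas, policies)
    return min(wanted, limit, max_replicas)


# Policies mirroring the scaleUp section of the manifest.
POLICIES = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 2}]

# CPU proposes 9 replicas, memory proposes 3, and 2 replicas are running.
print(next_replicas(2, [3, 9], POLICIES, 10))
```

Starting from 2 replicas, both policies allow at most 4, so the sketch returns 4 even though CPU asks for 9. The remaining capacity arrives in later steps.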

Wrong vs Right

Wrong way

yaml

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: inference-endpoint-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80

Right way

Use autoscaling/v2 instead of v1. v1 only supports CPU-based scaling; v2 supports CPU, memory, and custom metrics. Set minReplicas: 2 (avoid single replicas for availability). Use the behavior section to prevent scale-flapping (the oscillation where HPA scales up and down rapidly). Include memory targets alongside CPU, because CPU alone misses memory-bound workloads. Define explicit scaleUp and scaleDown policies with stabilizationWindowSeconds to smooth out noise in metrics.

Tool vitals

Primary command

bash

kubectl apply -f hpa.yaml

Config file hpa.yaml (HorizontalPodAutoscaler manifest)

Verify

bash

kubectl get hpa -n ml-serving -o wide && kubectl describe hpa inference-endpoint-hpa -n ml-serving

Integration notes

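HPA works alongside your inference Deployment (e.g., a BentoML or vLLM service). The Deployment defines the Pod template (container image, resource requests); HPA watches the metrics for those Pods and adjusts the replica count. For custom metrics (e.g., inference queue depth), expose them from your inference framework's metrics endpoint (BentoML exposes /metrics on port 3000), scrape them with Prometheus, and publish them to the Kubernetes custom metrics API through an adapter such as Prometheus Adapter; HPA cannot query Prometheus directly. If you use MLflow Model Registry for versioning, make sure the image or model reference in your Deployment corresponds to the registered model version you intend to serve, since HPA scales whatever the Deployment currently runs.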

Migration path

If you move away from Kubernetes (e.g., to AWS SageMaker Endpoints or Google Cloud Vertex AI), those platforms provide native autoscaling via their APIs: no YAML needed. If moving to a serverless inference platform (AWS Lambda, Google Cloud Run), autoscaling is implicit; requests are handled by the platform without manual HPA configuration.

Cost model

HPA itself is free: it's a native Kubernetes controller. However, scaling up adds Pod replicas, and each replica consumes resources (CPU, memory, storage) that incur cloud costs. To estimate the ceiling: if maxReplicas is 10 and each replica uses 1 CPU and 2 GB RAM, then at illustrative prices of $0.05/CPU-hour and $0.01/GB-hour each replica costs $0.07/hour, so 10 replicas running for a full month (about 730 hours) cost roughly $511 per deployment. minReplicas: 2 sets a cost floor of roughly $102/month at the same prices. Monitor actual versus predicted scaling patterns to tune maxReplicas and avoid paying for capacity you don't need.
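The cost arithmetic can be worked through with a small helper. This sketch uses the example figures from the paragraph above (1 CPU and 2 GB per replica, $0.05/CPU-hour, $0.01/GB-hour); the 730-hour month and the function name are assumptions for illustration, and real cloud pricing will differ.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_cost(replicas, cpus, ram_gb, cpu_price_per_hour, gb_price_per_hour):
    """Compute cost of running `replicas` Pods for a full month."""
    hourly_per_replica = cpus * cpu_price_per_hour + ram_gb * gb_price_per_hour
    return replicas * hourly_per_replica * HOURS_PER_MONTH


# Ceiling at maxReplicas: 10, floor at minReplicas: 2.
ceiling = monthly_cost(10, 1, 2, 0.05, 0.01)
floor = monthly_cost(2, 1, 2, 0.05, 0.01)
print(f"ceiling: ${ceiling:.2f}/month, floor: ${floor:.2f}/month")
```

Actual spend falls between the floor and the ceiling, depending on how long the deployment spends at each replica count.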

Common gotcha

HPA requires the Metrics Server to be installed in your cluster (check with kubectl get deployment metrics-server -n kube-system). Without it, HPA reports <unknown> for resource metrics and never scales. Also, if your Deployment's Pod template does not specify resource requests (cpu and memory under containers[].resources.requests), HPA cannot calculate a utilization percentage, because utilization is measured relative to the request. The HPA object is still created without complaint; the failure surfaces only in its conditions and events (kubectl describe hpa). Always include resource requests in your Deployment: without them, HPA has no baseline to measure against.
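One way to catch missing resource requests before deployment is a small lint step in CI. The sketch below assumes the Deployment manifest has already been parsed into a Python dict (for example by a YAML loader); the function name, container names, and image names are hypothetical.

```python
def containers_missing_requests(deployment, required=("cpu", "memory")):
    """Map container name -> resources with no request in the manifest as written."""
    pod_spec = deployment["spec"]["template"]["spec"]
    missing = {}
    for container in pod_spec["containers"]:
        requests = (container.get("resources") or {}).get("requests") or {}
        absent = [name for name in required if name not in requests]
        if absent:
            missing[container["name"]] = absent
    return missing


# Hypothetical Deployment: the sidecar declares no resources at all.
example_deployment = {
    "kind": "Deployment",
    "metadata": {"name": "inference-service", "namespace": "ml-serving"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "model-server", "image": "example.com/inference:1.0",
         "resources": {"requests": {"cpu": "1", "memory": "2Gi"}}},
        {"name": "log-shipper", "image": "example.com/log-shipper:1.0"},
    ]}}},
}

print(containers_missing_requests(example_deployment))
```

Every container in the Pod needs a request for each resource HPA scales on, so a sidecar without requests can break utilization-based scaling even when the main container is configured correctly.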

Team adoption

1. Ensure Metrics Server is installed: kubectl get deployment metrics-server -n kube-system. If it is missing, add the chart repository (helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/) and then run helm install metrics-server metrics-server/metrics-server -n kube-system.
2. Create a standard HPA template in your team's Helm charts or kustomize base with sensible defaults (minReplicas: 2, maxReplicas: 10, CPU target: 70%, memory target: 80%).
3. Document resource requests in your Deployment template: without them, HPA cannot function. Add this to your deployment checklist.
4. Watch HPA events (kubectl get events -n ml-serving --field-selector involvedObject.kind=HorizontalPodAutoscaler) and set up alerts in your monitoring stack to notify the team when an endpoint scales frequently, which indicates overprovisioning, underprovisioning, or thresholds that need tuning.
5. Run load tests against your inference service to establish realistic maxReplicas and stabilization window settings before production rollout.

Experienced dev note

Set stabilizationWindowSeconds: 0 for scaleUp and stabilizationWindowSeconds: 300 for scaleDown. These match the Kubernetes defaults, but stating them explicitly documents the intent. This asymmetry lets you respond to traffic spikes immediately (fast scale-up) while preventing the thrashing of rapid scale-downs during natural metric fluctuations. Many teams learn this only after weeks of watching HPA oscillate between 2 and 10 replicas every few minutes. Also, combine Percent and Pods policies with selectPolicy: Max. Percent lets larger deployments grow proportionally, while Pods guarantees a useful minimum step when the replica count is small (100% of 1 replica adds only 1 Pod, whereas the Pods policy allows 2).
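To see why a 300-second scale-down window damps oscillation, here is a minimal sketch of the idea: for scale-down, the controller acts on the highest recommendation seen within the window rather than on the latest one. The class below is an illustrative model, not the controller's actual code, and its names are hypothetical.

```python
from collections import deque


class ScaleDownStabilizer:
    """Returns the highest recommendation seen within the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommendation):
        self.history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self.history[0][0] <= now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.stabilize(0, 8))    # spike: 8 replicas recommended
print(stabilizer.stabilize(15, 3))   # load drops, but the spike is still in the window
print(stabilizer.stabilize(300, 3))  # spike has aged out
```

In this model the replica count holds at 8 until the spike's recommendation is 300 seconds old, and only then drops to 3. A short dip in load therefore never removes Pods that the next burst would need.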

Check your understanding

Your inference service's CPU usage spikes to 85% for 10 seconds (a request batch arrives), then drops to 40%. Why might HPA not scale up immediately, and why is that behavior actually desirable?

Answer hint

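HPA does not see every instant of CPU usage. The controller evaluates metrics on a 15-second loop by default, and the values it reads from the Metrics Server are themselves sampled over a short window and averaged across all replicas. A 10-second spike can fall between samples or be diluted below the 70% target, so no scale-up is triggered. That is desirable: a new replica needs time to be scheduled, pull its image, and load the model, so it would likely become ready after the spike has passed, adding cost without helping. Note that with scaleUp stabilizationWindowSeconds: 0, as configured above, stabilization does not delay scale-up. If the elevated usage is observed in even one evaluation, HPA adds replicas, and the 300-second scaleDown window then keeps them around for a while.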
