Auto-scaling inference endpoints: when and how
Why this matters
Without autoscaling, you either overprovision (wasting money on idle replicas) or underprovision (requests time out during traffic spikes). The Kubernetes Horizontal Pod Autoscaler (HPA) bridges this gap by scaling on actual demand, which is critical for production inference workloads where traffic patterns are unpredictable.
Explanation
Kubernetes HPA automatically scales the number of Pod replicas up or down by monitoring metrics. The most common metric is CPU utilization (e.g., scale up when average CPU exceeds 70%), but you can also scale on memory, request latency, or custom metrics from your inference framework. HPA makes scaling decisions every 15 seconds by default, comparing current metric values against target thresholds. When scaling up, new Pod replicas are created; when scaling down, replicas are gracefully terminated. This is especially important for inference endpoints because prediction workloads are bursty: a batch inference job or a sudden traffic spike can cause CPU to spike, and HPA responds by adding capacity. The key difference from manual scaling is that HPA is reactive (responds to demand) and automatic (no human intervention), which is essential for production systems that must handle variable load.
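The comparison of current metric values against targets can be sketched in a few lines. This is a simplified illustration of the replica formula given in the Kubernetes documentation (desired = ceil(current replicas × current value / target value), skipped when the ratio is within a default tolerance of 0.1). The function name and arguments are hypothetical, and the real controller also accounts for Pod readiness and missing metrics.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_value / target_value
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 replicas averaging 90% CPU against a 70% target: 4 * 90/70 = 5.14, rounded up.
print(desired_replicas(4, 90, 70))  # 6
# 72% against 70% is inside the 10% tolerance, so nothing changes.
print(desired_replicas(4, 72, 70))  # 4
# 35% against 70% halves the replica count.
print(desired_replicas(4, 35, 70))  # 2
```

Rounding up means HPA errs on the side of extra capacity when the computed count is fractional.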
Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-endpoint-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 2
          periodSeconds: 30
      selectPolicy: Max
Why this order?
Key order inside a YAML mapping has no effect on how Kubernetes interprets the manifest, but this layout follows the order in which the controller reasons. scaleTargetRef identifies which Deployment to scale. minReplicas/maxReplicas set absolute bounds that no metric can override. Each entry under metrics is evaluated independently: HPA computes a desired replica count per metric and acts on the largest one. The behavior section (scaleDown/scaleUp) applies last, controlling rate and stability to prevent flapping (rapid up/down oscillations).
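How several metrics and the replica bounds combine can be sketched as follows. This is a minimal illustration, assuming the documented behavior that HPA computes a proposal per metric and takes the largest. The function name and sample numbers are hypothetical, and the tolerance band and behavior policies are left out for brevity.

```python
import math


def recommend(current: int, metrics: list[tuple[float, float]],
              min_replicas: int, max_replicas: int) -> int:
    """Pick the largest per-metric proposal, then clamp to the configured bounds.

    `metrics` holds (current_value, target_value) pairs, one per metric.
    """
    proposals = [math.ceil(current * cur / tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# CPU is below target (50% vs 70%) but memory is above (95% vs 80%):
# CPU proposes 3, memory proposes 5, so memory wins.
print(recommend(4, [(50, 70), (95, 80)], 2, 10))  # 5
# A large CPU overshoot proposes 16 replicas, capped by maxReplicas.
print(recommend(8, [(140, 70), (60, 80)], 2, 10))  # 10
# Both metrics are idle and propose 1 replica, raised to minReplicas.
print(recommend(2, [(10, 70), (20, 80)], 2, 10))  # 2
```

Taking the maximum means a single saturated resource is enough to scale up, while scaling down requires every metric to agree.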
Wrong vs Right
Wrong:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: inference-endpoint-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
Right:
- Use autoscaling/v2 instead of v1. v1 only supports CPU-based scaling; v2 supports CPU, memory, and custom metrics.
- Set minReplicas: 2 (avoid single replicas for availability).
- Use the behavior section to prevent scale-flapping (the oscillation where HPA scales up and down rapidly).
- Include memory targets alongside CPU: CPU alone misses memory-bound workloads.
- Define explicit scaleUp and scaleDown policies with stabilizationWindowSeconds to smooth out noise in metrics.
Tool vitals
Apply: kubectl apply -f hpa.yaml
Config file: hpa.yaml (HorizontalPodAutoscaler manifest)
Verify: kubectl get hpa -o wide && kubectl describe hpa <hpa-name>
Integration notes
HPA works in concert with your inference Deployment (e.g., a BentoML or vLLM service). The Deployment defines the Pod template (container image, resource requests); HPA watches it and adds/removes replicas. Custom metrics (e.g., inference queue depth) must reach HPA through the Kubernetes custom metrics API: typically Prometheus scrapes your inference framework's metrics endpoint (BentoML exposes /metrics on port 3000) and an adapter such as Prometheus Adapter publishes the result to the cluster. If using MLflow Model Registry for versioning, make sure the Deployment's image tag corresponds to the model version you intend to serve, since HPA replicates whatever the Pod template specifies.
Migration path
If you move away from Kubernetes (e.g., to AWS SageMaker Endpoints or Google Cloud Vertex AI), those platforms provide native autoscaling via their APIs: no YAML needed. If moving to a serverless inference platform (AWS Lambda, Google Cloud Run), autoscaling is implicit; requests are handled by the platform without manual HPA configuration.
Cost model
HPA itself is free: it's a Kubernetes native controller. However, scaling up adds Pod replicas, and each replica consumes resources (CPU, memory, storage) which incur cloud costs. Calculate: if maxReplicas is 10 and each replica uses 1 CPU and 2 GB RAM, at illustrative cloud prices ($0.05/CPU-hour, $0.01/GB-hour) each replica costs $0.07/hour, so running at full scale costs $0.70/hour, or roughly $510/month (730 hours) per deployment. minReplicas: 2 sets a baseline cost floor of roughly $100/month. Monitor actual vs. predicted scale patterns to tune maxReplicas and avoid unnecessary scaling.
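The cost arithmetic can be checked with a short script. This is a sketch using the example prices and replica sizes from the paragraph above. The 730-hour month (365 × 24 / 12) and the function name are assumptions, and real bills also include storage, networking, and node overhead.

```python
HOURS_PER_MONTH = 730  # assumption: 365 * 24 / 12


def monthly_cost(replicas: int, cpu_per_replica: float, gb_per_replica: float,
                 cpu_price: float = 0.05, gb_price: float = 0.01,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost in dollars for replicas that run the whole month."""
    hourly_per_replica = cpu_per_replica * cpu_price + gb_per_replica * gb_price
    return replicas * hourly_per_replica * hours


ceiling = monthly_cost(10, 1, 2)  # pinned at maxReplicas all month
floor = monthly_cost(2, 1, 2)     # never above minReplicas

print(f"ceiling: ${ceiling:.2f}/month")  # ceiling: $511.00/month
print(f"floor:   ${floor:.2f}/month")    # floor:   $102.20/month
```

Actual spend falls between the floor and the ceiling, depending on how many hours the service spends at each replica count.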
Common gotcha
HPA requires the Metrics Server to be installed in your cluster (check with kubectl get deployment metrics-server -n kube-system). Without it, HPA reports "<unknown>" for metrics and never scales. Also, if your Deployment's Pod template does NOT specify resource requests (cpu: and memory: under containers[].resources.requests), HPA cannot calculate a utilization percentage and will not scale. The manifest is still accepted, so the failure is easy to miss: it shows up only in the HPA's conditions and events (kubectl describe hpa <hpa-name>). Always include resource requests in your Deployment: without them, HPA has no baseline to measure against.
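Why resource requests matter becomes clear from how a utilization percentage is derived. This is a simplified sketch, assuming every Pod shares the same request, with usage expressed in millicores. The function name is hypothetical, and the real controller reads these values from the metrics API.

```python
from typing import Optional


def utilization_percent(usage_millicores: list[int],
                        request_millicores: Optional[int]) -> Optional[float]:
    """Utilization = total usage / total requested, as a percentage.

    Returns None when there is no request, mirroring the "<unknown>" state.
    """
    if not request_millicores or not usage_millicores:
        return None  # no denominator: utilization cannot be computed
    total_requested = request_millicores * len(usage_millicores)
    return 100 * sum(usage_millicores) / total_requested


# Three Pods, each requesting 500m CPU: (350 + 420 + 280) / 1500 = 70%.
print(utilization_percent([350, 420, 280], 500))  # 70.0
# Same usage but no request in the Pod template: nothing to compare against.
print(utilization_percent([350, 420, 280], None))  # None
```

The target of 70% in the manifest is therefore 70% of the requested CPU, not 70% of the node or of the container limit.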
Team adoption
1. Ensure Metrics Server is installed: kubectl get deployment metrics-server -n kube-system. If missing, add the chart repository and install it: helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/ && helm install metrics-server metrics-server/metrics-server -n kube-system.
2. Create a standard HPA template in your team's Helm charts or kustomize base with sensible defaults (minReplicas: 2, maxReplicas: 10, CPU target: 70%, memory target: 80%).
3. Document resource requests in your Deployment template: without them, HPA cannot function. Add this to your deployment checklist.
4. Set up alerts on HPA events to notify the team when an endpoint is scaling frequently, indicating potential resource overprovisioning or underprovisioning. For manual inspection, kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler lists recent HPA events in the current namespace.
5. Run load tests against your inference service to establish realistic max-replica and stabilization window settings before production rollout.
Experienced dev note
Set stabilizationWindowSeconds: 0 for scaleUp and stabilizationWindowSeconds: 300 for scaleDown. This asymmetry lets you respond to traffic spikes immediately (fast scale-up) while preventing the thrashing of rapid scale-downs during natural metric fluctuations. Most teams discover this after weeks of HPA oscillating between 2 and 10 replicas every few minutes. Also, list both Percent and Pods policies under behavior.scaleUp.policies with selectPolicy: Max, so the more permissive of the two applies in each period: Percent grows capacity proportionally once the replica count is large, while Pods guarantees a minimum step when the count is small and a percentage would add very few replicas.
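The interaction of the two scale-up policies can be sketched numerically. This is a simplified illustration, assuming selectPolicy: Max picks whichever policy permits the larger replica count within one period. The function name is hypothetical, and replicas already added earlier in the same period are ignored.

```python
import math


def scale_up_limit(current: int, percent: int = 100, pods: int = 2,
                   max_replicas: int = 10) -> int:
    """Highest replica count reachable in one period under selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))  # Percent policy
    by_pods = current + pods                               # Pods policy
    return min(max_replicas, max(by_percent, by_pods))


# At 1 replica the Pods policy is more generous (3 vs 2).
print(scale_up_limit(1))  # 3
# At 4 replicas the Percent policy is more generous (8 vs 6).
print(scale_up_limit(4))  # 8
# At 6 replicas doubling would give 12, but maxReplicas caps it.
print(scale_up_limit(6))  # 10
```

These are upper limits per period. The metric-driven recommendation still decides how many replicas HPA actually asks for.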
Check your understanding
Your inference service's CPU usage spikes to 85% for 10 seconds (a request batch arrives), then drops to 40%. Why might HPA not scale up immediately, and why is that behavior actually desirable?
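Answer hint
HPA does not see instantaneous CPU. The metrics pipeline reports usage averaged over a sampling window and across all replicas, and the controller re-evaluates only on its sync period (15 seconds by default), so a 10-second spike is smoothed out or falls between evaluations. HPA also ignores readings within a tolerance (10% by default) of the target, which can swallow what remains of the spike after averaging. Note that stabilizationWindowSeconds: 0 on scaleUp adds no delay of its own; a nonzero scaleUp window would make HPA wait even longer before acting. Not reacting is desirable because scaling up on every blip wastes resources (cold-start time, scheduling overhead) on replicas that are no longer needed by the time they are ready. HPA should add capacity only when high utilization is sustained.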