Tool Intermediate medium · 8 min best_practice

Batch inference vs real-time serving: choosing the right pattern

What you will learn

Choose between batch inference (high throughput, delayed results) and real-time serving (low latency, per-request cost) based on your latency, throughput, and cost constraints.

Why this matters

Picking the wrong serving pattern wastes infrastructure spend, causes user-facing latency issues, or creates unnecessary operational complexity. A recommendation system that runs batch inference every 6 hours cannot serve fresh predictions. A fraud detection system running real-time inference on every transaction may cost 10x more than batch alternatives if you can tolerate a 5-minute delay.

Skip if: you make fewer than 100 predictions per day and latency doesn't matter; a simple cron job plus a SQLite file is sufficient. Also skip if your model never changes and its inputs are stable (no feature drift), since you can precompute and cache everything.

Explanation

Batch inference and real-time serving sit at opposite ends of the ML deployment spectrum, each with distinct cost, latency, and operational trade-offs. Batch inference processes large volumes of data in a single job (typically hourly, daily, or weekly) and stores the precomputed results in a database or data warehouse. Real-time serving exposes an API endpoint that computes a prediction synchronously for each incoming request, usually through a containerized model server such as BentoML, vLLM, or KServe.

The choice hinges on three dimensions: (1) latency tolerance: can your use case accept predictions computed hours ago? (2) throughput demand: do you have thousands of concurrent prediction requests, or a steady low volume? (3) feature freshness: do predictions require features that change by the minute, or can you use historical snapshots?

Batch inference excels when you know what to predict ahead of time (e-commerce product recommendations, email rankings, weekly customer segmentation). Real-time serving is required when predictions depend on immediate user input (chatbot responses, fraud checks that must block a transaction, on-the-fly price optimization). The infrastructure also differs: batch uses scheduled, scalable compute (Kubernetes jobs, Spark clusters), often orchestrated with DVC pipelines, and stores results durably; real-time uses always-on services behind API gateways, load balancers, and health checks.
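The three questions above can be turned into a rough first-pass rule. The sketch below is only an illustration: the `UseCase` fields, the `choose_pattern` name, the 6-hour default batch interval, and the "hybrid" label are assumptions made for this example, and throughput is deliberately left out of the rule.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Requirements for one prediction use case (all fields are illustrative)."""
    name: str
    max_staleness_s: float      # oldest acceptable prediction age, in seconds
    inputs_known_ahead: bool    # can every input be enumerated before the request?
    needs_live_features: bool   # do features change within a session or minute?


def choose_pattern(uc: UseCase, batch_interval_s: float = 6 * 3600) -> str:
    """Return 'real-time', 'batch', or 'hybrid' for a use case.

    Rule of thumb only: real-time when inputs are not known ahead of time or
    features must be live; batch when results may be as old as the batch
    interval; otherwise hybrid (a batch baseline plus a real-time path).
    """
    if not uc.inputs_known_ahead or uc.needs_live_features:
        return "real-time"
    if uc.max_staleness_s >= batch_interval_s:
        return "batch"
    return "hybrid"


if __name__ == "__main__":
    cases = [
        UseCase("weekly customer segmentation", 7 * 86400, True, False),
        UseCase("chatbot response", 1, False, True),
        UseCase("homepage recommendations", 3600, True, False),
    ]
    for uc in cases:
        print(f"{uc.name}: {choose_pattern(uc)}")
```

A rule like this is a starting point for discussion, not a substitute for measuring actual latency and cost.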

Configuration

yaml

# BATCH INFERENCE: DVC pipeline for daily prediction jobs
# dvc.yaml
stages:
  prepare_batch_data:
    cmd: python scripts/prepare_batch.py --date $(date +%Y-%m-%d)
    deps:
      - scripts/prepare_batch.py
      - data/raw/
    outs:
      - data/batch_input.parquet
  run_batch_inference:
    cmd: |
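      python -c "
      import pickle
      import pandas as pd
      # Load the trained model; the context manager closes the file handle
      with open('models/model.pkl', 'rb') as f:
          model = pickle.load(f)
      X = pd.read_parquet('data/batch_input.parquet')
      # Assumes every column except id is a model feature
      predictions = model.predict(X.drop(columns=['id']))
      pd.DataFrame({'id': X['id'], 'prediction': predictions}).to_parquet('outputs/predictions.parquet', index=False)
      "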
    deps:
      - data/batch_input.parquet
      - models/model.pkl
    outs:
      - outputs/predictions.parquet
  upload_to_warehouse:
    cmd: python scripts/upload_to_snowflake.py outputs/predictions.parquet
    deps:
      - outputs/predictions.parquet
      - scripts/upload_to_snowflake.py

---
# REAL-TIME SERVING: BentoML service
# service.py
import bentoml
import numpy as np
from pydantic import BaseModel

class PredictionInput(BaseModel):
    features: list[float]

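@bentoml.service(resources={"cpu": "0.5", "memory": "512Mi"})
class IrisClassifier:
    # Reference to the stored model; declaring it on the class lets BentoML package it
    model_ref = bentoml.sklearn.get("iris_classifier:latest")

    def __init__(self) -> None:
        # get() returns a model reference, not an estimator, so load it once per worker
        self.model = bentoml.sklearn.load_model(self.model_ref)

    @bentoml.api
    def predict(self, data: PredictionInput) -> dict:
        # Reshape the flat feature list into a single-row 2D array
        X = np.array(data.features).reshape(1, -1)
        prediction = self.model.predict(X)[0]
        return {"prediction": int(prediction)}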

---
# REAL-TIME SERVING: Docker Compose for local testing
# docker-compose.yml
version: '3.8'
services:
  bentoml-server:
    image: bentoml-iris:latest
    ports:
      - "3000:3000"
    environment:
      - BENTOML_MODEL_PATH=/models
    volumes:
      - ./models:/models
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
  nginx:
    image: nginx:latest
    ports:
      - "8000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
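    # Wait until the model server's healthcheck passes (needs Docker Compose v2)
    depends_on:
      bentoml-server:
        condition: service_healthy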

---
# BATCH: Kubernetes CronJob for periodic inference
# batch-inference-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-batch-inference
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          serviceAccountName: batch-inference
          containers:
          - name: batch-inference
            image: batch-inference:v1.2.0
            command:
            # Kubernetes does not expand $(...) itself, so run the script through a shell
            - /bin/sh
            - -c
            - python /app/scripts/batch_predict.py --date "$(date -u +%Y-%m-%d)"
            env:
            - name: MODEL_REGISTRY_URI
              value: http://mlflow-server:5000
            - name: DATA_WAREHOUSE_HOST
              valueFrom:
                secretKeyRef:
                  name: warehouse-credentials
                  key: host
            resources:
              requests:
                cpu: 2
                memory: 4Gi
              limits:
                cpu: 4
                memory: 8Gi
            volumeMounts:
            - name: model-cache
              mountPath: /cache
          volumes:
          - name: model-cache
            emptyDir: {}
          restartPolicy: OnFailure

---
# REAL-TIME: Dockerfile for model server (nvidia/cuda base)
# Dockerfile.realtime
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY service.py .
COPY models/ ./models/
EXPOSE 3000
CMD ["bentoml", "serve", "service:IrisClassifier", "--production"]

Why this order?

The batch pipeline orders stages as: (1) prepare data, (2) run inference, (3) upload results. DVC derives this DAG from each stage's deps and outs and re-runs only the stages whose dependencies changed. The Kubernetes CronJob uses schedule "0 2 * * *" (2 AM daily) so that it runs after the data warehouse's nightly ETL is expected to finish. The real-time Compose stack starts BentoML first, then Nginx as a reverse proxy. A plain depends_on only controls start order; Nginx waits for the model server's healthcheck to pass only when the dependency is declared with condition: service_healthy.
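The "re-run only what changed" behaviour can be illustrated with a small fingerprinting sketch. This is not DVC's implementation; `fingerprint`, `stages_to_run`, and the in-memory `files` and `lock` dictionaries are hypothetical stand-ins for the dependency hashes that DVC records in dvc.lock.

```python
import hashlib


def fingerprint(contents):
    """Hash a {path: bytes} mapping in sorted order so the result is stable."""
    digest = hashlib.sha256()
    for path in sorted(contents):
        digest.update(path.encode("utf-8"))
        digest.update(contents[path])
    return digest.hexdigest()


def stages_to_run(stages, files, lock):
    """Return the names of stages that must re-run, in pipeline order.

    stages: list of (name, deps, outs) tuples, already in DAG order.
    files:  current file contents keyed by path.
    lock:   dependency fingerprints recorded by the previous run, keyed by stage.
    """
    dirty_outs = set()
    to_run = []
    for name, deps, outs in stages:
        deps_changed = fingerprint({d: files[d] for d in deps}) != lock.get(name)
        # A stage also re-runs when an upstream stage will rewrite one of its deps
        if deps_changed or dirty_outs.intersection(deps):
            to_run.append(name)
            dirty_outs.update(outs)
    return to_run


if __name__ == "__main__":
    stages = [
        ("prepare_batch_data", ["data/raw"], ["data/batch_input.parquet"]),
        ("run_batch_inference",
         ["data/batch_input.parquet", "models/model.pkl"],
         ["outputs/predictions.parquet"]),
        ("upload_to_warehouse", ["outputs/predictions.parquet"], []),
    ]
    files = {
        "data/raw": b"raw-v1",
        "data/batch_input.parquet": b"input-v1",
        "models/model.pkl": b"model-v1",
        "outputs/predictions.parquet": b"preds-v1",
    }
    # Record fingerprints as if a full run had just completed
    lock = {name: fingerprint({d: files[d] for d in deps}) for name, deps, _ in stages}
    files["models/model.pkl"] = b"model-v2"  # a new model arrives
    print(stages_to_run(stages, files, lock))
    # prints ['run_batch_inference', 'upload_to_warehouse']
```

Changing only the model skips data preparation but still forces the upload, because the upload consumes an output that inference is about to rewrite.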

Wrong vs Right

Wrong way

yaml

# WRONG: Batch inference without DVC tracking: no reproducibility
cd /data
python predict.py  # Runs but no lineage, no way to re-run from a specific date
cp output.csv /warehouse/  # Manual file copy, easy to forget

# WRONG: Real-time serving without health checks: silent failures
docker run -p 8000:5000 model-server:v1
# Container crashes or hangs but nothing detects it; requests fail silently

# WRONG: Single real-time endpoint for 1M daily predictions: massive cost
curl -X POST http://model-server/predict -d '{"features": [1,2,3]}'  # Called 1M times/day
# Each request spins up inference; costs scale linearly with traffic instead of computing once

Right way

yaml

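# RIGHT: Batch pipeline with DVC: reproducible and versioned
cd /data && dvc repro
# dvc repro has no --date flag; set the run date (e.g., 2026-04-15) in params.yaml or an environment variable read by the stage
# DVC checks deps and re-runs only changed stages; metrics reach MLflow only if your scripts log them
mlflow ui  # View logged runs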

# RIGHT: Real-time with health checks and limits
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/healthz"]
  interval: 10s
  retries: 3
# Docker marks the container unhealthy after 3 failed checks; in Kubernetes, use a livenessProbe so the kubelet restarts it

# RIGHT: Batch predictions for high volume: compute once, query many times
# Schedule DVC pipeline daily, store in database
SELECT prediction FROM predictions WHERE user_id = 42 AND date = CURRENT_DATE
# Cost: one inference per user per day, not per page view
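The "compute once, query many times" pattern can be sketched with an in-memory SQLite table standing in for the warehouse. The table layout mirrors the query above; the function names and sample values are made up for illustration.

```python
import sqlite3
from datetime import date


def store_predictions(conn, rows, run_date):
    """Write one batch of (user_id, prediction) rows for run_date."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "user_id INTEGER, date TEXT, prediction REAL, "
        "PRIMARY KEY (user_id, date))"
    )
    # INSERT OR REPLACE makes a re-run of the same day's batch idempotent
    conn.executemany(
        "INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)",
        [(uid, run_date.isoformat(), pred) for uid, pred in rows],
    )
    conn.commit()


def lookup(conn, user_id, run_date):
    """Return the stored prediction, or None if the batch has no row."""
    row = conn.execute(
        "SELECT prediction FROM predictions WHERE user_id = ? AND date = ?",
        (user_id, run_date.isoformat()),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    today = date.today()
    store_predictions(conn, [(42, 0.87), (43, 0.12)], today)
    print(lookup(conn, 42, today))  # served from the table, no model call
    print(lookup(conn, 99, today))  # no row: the caller needs a fallback
```

The missing-row case is the main design decision in a batch-backed lookup: users who appear after the batch ran need a default, a previous day's value, or a real-time path.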

Tool vitals

Primary command

bash

docker run (real-time) or dvc repro (batch)

Config file docker-compose.yml or dvc.yaml

Verify

bash

curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"data": {"features": [5.1, 3.5, 1.4, 0.2]}}' (real-time) or dvc dag (batch)

Integration notes

Batch pipelines integrate with DVC (data versioning), MLflow (experiment tracking), and Kubernetes (scheduling). DVC tracks input data and model versions; MLflow records metrics for each batch run when the job logs them. Real-time serving integrates with model registries (MLflow, BentoML's built-in model store) and API gateways (Kong, Istio) for routing and rate limiting. Use MLflow's model registry to pull the production model version into both batch jobs (via mlflow.sklearn.load_model()) and real-time services (via bentoml.mlflow.import_model()).

Migration path

Starting with batch? As latency requirements tighten (e.g., user-facing features), build a real-time service alongside batch. Use feature flags to gradually route traffic from batch cache lookups to real-time endpoints. Once real-time proves stable, eliminate batch for that feature. Conversely, if real-time inference costs spiral, segment use cases: keep high-frequency, low-latency features real-time; move medium-frequency features to batch. Use a feature store (Feast, Tecton) as an abstraction layer so you can switch serving patterns without rewriting application code.
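A gradual, flag-driven cutover might look like the sketch below. The names `rollout_bucket` and `get_prediction` are hypothetical, and the two callables stand in for whatever batch lookup and real-time client an application already has.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) using a hash of the id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def get_prediction(user_id, realtime_percent, batch_lookup, realtime_predict):
    """Route one request to the real-time path or the batch cache.

    realtime_percent is the feature-flag value (0 to 100). If the real-time
    call raises, the request falls back to the batch value.
    """
    if rollout_bucket(user_id) < realtime_percent:
        try:
            return "real-time", realtime_predict(user_id)
        except Exception:
            pass  # fall through to the batch cache
    return "batch", batch_lookup(user_id)


if __name__ == "__main__":
    batch = {"u1": 0.30, "u2": 0.70}
    for uid in batch:
        source, value = get_prediction(
            uid,
            realtime_percent=50,
            batch_lookup=batch.get,
            realtime_predict=lambda u: 0.99,  # stand-in for an API call
        )
        print(uid, source, value)
```

Hashing the user id keeps each user on the same path between requests, which makes side-by-side comparison of the two patterns easier during the rollout.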

Cost model

Batch inference: compute cost is predictable (fixed job duration × instance size); storage is cheap (one set of precomputed results per batch). Real-time serving: cost scales with traffic (more requests = more container instances or GPU time). A batch job inferring 1M records costs ~$0.10–$1.00 (1–10 GPU minutes). Real-time serving 1M predictions via API could cost $10–$100 depending on model latency and parallelism. MLflow and DVC are free open-source; enterprise versions add cost.
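The comparison above is simple arithmetic, sketched below. Every price, latency, and utilization figure in the example is an assumption chosen for illustration, not a quote from any provider.

```python
def batch_cost(gpu_minutes: float, gpu_price_per_hour: float) -> float:
    """Cost of one batch job: billed only while the job runs."""
    return gpu_minutes / 60 * gpu_price_per_hour


def realtime_cost(requests: int, latency_s: float, utilization: float,
                  instance_price_per_hour: float) -> float:
    """Cost of serving requests from always-on instances.

    utilization is the fraction of billed time spent on actual inference
    (0 < utilization <= 1); idle capacity is still paid for.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy_hours = requests * latency_s / 3600
    return busy_hours / utilization * instance_price_per_hour


if __name__ == "__main__":
    # Assumed: 10 GPU minutes at $6/hour for the batch job
    print(f"batch:     ${batch_cost(10, 6.0):.2f}")
    # Assumed: 1M requests, 50 ms each, 20% utilization, $0.50/hour instances
    print(f"real-time: ${realtime_cost(1_000_000, 0.05, 0.2, 0.50):.2f}")
```

Utilization is the term that usually dominates: an always-on service sized for peak traffic pays for idle hours that a batch job never incurs.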

Common gotcha

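In a Kubernetes CronJob, $(date +%Y-%m-%d) placed directly in the command or args array is never expanded: Kubernetes does not run the command through a shell, and its own $(VAR) syntax only substitutes environment variables, so the script receives the literal string. Wrap the call in a shell (sh -c 'python ... --date "$(date -u +%Y-%m-%d)"') and use date -u so the date is computed in UTC rather than in the container's local time zone, or have the script default to the current UTC date itself. Separately, the schedule field is interpreted in the time zone of the kube-controller-manager unless you set spec.timeZone. For real-time serving, restartPolicy: OnFailure belongs to Jobs; run the inference server in a Deployment, where pods always use restartPolicy: Always, and add liveness and readiness probes so that a crashed or hung server is restarted.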
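One way to remove the dependency on shell expansion altogether is to let the batch script compute its own default date in UTC. The sketch below assumes a `--date` option like the one passed to batch_predict.py above; the script's real argument handling is not shown in this guide.

```python
import argparse
from datetime import datetime, timezone


def parse_args(argv=None):
    """Parse CLI arguments; --date defaults to the current date in UTC."""
    parser = argparse.ArgumentParser(description="Daily batch prediction job")
    parser.add_argument(
        "--date",
        default=None,
        help="Partition to score as YYYY-MM-DD; defaults to today in UTC.",
    )
    args = parser.parse_args(argv)
    if args.date is None:
        # Independent of the node's or container's local time zone
        args.date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    else:
        # Fail fast on values such as an unexpanded "$(date +%Y-%m-%d)"
        datetime.strptime(args.date, "%Y-%m-%d")
    return args


if __name__ == "__main__":
    print(parse_args().date)
```

With this default, the CronJob can omit `--date` for scheduled runs and pass it explicitly only for backfills.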

Team adoption

Start by documenting your latency and throughput requirements in a decision matrix (5-column: Feature, Latency SLA, Throughput, Freshness Need, Current Pattern). Use this to classify features as Batch, Real-Time, or Hybrid. Enforce via code review: batch pipelines must be wrapped in DVC or Airflow; real-time services must include healthchecks. Pair batch and real-time owners for the first 2–3 features to share patterns. Make it easy to switch: use a feature store or model gateway so application teams don't re-engineer when serving patterns change.
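The healthcheck rule can be checked mechanically before a human reviews the change. The sketch below assumes the Compose file has already been parsed into a dictionary (for example with a YAML library); `services_missing_healthcheck` is a hypothetical helper, not part of any tool.

```python
def services_missing_healthcheck(compose: dict, exempt: tuple = ()) -> list:
    """Return the names of Compose services that declare no healthcheck.

    compose: a docker-compose file already parsed into a dict.
    exempt:  service names that are allowed to skip the check.
    """
    services = compose.get("services") or {}
    return sorted(
        name
        for name, spec in services.items()
        if name not in exempt and "healthcheck" not in (spec or {})
    )


if __name__ == "__main__":
    compose = {
        "services": {
            "bentoml-server": {
                "image": "bentoml-iris:latest",
                "healthcheck": {
                    "test": ["CMD", "curl", "-f", "http://localhost:3000/healthz"],
                },
            },
            "nginx": {"image": "nginx:latest"},
        }
    }
    missing = services_missing_healthcheck(compose)
    if missing:
        print("Services without a healthcheck:", ", ".join(missing))
```

Run in CI, a check like this turns the review rule into a failing build instead of a comment that can be overlooked.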

Experienced dev note

After weeks of batch inference, experienced teams lean on two dvc repro flags when debugging failed stages: --dry prints the stages that would run without executing them, and --no-commit runs them without saving outputs to the DVC cache (dvc.lock is still updated). For real-time, some teams set BENTOML_RUNNERS_CPU_STRICT_MODE=false in production on the theory that strict CPU pinning causes context-switch overhead on multi-tenant clusters; verify that your BentoML version actually reads this variable before relying on it. Also: always log prediction latencies (p50, p95, p99) to Prometheus; a sudden spike in p99 (e.g., 200ms → 2s) signals model size bloat or data skew before it becomes a user-facing incident.
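In production these percentiles would normally come from a Prometheus histogram. The plain-Python sketch below only shows the arithmetic behind p50, p95, and p99 and a simple regression check; the function names and the 3x threshold are assumptions for illustration.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; index i holds the (i + 1)th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def p99_regressed(baseline_ms, current_ms, factor=3.0):
    """Return True when the current p99 exceeds factor times the baseline p99."""
    baseline_p99 = latency_percentiles(baseline_ms)["p99"]
    current_p99 = latency_percentiles(current_ms)["p99"]
    return current_p99 > factor * baseline_p99


if __name__ == "__main__":
    baseline = [100] * 98 + [180, 200]
    current = [100] * 95 + [1500, 1800, 1900, 2000, 2000]
    print(latency_percentiles(baseline))
    print(latency_percentiles(current))
    print("p99 regression:", p99_regressed(baseline, current))
```

Note that the median is identical in both samples; only the tail percentiles reveal the regression, which is why p99 is the value worth alerting on.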

Check your understanding

You have a recommendation system that receives 500 million events per day from users browsing products. Rankings need to refresh every 6 hours to reflect trending items. Predictions are made on a static pool of products. Why is batch inference preferred over real-time serving here, and what would have to change to make real-time necessary?

Answer hint

Batch works because (1) the product pool is static, so the prediction space is bounded and known ahead of time, (2) rankings only need to refresh every 6 hours, so results can be precomputed and cached between runs, and (3) scoring 500M events through a per-request API would cost far more than one batch job per refresh. Real-time becomes necessary if the freshness requirement drops well below the batch interval (e.g., "show the top 10 right now") or if the prediction pool becomes dynamic (e.g., personalized products per user, requiring user-specific features that change mid-session).
