How to do shadow deployment for ML models
Quick answer
Shadow deployment for ML models involves routing live production traffic to a new model version in parallel with the current model without affecting user responses. This allows you to compare outputs and monitor performance before fully switching over. Use techniques like traffic mirroring and logging predictions from both models for analysis.
Prerequisites
- Python 3.8+
- Basic knowledge of ML model serving
- Access to a model serving platform or API
- Logging or monitoring infrastructure (e.g., Datadog, Prometheus)
- The requests library (pip install requests)
Setup
Install the requests library to simulate API calls to your ML models. Ensure you have access to both the current production model endpoint and the new shadow model endpoint. Set environment variables for these endpoints to keep your code clean and secure.
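For example, in a shell (the endpoint URLs below are placeholders, not real endpoints; substitute your own serving infrastructure):

```shell
# Hypothetical endpoints -- replace with your actual model serving URLs
export PROD_MODEL_URL="https://models.example.com/v1/predict"
export SHADOW_MODEL_URL="https://models.example.com/v2-shadow/predict"
```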
```shell
pip install requests
```

Step by step
This example demonstrates a simple shadow deployment where incoming requests are sent to the production model for real user responses, while the same requests are also sent to the shadow model for logging and comparison. The shadow model's output is not returned to users.
```python
import os
import json

import requests

PROD_MODEL_URL = os.environ["PROD_MODEL_URL"]
SHADOW_MODEL_URL = os.environ["SHADOW_MODEL_URL"]

# Simulate a user request payload
user_input = {"text": "What is the weather today?"}

# Send request to production model; this response goes back to the user
prod_response = requests.post(PROD_MODEL_URL, json=user_input)
prod_output = prod_response.json()

# Send the same request to the shadow model with a short timeout so it
# cannot slow the production path; failures are swallowed, never surfaced
try:
    shadow_response = requests.post(SHADOW_MODEL_URL, json=user_input, timeout=0.5)
    shadow_output = shadow_response.json()
except requests.exceptions.RequestException:
    shadow_output = None

# Log both outputs for offline analysis
print("Production model output:", json.dumps(prod_output))
print("Shadow model output:", json.dumps(shadow_output))

# Return production output to user
print("Response to user:", prod_output)
```

Output
Production model output: {"answer": "It's sunny and 75 degrees."}
Shadow model output: {"answer": "The weather is sunny with a temperature of 75°F."}
Response to user: {"answer": "It's sunny and 75 degrees."}

Common variations
- Async calls: Use asynchronous HTTP clients like httpx or aiohttp to avoid blocking production responses.
- Streaming: For streaming models, mirror streams to shadow models with buffering.
- Different SDKs: Use cloud provider SDKs (AWS SageMaker, Azure ML) to route traffic for shadow deployments.
- Shadow traffic percentage: Instead of mirroring all traffic, send a sample (e.g., 10%) to the shadow model.
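The sampling and async variations above can be combined. Here is a minimal sketch using a background thread pool so the shadow call never blocks the request path; SHADOW_MODEL_URL and the 10% sample rate are assumptions for illustration:

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical shadow endpoint; set SHADOW_MODEL_URL in your environment
SHADOW_MODEL_URL = os.environ.get("SHADOW_MODEL_URL", "http://localhost:9000/predict")
SHADOW_SAMPLE_RATE = 0.10  # mirror roughly 10% of traffic

# Small pool so shadow calls run off the request path
_executor = ThreadPoolExecutor(max_workers=4)

def should_shadow(rate: float = SHADOW_SAMPLE_RATE) -> bool:
    """Decide whether this request is mirrored to the shadow model."""
    return random.random() < rate

def _call_shadow(payload: dict) -> None:
    """Best-effort shadow call; all errors are swallowed."""
    try:
        requests.post(SHADOW_MODEL_URL, json=payload, timeout=0.5)
    except requests.exceptions.RequestException:
        pass  # never let shadow failures affect users

def mirror_if_sampled(payload: dict) -> None:
    """Fire-and-forget: submit the shadow call to a background thread."""
    if should_shadow():
        _executor.submit(_call_shadow, payload)
```

In your request handler, call mirror_if_sampled(user_input) right after (or before) the production call; the handler returns the production output without waiting on the shadow call.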
Troubleshooting
- If shadow model calls time out, reduce timeout or run calls asynchronously to avoid slowing production.
- If logs show mismatched inputs, verify payload serialization matches both models' expected formats.
- If shadow model crashes, isolate it from production traffic and fix bugs before resuming shadow deployment.
- Monitor latency impact carefully to ensure shadow deployment does not degrade user experience.
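To make the output comparison from the troubleshooting points concrete, a small helper like the sketch below can log disagreements for offline analysis. Exact-match comparison is an assumption here; for probabilistic models you would compare with a tolerance or a task-specific metric:

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("shadow_compare")

def compare_outputs(prod: dict, shadow: Optional[dict]) -> bool:
    """Return True when the shadow output matches production; log disagreements."""
    if shadow is None:
        logger.warning("shadow call failed; no output to compare")
        return False
    match = prod == shadow
    if not match:
        # Serialize both payloads so mismatches are searchable in logs
        logger.info("mismatch: prod=%s shadow=%s", json.dumps(prod), json.dumps(shadow))
    return match
```

Tracking the match rate over time gives an early signal of regressions before the shadow model is promoted.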
Key takeaways
- Shadow deployment routes live traffic to a new model without affecting user responses for safe testing.
- Log and compare outputs from production and shadow models to detect regressions or improvements.
- Use asynchronous calls or traffic sampling to minimize latency impact on production.
- Monitor shadow model performance and errors separately to avoid production disruptions.