How to do shadow deployment for ML models
Quick answer
Shadow deployment for ML models involves routing live production traffic to a new model version in parallel with the current model without affecting user responses. This allows you to compare outputs and monitor performance before fully switching over. Use techniques like traffic mirroring and logging predictions from both models for analysis.
Prerequisites
- Python 3.8+
- Basic knowledge of ML model serving
- Access to a model serving platform or API
- Logging or monitoring infrastructure (e.g., Datadog, Prometheus)
- The requests library (pip install requests)
Setup
Install the requests library to simulate API calls to your ML models. Ensure you have access to both the current production model endpoint and the new shadow model endpoint. Set environment variables for these endpoints to keep your code clean and secure.
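For example, in a shell (the endpoint URLs below are placeholders, not real endpoints; substitute your own serving infrastructure):

```shell
# Hypothetical endpoints -- replace with your actual model serving URLs
export PROD_MODEL_URL="https://models.example.com/v1/predict"
export SHADOW_MODEL_URL="https://models.example.com/v2-shadow/predict"
```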
```shell
pip install requests
```

Step by step
This example demonstrates a simple shadow deployment where incoming requests are sent to the production model for real user responses, while the same requests are also sent to the shadow model for logging and comparison. The shadow model's output is not returned to users.
```python
import os
import json

import requests

PROD_MODEL_URL = os.environ["PROD_MODEL_URL"]
SHADOW_MODEL_URL = os.environ["SHADOW_MODEL_URL"]

# Simulate a user request payload
user_input = {"text": "What is the weather today?"}

# Send request to production model; this response goes back to the user
prod_response = requests.post(PROD_MODEL_URL, json=user_input)
prod_output = prod_response.json()

# Send the same request to the shadow model with a short timeout so it
# cannot slow the production path; failures are swallowed, never surfaced
try:
    shadow_response = requests.post(SHADOW_MODEL_URL, json=user_input, timeout=0.5)
    shadow_output = shadow_response.json()
except requests.exceptions.RequestException:
    shadow_output = None

# Log both outputs for offline analysis
print("Production model output:", json.dumps(prod_output))
print("Shadow model output:", json.dumps(shadow_output))

# Return production output to user
print("Response to user:", prod_output)
```

Output
Production model output: {"answer": "It's sunny and 75 degrees."}
Shadow model output: {"answer": "The weather is sunny with a temperature of 75°F."}
Response to user: {"answer": "It's sunny and 75 degrees."}

Common variations
- Async calls: Use asynchronous HTTP clients like httpx or aiohttp to avoid blocking production responses.
- Streaming: For streaming models, mirror streams to shadow models with buffering.
- Different SDKs: Use cloud provider SDKs (AWS SageMaker, Azure ML) to route traffic for shadow deployments.
- Shadow traffic percentage: Instead of mirroring all traffic, send a sample (e.g., 10%) to the shadow model.
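The sampling and async variations above can be combined. Here is a minimal sketch using a background thread pool so the shadow call never blocks the request path; SHADOW_MODEL_URL and the 10% sample rate are assumptions for illustration:

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical shadow endpoint; set SHADOW_MODEL_URL in your environment
SHADOW_MODEL_URL = os.environ.get("SHADOW_MODEL_URL", "http://localhost:9000/predict")
SHADOW_SAMPLE_RATE = 0.10  # mirror roughly 10% of traffic

# Small pool so shadow calls run off the request path
_executor = ThreadPoolExecutor(max_workers=4)

def should_shadow(rate: float = SHADOW_SAMPLE_RATE) -> bool:
    """Decide whether this request is mirrored to the shadow model."""
    return random.random() < rate

def _call_shadow(payload: dict) -> None:
    """Best-effort shadow call; all errors are swallowed."""
    try:
        requests.post(SHADOW_MODEL_URL, json=payload, timeout=0.5)
    except requests.exceptions.RequestException:
        pass  # never let shadow failures affect users

def mirror_if_sampled(payload: dict) -> None:
    """Fire-and-forget: submit the shadow call to a background thread."""
    if should_shadow():
        _executor.submit(_call_shadow, payload)
```

In your request handler, call mirror_if_sampled(user_input) right after (or before) the production call; the handler returns the production output without waiting on the shadow call.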
Troubleshooting
- If shadow model calls time out, reduce timeout or run calls asynchronously to avoid slowing production.
- If logs show mismatched inputs, verify payload serialization matches both models' expected formats.
- If shadow model crashes, isolate it from production traffic and fix bugs before resuming shadow deployment.
- Monitor latency impact carefully to ensure shadow deployment does not degrade user experience.
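To make the output comparison from the troubleshooting points concrete, a small helper like the sketch below can log disagreements for offline analysis. Exact-match comparison is an assumption here; for probabilistic models you would compare with a tolerance or a task-specific metric:

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("shadow_compare")

def compare_outputs(prod: dict, shadow: Optional[dict]) -> bool:
    """Return True when the shadow output matches production; log disagreements."""
    if shadow is None:
        logger.warning("shadow call failed; no output to compare")
        return False
    match = prod == shadow
    if not match:
        # Serialize both payloads so mismatches are searchable in logs
        logger.info("mismatch: prod=%s shadow=%s", json.dumps(prod), json.dumps(shadow))
    return match
```

Tracking the match rate over time gives an early signal of regressions before the shadow model is promoted.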
Key takeaways
- Shadow deployment routes live traffic to a new model without affecting user responses for safe testing.
- Log and compare outputs from production and shadow models to detect regressions or improvements.
- Use asynchronous calls or traffic sampling to minimize latency impact on production.
- Monitor shadow model performance and errors separately to avoid production disruptions.