PyTorch in production: not just research anymore
Why this matters
Research code that trains models is fundamentally different from code that serves predictions. You'll waste weeks debugging production issues if you deploy a model the way you trained it: wrong precision, wrong device handling, wrong shapes, slow inference.
Explanation
The problem: a model left in model.train() mode keeps dropout active and makes batch norm normalize with per-batch statistics, updating its running averages as it goes. Gradient tracking is a separate switch that is on by default regardless of mode. Serving a model in this state is slower and more memory-hungry than it needs to be, and it is non-deterministic: the same input can produce different answers.
How it works: before deployment, call model.eval() so that dropout is disabled and batch norm uses its stored running statistics, wrap inference in torch.no_grad() so that no autograd graph is recorded, save the weights with torch.save(model.state_dict(), path), and load them in a clean Python process. This keeps your inference code stateless, fast, and reproducible.
When to use it: every time you move code from notebooks to servers, APIs, or edge devices.
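The batch norm half of the explanation is easy to check in isolation. The following is a minimal sketch, not part of the original lesson: the layer size, the batch shape, and the variable names are illustrative assumptions. It shows that a forward pass in train() mode changes the layer's running mean, while a forward pass in eval() mode leaves it untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)                  # illustrative layer with 3 features
batch = torch.randn(8, 3) * 2.0 + 5.0   # illustrative data with a clearly non-zero mean

# Train mode: the forward pass normalizes with batch statistics
# and updates the running averages as a side effect.
bn.train()
before = bn.running_mean.clone()
bn(batch)
after_train = bn.running_mean.clone()
train_updates_stats = not torch.equal(before, after_train)

# Eval mode: the forward pass uses the stored running statistics
# and does not modify them.
bn.eval()
with torch.no_grad():
    bn(batch)
eval_keeps_stats = torch.equal(after_train, bn.running_mean)

print(f"train() changed running_mean: {train_updates_stats}")
print(f"eval() left running_mean alone: {eval_keeps_stats}")
```

Both lines should print True. This side effect is why a forgotten model.eval() can make a served model drift: every request would nudge the normalization statistics.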
Analogy
Training a model is like rehearsing a play: actors experiment, make mistakes, improve. Deploying is like the live performance: no rewrites, no second takes, same output every night. You don't rehearse the same way you perform.
Code
import os
import tempfile

import torch
import torch.nn as nn


class SimpleNet(nn.Module):
    """Tiny network with dropout so train/eval differences are visible."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(5, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x


model = SimpleNet()
x_sample = torch.randn(1, 10)  # one sample with 10 features

# Training mode: dropout is active, so repeated calls usually disagree.
model.train()
print("=== TRAINING MODE ===")
output_train_1 = model(x_sample)
output_train_2 = model(x_sample)
print(f"Output 1: {output_train_1}")
print(f"Output 2: {output_train_2}")
print(f"Same output? {torch.allclose(output_train_1, output_train_2)}")
print()

# Eval mode plus no_grad: dropout is off and no autograd graph is recorded.
model.eval()
print("=== EVAL MODE ===")
with torch.no_grad():
    output_eval_1 = model(x_sample)
    output_eval_2 = model(x_sample)
print(f"Output 1: {output_eval_1}")
print(f"Output 2: {output_eval_2}")
print(f"Same output? {torch.allclose(output_eval_1, output_eval_2)}")
print()

# Deployment: save only the weights, rebuild the architecture, load the weights.
print("=== DEPLOYMENT (SAVE AND LOAD) ===")
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "model.pt")
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")

    loaded_model = SimpleNet()
    # map_location="cpu" lets weights saved on a GPU load on a CPU-only server.
    loaded_model.load_state_dict(torch.load(model_path, map_location="cpu"))
    loaded_model.eval()
    with torch.no_grad():
        output_loaded = loaded_model(x_sample)
    print(f"Output from loaded model: {output_loaded}")
    print(f"Matches eval output? {torch.allclose(output_loaded, output_eval_1)}")

Sample output (the numbers are illustrative and change on every run):

=== TRAINING MODE ===
Output 1: tensor([[-0.4567, 0.2345]], grad_fn=<AddmmBackward0>)
Output 2: tensor([[-0.2123, 0.4567]], grad_fn=<AddmmBackward0>)
Same output? False

=== EVAL MODE ===
Output 1: tensor([[-0.1234, 0.6789]])
Output 2: tensor([[-0.1234, 0.6789]])
Same output? True

=== DEPLOYMENT (SAVE AND LOAD) ===
Model saved to /tmp/.../model.pt
Output from loaded model: tensor([[-0.1234, 0.6789]])
Matches eval output? True
What just happened?
We created a model with dropout and ran it twice in train() mode. The two outputs almost always differ because dropout randomly zeroes activations on each call (with only five hidden units, two identical dropout masks are possible but unlikely). Then we called eval(), wrapped the calls in torch.no_grad(), ran the model twice, and got identical outputs. Finally, we saved the weights to disk with torch.save(), loaded them into a fresh instance, and confirmed that it produces the same deterministic output: exactly what you need in production.
Common gotcha
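The most common mistake: calling model.eval() but forgetting torch.no_grad(). eval() only changes how layers such as dropout and batch norm behave; it does not turn off autograd. Every forward pass still records a computation graph and keeps intermediate activations alive for a backward pass that never comes, which wastes memory and time. Inference endpoints end up slower than they should be, and they can run out of memory on large batches or when outputs are kept around. Always wrap inference in with torch.no_grad():. Second gotcha: saving the entire model instead of model.state_dict(). torch.save(model) pickles the whole object, which is fragile across code changes and Python versions, while state_dict() saves only the parameters and buffers, which is portable and safe.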
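The first gotcha can be checked directly. The sketch below uses a stand-in nn.Linear model (an assumption for illustration, not the lesson's SimpleNet) and inspects the output tensor: in eval mode alone the output still carries a grad_fn, which means a graph was recorded, while under torch.no_grad() it does not.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in model for illustration
model.eval()               # changes layer behavior only; autograd stays on
x = torch.randn(1, 10)

# Eval mode without no_grad: the output is still attached to a graph.
tracked = model(x)

# Eval mode with no_grad: no graph is recorded.
with torch.no_grad():
    untracked = model(x)

print(f"eval only -> requires_grad={tracked.requires_grad}, "
      f"has grad_fn={tracked.grad_fn is not None}")
print(f"eval + no_grad -> requires_grad={untracked.requires_grad}, "
      f"has grad_fn={untracked.grad_fn is not None}")

# The numbers themselves are the same; only the bookkeeping differs.
same_values = torch.allclose(tracked.detach(), untracked)
```

The values agree in both cases, so this mistake never shows up as a wrong prediction. It only shows up as extra memory and latency, which is why it tends to go unnoticed.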
Error recovery
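RuntimeError: expected scalar type Float but found Double. Typically the input tensor is float64 (for example, created from a NumPy array) while the weights are float32. Cast the input with .float() before calling the model.
RuntimeError: CUDA out of memory. Wrap inference in torch.no_grad(), reduce the batch size, and avoid holding references to old outputs.
RuntimeError: Unexpected key(s) in state_dict. load_state_dict() raises this as a RuntimeError, not a ValueError, when the checkpoint's keys do not match the model. Check that you rebuilt the same architecture, and look for a "module." prefix left behind by nn.DataParallel.
Experienced dev note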
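In production, serialize model.state_dict() and the model architecture separately; never pickle the entire model object. A pickled model stores a reference to its class by import path rather than the class itself, so it can fail to load when the serving code is refactored or when the container runs a different Python or PyTorch version than the notebook did (say, Python 3.9 versus 3.11). Save the architecture as code (or in a config) and save the weights as a .pt file. Second insight: for portability and speed, consider torch.jit.trace(model, example_input) or torch.export (available from PyTorch 2.1). Both capture a graph of the model rather than just the weights, so the saved artifact can run without the original Python class. Speedups depend heavily on the model and hardware, so benchmark before counting on the 2-3x that is sometimes quoted.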
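One way to keep architecture and weights separate is sketched below. It assumes the architecture can be described by a few integers; build_model, the config keys, and the file names are hypothetical names chosen for this illustration, not part of any PyTorch API.

```python
import json
import os
import tempfile

import torch
import torch.nn as nn


def build_model(config):
    # Hypothetical factory: rebuilds the architecture from plain config values.
    return nn.Sequential(
        nn.Linear(config["in_features"], config["hidden"]),
        nn.ReLU(),
        nn.Linear(config["hidden"], config["out_features"]),
    )


config = {"in_features": 10, "hidden": 5, "out_features": 2}  # illustrative sizes
model = build_model(config).eval()

with tempfile.TemporaryDirectory() as tmpdir:
    config_path = os.path.join(tmpdir, "config.json")
    weights_path = os.path.join(tmpdir, "weights.pt")

    # Export side: architecture as JSON, weights as a state_dict.
    with open(config_path, "w") as f:
        json.dump(config, f)
    torch.save(model.state_dict(), weights_path)

    # Serving side: rebuild from the config, then load the weights.
    with open(config_path) as f:
        restored = build_model(json.load(f))
    restored.load_state_dict(torch.load(weights_path, map_location="cpu"))
    restored.eval()

    x = torch.randn(1, 10)
    with torch.no_grad():
        same = torch.allclose(model(x), restored(x))

print(f"Restored model matches original: {same}")
```

Nothing in either file depends on how the training code was organized, so the serving side only needs its own copy of the factory function.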
Check your understanding
If you load a saved model, call model.eval(), but then run inference inside a training loop without torch.no_grad(), what will happen to memory usage and why? (Hint: think about what gets stored when gradients are enabled.)
Answer hint
A correct answer explains that even in eval mode, if gradients are enabled, PyTorch builds a computational graph for every forward pass and keeps the intermediate activations alive. In a loop, any outputs that are retained (for example, appended to a list) keep their graphs alive too, so memory grows with every iteration. The answer must mention that torch.no_grad() prevents graph building, not just that it's 'faster'.
torch.export is the newer alternative to torch.jit.trace for production deployment. torch.jit.trace still works, but torch.export is now the recommended path for maximum compatibility. torch.export requires PyTorch 2.1 or later, while torch.jit.trace has been available since 1.0. The example in this section uses the state_dict approach, which works on all versions >= 1.0.
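For readers who want to try the tracing route, here is a minimal sketch using torch.jit.trace. The model, the example input, and the file name are illustrative assumptions. torch.export is not shown because its availability depends on the installed PyTorch version.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Illustrative model; tracing records the operations run on example_input.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
model.eval()
example_input = torch.randn(1, 10)

traced = torch.jit.trace(model, example_input)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model_traced.pt")  # hypothetical file name
    traced.save(path)

    # Loading needs only torch, not the Python code that defined the model.
    loaded = torch.jit.load(path, map_location="cpu")
    with torch.no_grad():
        matches = torch.allclose(model(example_input), loaded(example_input))

print(f"Traced model matches eager model: {matches}")
```

Tracing records only the path taken for the example input, so a model whose control flow depends on the input data may not be captured faithfully. Comparing traced and eager outputs on several representative inputs is a sensible check before shipping.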