Code Beginner easy · 5 min

PyTorch in production: not just research anymore

What you will learn

PyTorch models need to be serialized, deployed without training mode, and traced for inference speed: here's how to do it right.

Why this matters

Research code that trains models is fundamentally different from code that serves predictions. You'll waste weeks debugging production issues if you deploy a model the way you trained it: wrong precision, wrong device handling, wrong shapes, slow inference.

Skip if: you're only running one-off training scripts on your local machine or a shared lab GPU. The moment you deploy to a server, API, or embedded device, you need this.

Explanation

The problem: A model left in model.train() mode keeps dropout active and makes batch norm normalize with per-batch statistics, updating its running averages as it goes. Separately, autograd records a graph on every forward pass unless you turn it off. Deployed as-is, such a model is slower and more memory-hungry than it needs to be, and it is non-deterministic: the same input can produce different answers.

How it works: Before deployment, call model.eval() so dropout is disabled and batch norm uses its stored running statistics, wrap inference in torch.no_grad() to disable gradient tracking, save the weights with torch.save(model.state_dict(), path), and load them in a clean Python process. This keeps your inference code stateless, fast, and reproducible.

When to use it: Every time you move code from notebooks to servers, APIs, or edge devices.

Analogy

Training a model is like rehearsing a play: actors experiment, make mistakes, improve. Deploying is like the live performance: no rewrites, no second takes, same output every night. You don't rehearse the same way you perform.

Code

python

import torch
import torch.nn as nn
import tempfile
import os

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(5, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = SimpleNet()
model.train()  # training mode is the default for a new module; set here for clarity

x_sample = torch.randn(1, 10)

print("=== TRAINING MODE ===")
output_train_1 = model(x_sample)
output_train_2 = model(x_sample)
print(f"Output 1: {output_train_1}")
print(f"Output 2: {output_train_2}")
print(f"Same output? {torch.allclose(output_train_1, output_train_2)}")
print()

model.eval()
print("=== EVAL MODE ===")
with torch.no_grad():
    output_eval_1 = model(x_sample)
    output_eval_2 = model(x_sample)
print(f"Output 1: {output_eval_1}")
print(f"Output 2: {output_eval_2}")
print(f"Same output? {torch.allclose(output_eval_1, output_eval_2)}")
print()

print("=== DEPLOYMENT (SAVE AND LOAD) ===")
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "model.pt")
    # Save only the parameters and buffers, not the pickled model object.
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")

    # Rebuild the architecture from code, then load the saved weights into it.
    loaded_model = SimpleNet()
    loaded_model.load_state_dict(torch.load(model_path))
    loaded_model.eval()  # a fresh module starts in training mode

    with torch.no_grad():
        output_loaded = loaded_model(x_sample)
    print(f"Output from loaded model: {output_loaded}")
    print(f"Matches eval output? {torch.allclose(output_loaded, output_eval_1)}")

Output

=== TRAINING MODE ===
Output 1: tensor([[-0.4567,  0.2345]], grad_fn=<AddmmBackward0>)
Output 2: tensor([[-0.2123,  0.4567]], grad_fn=<AddmmBackward0>)
Same output? False

=== EVAL MODE ===
Output 1: tensor([[-0.1234,  0.6789]])
Output 2: tensor([[-0.1234,  0.6789]])
Same output? True

=== DEPLOYMENT (SAVE AND LOAD) ===
Model saved to /tmp/.../model.pt
Output from loaded model: tensor([[-0.1234,  0.6789]])
Matches eval output? True

What just happened?

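We created a model with dropout and ran it twice in <code>train()</code> mode. The outputs differed because dropout zeroes a random subset of activations on every forward pass. (With only five hidden units the two random masks can occasionally coincide, so a rare True here is not a bug.) Then we called <code>eval()</code>, wrapped inference in <code>torch.no_grad()</code>, ran it twice, and got identical outputs. Finally, we saved the model's state_dict to disk using <code>torch.save()</code>, loaded it into a fresh <code>SimpleNet</code> instance, and confirmed it reproduces the eval output: exactly what you need in production. The exact numbers will differ on your machine because the weights are randomly initialized.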
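On a server you often don't know which device the weights were saved from. The sketch below shows one way to handle that, assuming you load onto the CPU first with map_location and then move the model to whatever device is available. The nn.Linear model and the in-memory buffer are stand-ins for illustration; a real service would load from a file path.

```python
import io

import torch
import torch.nn as nn

# Stand-in for a trained model; the in-memory buffer replaces a file on disk.
trained = nn.Linear(10, 2)
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Load onto the CPU regardless of where the weights were saved from,
# then move the model to the device this process actually has.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state = torch.load(buffer, map_location="cpu")

serving_model = nn.Linear(10, 2)
serving_model.load_state_dict(state)
serving_model.to(device).eval()

# Inputs must live on the same device as the model.
x = torch.randn(1, 10).to(device)
with torch.no_grad():
    y = serving_model(x)

print(y.shape)   # torch.Size([1, 2])
print(y.device)  # cpu, or cuda:0 when a GPU is available
```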

Common gotcha

The most common mistake: calling model.eval() but forgetting torch.no_grad(). eval() only changes layer behavior; autograd still records a graph and keeps intermediate activations alive for a backward pass that never comes, wasting memory and time. Your inference endpoints get mysteriously slow and can run out of memory. Always wrap inference in with torch.no_grad():. Second gotcha: saving the entire model instead of model.state_dict(). torch.save(model) pickles the whole object, which ties the file to the exact class definition and import path (fragile across refactors and environment changes), while state_dict() saves only parameters and buffers (portable, safe).
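The first gotcha is easy to check for yourself. This is a minimal sketch with a hypothetical stand-in model: it inspects requires_grad and grad_fn on the output to see whether autograd recorded a graph.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; any nn.Module behaves the same way here.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
model.eval()  # changes layer behavior (dropout, batch norm), not autograd

x = torch.randn(1, 10)

# eval() alone: the output is still attached to an autograd graph.
tracked = model(x)
print(tracked.requires_grad)        # True
print(tracked.grad_fn is not None)  # True

# no_grad(): no graph is recorded for this forward pass.
with torch.no_grad():
    untracked = model(x)
print(untracked.requires_grad)      # False
print(untracked.grad_fn is None)    # True

# The numbers are the same either way; only the bookkeeping differs.
print(torch.allclose(tracked, untracked))  # True
```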

Error recovery

RuntimeError: expected scalar type Float but found Double

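The input's dtype doesn't match the model's weights: typically a float64 input (for example, a tensor built from a NumPy array) hitting float32 weights. Newer PyTorch versions may word this as a dtype mismatch between Double and Float. Fix: cast the input with <code>x_sample.float()</code>, or cast the model with <code>loaded_model.to(torch.float32)</code>, so both sides share one dtype.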
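A small reproduction of the dtype mismatch, as a sketch: the float64 input is created directly here, and the exact error wording depends on your PyTorch version, so the code prints whatever message it gets rather than assuming one.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 2)  # parameters are float32 by default
print(layer.weight.dtype)  # torch.float32

# A float64 input, as you would get from torch.from_numpy() on a default NumPy array.
x64 = torch.randn(1, 10, dtype=torch.float64)

raised = False
try:
    layer(x64)
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Fix: cast the input to the dtype of the weights before the forward pass.
x32 = x64.to(layer.weight.dtype)
with torch.no_grad():
    out = layer(x32)
print(out.dtype)  # torch.float32
```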

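RuntimeError: CUDA out of memory

Gradient tracking is still on during inference, so every forward pass keeps its activations for a backward pass that never runs, and any outputs you store keep their whole graph alive. Fix: wrap your inference loop in <code>with torch.no_grad():</code>, and call <code>model.eval()</code> first so dropout and batch norm behave correctly. If the error persists, reduce the batch size.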
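One way to make it hard to forget either step is to put both inside a helper. The predict function below is a hypothetical name, not a PyTorch API; it is a sketch that uses torch.no_grad() as a decorator and moves each result to the CPU so collected outputs don't pile up on the GPU.

```python
import torch
import torch.nn as nn


@torch.no_grad()  # gradient tracking is off for the whole call
def predict(model, batches):
    """Run `model` over an iterable of input batches and return the stacked outputs."""
    model.eval()  # note: the model stays in eval mode after this call
    outputs = []
    for batch in batches:
        # .cpu() is a no-op on CPU; on a GPU it frees device memory per batch.
        outputs.append(model(batch).cpu())
    return torch.cat(outputs)


# Hypothetical stand-in model and data.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Dropout(0.5), nn.Linear(5, 2))
batches = [torch.randn(2, 10) for _ in range(3)]

predictions = predict(model, batches)
print(predictions.shape)          # torch.Size([6, 2])
print(predictions.requires_grad)  # False
```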

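RuntimeError: Error(s) in loading state_dict for SimpleNet: Unexpected key(s) in state_dict

The checkpoint contains parameter names that the model you built does not have, usually because it was saved from a different architecture. Fix: ensure the model you create matches the one you saved: same layers, same names, same structure.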
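To see exactly which keys disagree, load_state_dict(..., strict=False) reports the mismatches instead of raising. The sketch below simulates one common cause as an assumption for illustration: a checkpoint saved from a model wrapped in nn.DataParallel, which prefixes every key with "module.". The make_model helper is hypothetical.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def make_model():
    """Hypothetical architecture with named layers."""
    return nn.Sequential(OrderedDict([
        ("fc1", nn.Linear(10, 5)),
        ("fc2", nn.Linear(5, 2)),
    ]))


# Simulated checkpoint whose keys carry a "module." prefix.
saved = make_model().state_dict()
prefixed = {f"module.{key}": value for key, value in saved.items()}

model = make_model()
strict_error = None
try:
    model.load_state_dict(prefixed)  # strict=True is the default
except RuntimeError as err:
    strict_error = err
    print(str(err).splitlines()[0])

# strict=False returns the mismatches so you can inspect them.
result = model.load_state_dict(prefixed, strict=False)
print(sorted(result.missing_keys))
print(sorted(result.unexpected_keys))

# Fix for this particular cause: strip the prefix, then load strictly.
prefix = "module."
cleaned = {
    (key[len(prefix):] if key.startswith(prefix) else key): value
    for key, value in prefixed.items()
}
final = model.load_state_dict(cleaned)
print(final)
```

Treat strict=False as a diagnostic: it silently skips every key it cannot match, so a model loaded that way may be running on partly random weights.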

Experienced dev note

In production, always keep the weights (model.state_dict()) and the model architecture separate: never pickle the entire model object. A pickled model refers to its class by import path, so it breaks when the class is renamed or moved, or when the serving environment differs from the notebook's (say, Python 3.9 there and 3.11 in the container, with different library versions). Save the architecture as code (or in a config) and save the weights as .pt. Second insight: use torch.jit.trace(model, example_input) or torch.export (added in PyTorch 2.1) for portability and speed. JIT gives you a compiled graph, not just the weights, so the saved file can be loaded without the original class definition. Speedups depend on the model and hardware, so benchmark rather than assume one.
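A minimal tracing sketch, assuming a small stand-in model and an in-memory buffer instead of a file. Tracing records the operations executed for the example input, so data-dependent control flow in forward() is not captured; compare the traced output against the original before shipping.

```python
import io

import torch
import torch.nn as nn

# Hypothetical stand-in model; put it in eval mode before tracing.
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Dropout(0.5), nn.Linear(5, 2))
model.eval()

example_input = torch.randn(1, 10)
traced = torch.jit.trace(model, example_input)

# Save and reload the traced module; a file path works in place of the buffer.
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)  # needs no Python class definition

with torch.no_grad():
    expected = model(example_input)
    actual = restored(example_input)
print(torch.allclose(expected, actual))  # True
```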

Check your understanding

If you load a saved model, call model.eval(), but then run inference in a long-running loop without torch.no_grad(), what will happen to memory usage and why? (Hint: think about what gets stored when gradients are enabled.)

Show answer hint

A correct answer explains that eval mode does not turn off autograd: with gradients enabled, PyTorch builds a computational graph on every forward pass and keeps the intermediate activations alive for as long as the output is referenced. In a loop that stores outputs (or anything derived from them), these graphs accumulate in memory. The answer must mention that <code>torch.no_grad()</code> prevents graph building, not just that it's 'faster'.

VERSION torch.jit.trace has been available since PyTorch 1.0. torch.export was added in PyTorch 2.1 as the newer graph-capture path for deployment; its API has changed between releases, so check the documentation for the version you run. This example uses the stable state_dict approach, which works on all versions >= 1.0.

Learn how to convert your trained model to TorchScript or use torch.compile() to actually deploy it faster: state_dict saves the model, but compilation and tracing make it production-grade.
