Running without Python: C++ runtime
Why this matters
Production inference at scale often benefits from removing Python's GIL bottleneck and interpreter overhead. Mobile, edge, and high-throughput server deployments frequently need compiled binaries that do not depend on a Python environment or pay its startup latency.
Explanation
TorchScript is an intermediate representation (IR) of a PyTorch model that can run outside Python. When you export a model via torch.jit.script() or torch.jit.trace(), PyTorch converts it into a graph that can be serialized to a file and executed by the interpreter in the C++ runtime (LibTorch), with no Python needed.
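As a minimal sketch of the two export paths, the snippet below scripts and traces the same small module, then round-trips the scripted version through an in-memory buffer. TinyNet and its layer sizes are hypothetical and chosen only for illustration.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical two-layer-free model used only for illustration."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))


model = TinyNet().eval()
example = torch.randn(1, 4)

# Path 1: compile the Python source of forward() into TorchScript.
scripted = torch.jit.script(model)
# Path 2: run the model once on an example input and record the operations.
traced = torch.jit.trace(model, example)

# Serialize to a file-like object and load it back, as a deployment would from disk.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)

with torch.no_grad():
    same = torch.allclose(model(example), restored(example))

print("scripted is ScriptModule:", isinstance(scripted, torch.jit.ScriptModule))
print("traced is ScriptModule:", isinstance(traced, torch.jit.ScriptModule))
print("restored output matches eager output:", same)
```

Both paths yield a ScriptModule that can be saved the same way; which one is appropriate depends on whether forward() contains Python control flow.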
The C++ runtime works like this: you load the serialized model file with torch::jit::load(), create tensor inputs using ATen (PyTorch's C++ tensor library), run inference via forward(), and extract the outputs, all without touching a Python interpreter. Any speedup or memory saving over eager Python execution depends on the model, hardware, and workload. Figures such as 10-50% faster execution or a 3-5x smaller runtime memory footprint are rough rules of thumb, so measure on your own model.
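Performance claims are easiest to check by measuring. The sketch below is a rough timing harness, not a rigorous benchmark: the helper mean_latency_ms, the model, and the iteration counts are all assumptions made for illustration. It calls the TorchScript module from Python, so it shows only the graph-execution side, not the startup or memory behaviour of a real C++ process.

```python
import time

import torch
import torch.nn as nn

# Hypothetical small model; substitute your own.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5)).eval()
example = torch.randn(1, 10)
traced = torch.jit.trace(model, example)


def mean_latency_ms(fn, x, warmup=20, iters=200):
    """Average wall-clock time per call in milliseconds (illustrative helper)."""
    with torch.no_grad():
        for _ in range(warmup):  # warm-up runs let one-time optimizations settle
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1e3


eager_ms = mean_latency_ms(model, example)
traced_ms = mean_latency_ms(traced, example)

with torch.no_grad():
    outputs_match = torch.allclose(model(example), traced(example))

print(f"eager:  {eager_ms:.4f} ms per call")
print(f"traced: {traced_ms:.4f} ms per call")
print("outputs match:", outputs_match)
```

For a model this small the difference may be negligible or even reversed; the point is to have a repeatable measurement before committing to a migration.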
Use this when deploying to production servers where Python startup is measured in hundreds of milliseconds, or when you're building microservices that need sub-5ms latency per request. It's also essential for mobile and embedded deployment where Python isn't available.
Analogy
It's like compiling a Java program to native code with GraalVM. Your source code (Python model) becomes bytecode (TorchScript), then the runtime (C++ interpreter) executes it without the original language's overhead.
Code
import os
import tempfile

import torch
import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(20, 5)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x


model = SimpleModel()
model.eval()  # inference mode: fixes the behavior of layers such as dropout

# Trace the model: run it once on an example input and record the operations.
dummy_input = torch.randn(1, 10)
scripted_model = torch.jit.trace(model, dummy_input)

with tempfile.TemporaryDirectory() as tmpdir:
    # Save the serialized graph and weights to disk.
    model_path = os.path.join(tmpdir, 'model.pt')
    scripted_model.save(model_path)
    print(f'Model saved to: {model_path}')
    print(f'File size: {os.path.getsize(model_path) / 1024:.2f} KB')

    # Reload the file, as a deployment would, and run inference.
    loaded_model = torch.jit.load(model_path)
    test_input = torch.randn(1, 10)
    with torch.no_grad():
        output = loaded_model(test_input)
    print(f'Output shape: {output.shape}')
    print(f'Output: {output}')

    # Inspect the IR graph that the runtime interprets.
    script_output = loaded_model.graph
    print(f'\nModel graph (IR):')
    print(script_output)
Output (illustrative; the path, file size, values, and exact IR text vary by run and PyTorch version)
Model saved to: /tmp/tmpXXXXXXXX/model.pt
File size: 1.23 KB
Output shape: torch.Size([1, 5])
Output: tensor([[-0.1234, 0.5678, -0.9012, 0.3456, -0.2789]])
Model graph (IR):
graph(%self : __module.SimpleModel,
      %x : Tensor):
  %4 : Tensor = prim::Constant[value={}, dtype=None, requires_grad=False, device=None]()
  %linear1.weight : Tensor = prim::GetAttr[name="linear1"](%self)
  %linear1.bias : Tensor = prim::GetAttr[name="linear1"](%self)
  %8 : Tensor = aten::linear(%x, %linear1.weight, %linear1.bias)
  %9 : Tensor = aten::relu(%8)
  %linear2.weight : Tensor = prim::GetAttr[name="linear2"](%self)
  %linear2.bias : Tensor = prim::GetAttr[name="linear2"](%self)
  %13 : Tensor = aten::linear(%9, %linear2.weight, %linear2.bias)
  return (%13)
What just happened?
The code created a neural network, converted it to TorchScript via torch.jit.trace() (which runs the model once and records the operations as a graph), saved the serialized graph to disk, reloaded it, ran inference on the loaded model, and displayed the low-level IR graph that the C++ runtime would interpret. The model is now portable: the .pt file contains both the graph and the weights, so a C++ program linked against LibTorch can load it without Python.
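One way to see that the saved file is self-contained is to open it as an archive. The sketch below assumes a throwaway nn.Linear model and a temporary path, both hypothetical; it relies only on the fact that TorchScript files are written in zip format. The entry names inside the archive are an implementation detail and may differ between PyTorch versions.

```python
import os
import tempfile
import zipfile

import torch
import torch.nn as nn

# Hypothetical single-layer model; any traced or scripted module works.
model = nn.Linear(10, 5).eval()
traced = torch.jit.trace(model, torch.randn(1, 10))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.pt")
    traced.save(path)

    # The saved file is a zip archive bundling code and tensor data.
    is_zip = zipfile.is_zipfile(path)
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()

print("is a zip archive:", is_zip)
for name in names:
    print(" ", name)
```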
Common gotcha
torch.jit.trace() records only the operations executed for the example input. Python control flow that depends on tensor values or shapes (an if on a tensor value, a loop over the batch dimension) is not captured: the trace keeps whichever path the example input took and silently replays it for every later input, so results can be wrong without any error. Use torch.jit.script() instead when the model has control flow. It is stricter about Python syntax, though: only a subset of Python works in scripted models.
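The difference is easy to reproduce. In this sketch the Branchy module is hypothetical and exists only to put a data-dependent branch in forward(); it is traced with an input that takes the first branch, then both versions are run on an input that should take the second.

```python
import warnings

import torch
import torch.nn as nn


class Branchy(nn.Module):
    """Hypothetical module whose output depends on a tensor-valued condition."""

    def forward(self, x):
        if x.sum() > 0:
            result = x * 2
        else:
            result = -x
        return result


model = Branchy().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing a data-dependent branch emits a TracerWarning; silenced here for brevity.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(model, positive)  # records only the `x * 2` path

scripted = torch.jit.script(model)  # compiles both branches

eager_out = model(negative)
traced_out = traced(negative)
scripted_out = scripted(negative)

print("eager:   ", eager_out)
print("traced:  ", traced_out)
print("scripted:", scripted_out)
print("traced matches eager:  ", torch.equal(traced_out, eager_out))
print("scripted matches eager:", torch.equal(scripted_out, eager_out))
```

The traced module raises no error on the second input; it simply applies the branch it saw during tracing, which is why this kind of bug is easy to miss.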
Error recovery
RuntimeError: Could not export Python function 'forward'
AttributeError: module 'torch.jit' has no attribute 'load'
Expected a single Tensor, but got a list at export time
Size mismatch: size([1, 10]) != size([2, 10])
Experienced dev note
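The critical mistake: tracing your model with one batch size and deploying it to handle variable batch sizes without testing that the trace generalizes. A trace records operations, not shapes, so a model built from standard layers usually accepts other batch sizes. The trap is shape-dependent Python inside forward(): a loop over the batch, a size converted to a Python int, or a hard-coded reshape gets frozen at the value seen during tracing. Either keep shape handling in tensor operations (e.g., reshape inputs to [B, features] and let -1 infer the remaining dimension), script the shape-dependent parts with torch.jit.script(), or validate the traced model against eager output at several batch sizes before shipping. Also: profile before migrating to C++. If your bottleneck is data loading or network I/O, the C++ runtime gives you nothing.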
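A quick check along these lines is to trace at one batch size and run at another. The sketch below uses two hypothetical modules: a plain nn.Linear, and PerRow, whose Python loop over the batch dimension is unrolled at trace time. Both are traced with batch size 2 and then called with batch size 8.

```python
import warnings

import torch
import torch.nn as nn


class PerRow(nn.Module):
    """Hypothetical module with a Python loop over the batch dimension."""

    def forward(self, x):
        rows = [x[i] * (i + 1) for i in range(x.shape[0])]
        return torch.stack(rows)


linear = nn.Linear(100, 5).eval()
per_row = PerRow().eval()
trace_input = torch.randn(2, 100)

with warnings.catch_warnings():
    # The loop over the batch triggers a TracerWarning; silenced here for brevity.
    warnings.simplefilter("ignore")
    traced_linear = torch.jit.trace(linear, trace_input)
    traced_per_row = torch.jit.trace(per_row, trace_input)

batch8 = torch.randn(8, 100)
with torch.no_grad():
    linear_shape = tuple(traced_linear(batch8).shape)
    per_row_traced_shape = tuple(traced_per_row(batch8).shape)
    per_row_eager_shape = tuple(per_row(batch8).shape)

print("traced Linear on batch 8:", linear_shape)
print("traced PerRow on batch 8:", per_row_traced_shape)
print("eager PerRow on batch 8: ", per_row_eager_shape)
```

Comparing the traced and eager shapes for PerRow shows whether the loop was frozen at the traced batch size; the same comparison on real output values is a cheap pre-deployment test.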
Check your understanding
You traced a model with input shape [2, 100], saved it, and now want to run inference on batches of size 8 with the same 100-dim features. Will the loaded scripted model work unchanged? If not, what's the fix and why?
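Answer hint
It depends on what forward() does. Tracing records the operations that were executed, not the input shape, so a model made of standard layers such as nn.Linear will usually run on [8, 100] unchanged. It fails, or silently returns wrong results, only if forward() contains Python logic that depended on the batch size of 2 during tracing (a loop over rows, a size converted to a Python int, a hard-coded reshape). The fix is to remove that dependency or use torch.jit.script() for the shape-dependent code, then compare the loaded model's output with eager output at batch size 8 to confirm.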