Code Advanced hard · 8 min

Running without Python: C++ runtime

What you will learn

Export PyTorch models to C++ using TorchScript and deploy them without Python interpreter dependencies.

Why this matters

Production inference at scale requires eliminating Python's GIL bottleneck and runtime overhead. Mobile, edge, and high-throughput servers demand compiled binaries that don't require a Python environment or its startup latency.

Skip if: you need rapid model iteration, want to debug models in production, or your inference workload is I/O-bound rather than compute-bound. If Python latency isn't your bottleneck, the added complexity isn't worth it.

Explanation

TorchScript is a statically typed subset of Python, plus an intermediate representation (IR), that PyTorch's C++ runtime can execute. When you export a model via torch.jit.script() or torch.jit.trace(), PyTorch converts it into a serialized graph with its weights, which the C++ TorchScript interpreter in LibTorch reads: no Python needed.
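As a minimal sketch of the two export routes named above, the snippet below converts the same module with both torch.jit.trace() and torch.jit.script(). TinyNet and its layer sizes are hypothetical, chosen only for illustration. torch.jit.script() compiles the Python source, so the class has to live in a real .py file.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()
example = torch.randn(1, 4)

# Route 1: run the model once and record the executed ops
traced = torch.jit.trace(model, example)
# Route 2: compile the Python source of forward() directly
scripted = torch.jit.script(model)

# Both routes yield a ScriptModule that can be saved for the C++ runtime
print(type(traced).__name__, type(scripted).__name__)
# .code shows the TorchScript recovered for forward()
print(scripted.code)
```

For a model with no control flow like this one, both routes should compute the same result; they diverge once forward() contains branches or loops.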

The C++ runtime works like this: you load the serialized model file with torch::jit::load(), create tensor inputs using ATen (PyTorch's C++ tensor library), run inference via forward(), and extract outputs from the returned IValue, all without touching a Python interpreter. Removing the interpreter cuts per-call Python overhead and process startup cost. Figures such as 10-50% faster execution and a 3-5x smaller runtime memory footprint are rough estimates that depend on the model and hardware, so benchmark your own workload.

Use this when deploying to production servers where Python startup is measured in hundreds of milliseconds, or when you're building microservices that need sub-5ms latency per request. It's also essential for mobile and embedded deployment where Python isn't available.

Analogy

It's like compiling Java source to bytecode and running it on a JVM. Your source code (the Python model) becomes a portable intermediate form (TorchScript), and a separate runtime (the C++ interpreter in LibTorch) executes it without needing the original language's toolchain.

Code

python

import torch
import torch.nn as nn
import tempfile
import os

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(20, 5)
    
    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

model = SimpleModel()
model.eval()  # eval mode (matters for layers such as dropout and batch norm)

# Trace: run the model once on an example input and record the ops as a graph
dummy_input = torch.randn(1, 10)
traced_model = torch.jit.trace(model, dummy_input)

with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, 'model.pt')
    traced_model.save(model_path)  # the file torch::jit::load() would read in C++
    print(f'Model saved to: {model_path}')
    print(f'File size: {os.path.getsize(model_path) / 1024:.2f} KB')

    # Reload in Python as a stand-in for the C++ loader
    loaded_model = torch.jit.load(model_path)
    test_input = torch.randn(1, 10)
    with torch.no_grad():
        output = loaded_model(test_input)
    print(f'Output shape: {output.shape}')
    print(f'Output: {output}')

    # Inspect the TorchScript IR that the runtime interprets
    print('\nModel graph (IR):')
    print(loaded_model.graph)

Output

Model saved to: /tmp/tmpXXXXXXXX/model.pt
File size: 1.23 KB
Output shape: torch.Size([1, 5])
Output: tensor([[-0.1234,  0.5678, -0.9012,  0.3456, -0.2789]])

Model graph (IR):
graph(%self : __module.SimpleModel,
      %x : Tensor):
  %4 : Tensor = prim::Constant[value={}, dtype=None, requires_grad=False, device=None]()
  %linear1.weight : Tensor = prim::GetAttr[name="linear1"](%self)
  %linear1.bias : Tensor = prim::GetAttr[name="linear1"](%self)
  %8 : Tensor = aten::linear(%x, %linear1.weight, %linear1.bias)
  %9 : Tensor = aten::relu(%8)
  %linear2.weight : Tensor = prim::GetAttr[name="linear2"](%self)
  %linear2.bias : Tensor = prim::GetAttr[name="linear2"](%self)
  %13 : Tensor = aten::linear(%9, %linear2.weight, %linear2.bias)
  return (%13)

What just happened?

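The code created a neural network, converted it to TorchScript via torch.jit.trace() (which runs the model once and records the operations as a graph), saved the serialized graph to disk, reloaded it, ran inference on the loaded model, and displayed the low-level IR graph that the C++ runtime would interpret. The printed values and IR above are illustrative: weights and inputs are random, and exact node names and graph layout vary between PyTorch versions (a traced module's top-level graph may show calls into submodules rather than inlined aten ops). The model is now portable: the .pt file holds the graph and weights, so a C++ program linked against LibTorch can run it without Python.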
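Before handing the file to a C++ service, it can help to confirm that the reloaded module reproduces the eager model's outputs. The sketch below does the save/load round trip through an in-memory buffer instead of a file; the layer sizes mirror the example above, and the 1e-5 threshold mentioned afterwards is an arbitrary assumption.

```python
import io
import torch
import torch.nn as nn

# Same layer sizes as the lesson's model, built inline for brevity
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5)).eval()
traced = torch.jit.trace(model, torch.randn(1, 10))

# Round trip through an in-memory buffer (torch.jit.save/load accept file-like objects)
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

# Compare eager and reloaded outputs on a fresh input
x = torch.randn(1, 10)
with torch.no_grad():
    eager_out = model(x)
    reloaded_out = reloaded(x)

max_abs_diff = (eager_out - reloaded_out).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

A difference well below a small tolerance such as 1e-5 suggests the export preserved the computation for that input; it says nothing about inputs that take a different code path.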

Common gotcha

torch.jit.trace() only records the operations that ran for your example input. Python control flow that depends on tensor values (if statements, loops with data-dependent length) is frozen to the path that input took, and shapes converted to plain Python numbers become constants. If real data would take a different branch, the traced graph silently computes the wrong thing. Use torch.jit.script() for control flow instead, but note that it is stricter: only a subset of Python works in scripted models.
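The difference is easiest to see on a tiny module with a data-dependent branch. In this sketch, Gate is a hypothetical module invented for the demonstration: tracing it with a positive input records only the "+ 10" path, while scripting keeps the condition. The class must be defined in a .py file for torch.jit.script() to read its source.

```python
import warnings
import torch
import torch.nn as nn

class Gate(nn.Module):  # hypothetical module with a data-dependent branch
    def forward(self, x):
        if x.sum() > 0:
            return x + 10
        return x - 10

model = Gate().eval()
pos = torch.ones(3)
neg = -torch.ones(3)

with warnings.catch_warnings():
    # The tracer warns that converting a tensor to a Python bool may be unsafe
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(model, pos)   # records only the "x + 10" path
scripted = torch.jit.script(model)         # compiles the if/else itself

traced_neg = traced(neg)      # replays "+ 10" even though the sum is negative
scripted_neg = scripted(neg)  # evaluates the condition and takes "- 10"
print(traced_neg, scripted_neg)
```

The traced module raises no error on the negative input; it just returns the wrong values, which is why this failure mode is easy to miss.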

Error recovery

RuntimeError: Could not export Python function 'forward'

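Your model calls Python code that TorchScript cannot serialize, such as arbitrary Python functions, third-party library calls, or unsupported Python objects. Rewrite that code using PyTorch ops or TorchScript-compatible constructs. torch.jit.trace() can help when the Python code only orchestrates tensor ops, since tracing records the ops instead of compiling the source.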

AttributeError: module 'torch.jit' has no attribute 'load'

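torch.jit.load() has shipped since PyTorch 1.0, so this error usually means an extremely old install or a local file or folder named torch (for example torch.py) shadowing the real package. Print torch.__version__ and torch.__file__ to check, then upgrade (this lesson targets 2.11.x) or rename the shadowing file.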

Expected a single Tensor, but got a list at export time

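Your model returns a list or dict, which the tracer rejects by default. Return a tuple of tensors instead, or pass strict=False to torch.jit.trace() if the container's structure never depends on the inputs. On the C++ side, forward() returns a torch::jit::IValue: call toTensor() for a single tensor or toTuple() for a tuple.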
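For multi-output models, returning a tuple of tensors is usually the least troublesome shape for export. A minimal sketch, assuming a hypothetical two-headed module (TwoHeads and its sizes are made up for illustration):

```python
import torch
import torch.nn as nn

class TwoHeads(nn.Module):  # hypothetical model with two outputs
    def __init__(self):
        super().__init__()
        self.logits_head = nn.Linear(10, 3)
        self.score_head = nn.Linear(10, 1)

    def forward(self, x):
        # A tuple of tensors traces cleanly with the default strict=True
        return self.logits_head(x), self.score_head(x)

traced = torch.jit.trace(TwoHeads().eval(), torch.randn(1, 10))

with torch.no_grad():
    logits, score = traced(torch.randn(4, 10))
print(logits.shape, score.shape)
```

A C++ caller would then unpack the tuple from the IValue that forward() returns, rather than calling toTensor() on it directly.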

Size mismatch: size([1, 10]) != size([2, 10])

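Some operation in forward() has the traced shape baked in, for example a hard-coded reshape such as x.view(1, 10) or a shape that was converted to a Python int. Passing strict=False does not help here; that flag only controls list/dict outputs. Keep shapes symbolic with x.view(x.size(0), -1) or torch.flatten(x, 1), or use torch.jit.script() for the shape-dependent logic, then retrace. Plain layers such as nn.Linear accept any batch size after tracing.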

Experienced dev note

The critical mistake: tracing your model with one example input and assuming the trace generalizes. Ordinary layers such as nn.Linear and convolutions stay batch-size agnostic after tracing, but any shape that forward() turns into a Python number (a hard-coded reshape, int(x.shape[0]) feeding Python logic) and any data-dependent branch is frozen at trace time. Either keep shapes symbolic (x.view(x.size(0), -1), torch.flatten(x, 1)), script the shape-dependent parts, or accept that the traced model is locked to the traced shapes. Always validate the trace against several representative input shapes before shipping. Also: profile before migrating to C++. If your bottleneck is data loading or network I/O, the C++ runtime gives you nothing.
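One way to check whether a particular trace generalizes over batch size is to hand torch.jit.trace() extra inputs through its check_inputs argument and then compare against the eager model. The sketch below assumes a model built only from shape-agnostic layers; the sizes are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5)).eval()

# Trace with batch size 1, and ask the tracer to verify the trace on batch size 8
traced = torch.jit.trace(
    model,
    torch.randn(1, 10),
    check_inputs=[(torch.randn(8, 10),)],
)

# Independent check: compare traced and eager outputs on a different batch size
batch = torch.randn(8, 10)
with torch.no_grad():
    traced_out = traced(batch)
    eager_out = model(batch)

matches = torch.allclose(traced_out, eager_out, atol=1e-6)
print(traced_out.shape, matches)
```

If a model hard-codes a shape inside forward(), this kind of check is where the problem should surface, before the file reaches a C++ service.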

Check your understanding

You traced a model with input shape [2, 100], saved it, and now want to run inference on batches of size 8 with the same 100-dim features. Will the loaded traced model work unchanged? If not, what's the fix and why?

Show answer hint

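It depends on what forward() does. A trace records operations, not a fixed input shape, so a model built from shape-agnostic ops (linear layers, activations, reshapes driven by x.size(0)) runs on [8, 100] unchanged. It breaks, or silently misbehaves, only if forward() hard-codes the batch size or converts a shape to a Python number during tracing. The fix is to keep shapes symbolic or use torch.jit.script() for the shape-dependent code, then retrace and verify with inputs of several batch sizes (for example via the check_inputs argument of torch.jit.trace()).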

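VERSION PyTorch 2.11.x (March 2026): torch.jit.trace() and torch.jit.script() are long-standing APIs, but recent PyTorch documentation describes TorchScript as no longer under active development and points new projects toward torch.export, so check the release notes for your version. torch.onnx.export() is the usual choice when you need cross-framework compatibility, while TorchScript remains a direct path for LibTorch-based C++ deployment. To optimize a model before saving, use torch.jit.freeze(), which inlines parameters and attributes into the graph as constants, optionally followed by torch.jit.optimize_for_inference().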
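A minimal sketch of freezing before export, assuming the same layer sizes as the lesson's model. torch.jit.freeze() expects a ScriptModule in eval mode; whether it speeds up or shrinks a given model is something to measure, not assume.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5)).eval()
example = torch.randn(1, 10)

traced = torch.jit.trace(model, example)
# Freezing inlines weights and attributes into the graph as constants
frozen = torch.jit.freeze(traced.eval())

# The frozen module should compute the same result as the traced one
with torch.no_grad():
    same = torch.allclose(traced(example), frozen(example), atol=1e-6)
print(same)

# frozen.save("model_frozen.pt") would write the file for the C++ loader
# (the file name here is a placeholder)
```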

Optimizing TorchScript inference with torch.jit.freeze() and operator fusion to squeeze maximum performance from your compiled runtime.
