Code Beginner easy · 5 min

Tensor device: cpu vs cuda

What you will learn

Tensors live on either CPU or GPU (CUDA), and you must explicitly move them to perform operations together.

Why this matters

If you mix CPU and GPU tensors in one operation, PyTorch raises a RuntimeError. Knowing how to check and move tensors is essential before training any model on GPU.

Skip if: you're only prototyping on CPU-only machines, where device management never comes up. The moment you move to a GPU (even on Colab), this becomes mandatory knowledge.

Explanation

Every PyTorch tensor has a device attribute that specifies where it lives: CPU RAM or GPU memory (CUDA). Tensors can only perform operations with other tensors on the same device.

When you create a tensor without specifying a device, it defaults to CPU. To use GPU acceleration, you must either (1) create tensors directly on GPU, or (2) move CPU tensors to GPU using .to(device) or .cuda(). The device object is either torch.device('cpu') or torch.device('cuda'). You can check if CUDA is available on your machine with torch.cuda.is_available().
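
As a minimal sketch of the two options above, the snippet below allocates one tensor directly on the chosen device with the device= keyword and moves another with .to(device). The variable names are illustrative only, and on a CPU-only machine both tensors simply stay on the CPU.

```python
import torch

# Fall back to CPU when no CUDA device is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Option 1: allocate directly on the target device.
created_on_device = torch.zeros(2, 3, device=device)

# Option 2: create on CPU (the default), then move to the target device.
moved_to_device = torch.ones(2, 3).to(device)

# Both tensors now share a device, so they can be combined.
total = created_on_device + moved_to_device
print(total.device)
```

Creating the tensor on the device directly avoids an intermediate CPU allocation, which is why option 1 is usually preferred when the data does not already exist on the CPU.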

The pattern is: check if GPU exists, set a device variable, then move all tensors and your model to that device at the start of your training script.
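
The pattern above can be sketched roughly as follows. The nn.Linear layer and the tensor shapes are placeholders chosen for illustration, not part of any particular training script.

```python
import torch
from torch import nn

# 1. Check for a GPU and pick the device once.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 2. Move the model to that device (this moves its parameters).
model = nn.Linear(4, 2).to(device)

# 3. Move the input to the same device before the forward pass.
inputs = torch.randn(3, 4).to(device)
outputs = model(inputs)

print(outputs.shape)   # torch.Size([3, 2])
print(outputs.device)
```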

Analogy

Think of CPU and GPU as two separate mail rooms. A letter (tensor) sitting in the CPU mail room can't be processed by the GPU mail room workers: you must physically move the letter to the GPU mail room first. Once it's there, the GPU workers can operate on it at high speed.

Code

python

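import torch

# Report whether a CUDA-capable GPU is visible to PyTorch.
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")

# Pick the GPU when available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Tensors are created on CPU unless a device is specified.
tensor_cpu = torch.randn(3, 4)
print(f"tensor_cpu device: {tensor_cpu.device}")

# .to(device) returns a tensor on the target device; capture the result.
tensor_gpu = tensor_cpu.to(device)
print(f"tensor_gpu device: {tensor_gpu.device}")

# .cpu() brings a tensor back to CPU memory.
tensor_back = tensor_gpu.cpu()
print(f"tensor_back device: {tensor_back.device}")

# Mixing devices raises RuntimeError on a GPU machine.
# On a CPU-only machine both tensors are on CPU, so nothing is raised.
try:
    result = tensor_cpu + tensor_gpu
except RuntimeError as e:
    print(f"Error when mixing devices: {type(e).__name__}")

# Operations on tensors that share a device work as expected.
tensor_cpu2 = torch.randn(3, 4)
result = tensor_cpu + tensor_cpu2
print(f"Same device operation successful: {result.device}")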

Output

CUDA available: False
CUDA device count: 0
Using device: cpu
tensor_cpu device: cpu
tensor_gpu device: cpu
tensor_back device: cpu
Same device operation successful: cpu

What just happened?

The code checked for GPU availability (False in a CPU-only environment), set a device variable that falls back to CPU, and created a tensor on CPU. Calling .to(device) left the tensor on CPU because CUDA was unavailable. The mixed-device addition only fails on a GPU machine, where tensor_gpu actually lives on CUDA; the final addition of two CPU tensors succeeds everywhere. The key line is <code>.to(device)</code>, which returns a tensor on the specified device (the same tensor if it is already there).

Common gotcha

The most common mistake: creating a model, moving it to GPU with model.to(device), but then forgetting to move your data (input tensors and labels) to the same device before passing them to the model. The model is on GPU but your batch is on CPU: boom, RuntimeError. Always move both model AND data.
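
A minimal training-loop sketch of the fix, assuming a toy dataset and a single linear layer (both made up for illustration). The point is only where the .to(device) calls sit: once for the model, and once per batch for both inputs and labels.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data: 32 samples, 4 features, 3 classes (illustrative only).
dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 3, (32,)))
loader = DataLoader(dataset, batch_size=8)

model = nn.Linear(4, 3).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for inputs, labels in loader:
    # The loader yields CPU tensors here; move BOTH inputs and labels.
    inputs = inputs.to(device)
    labels = labels.to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()

print(f"final batch loss: {loss.item():.4f}")
```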

Error recovery

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

You mixed devices. Find which tensor is on the wrong device (print tensor.device for each input), then call .to(device) on it before the operation.
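
One way to do that check is sketched below. report_devices is a hypothetical helper written for this example, not a PyTorch function; it just prints the device of each tensor passed to it.

```python
import torch

def report_devices(**tensors):
    """Hypothetical helper: print and return the device of each named tensor."""
    devices = {name: str(t.device) for name, t in tensors.items()}
    for name, dev in devices.items():
        print(f"{name}: {dev}")
    return devices

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weights = torch.randn(4, 2, device=device)
batch = torch.randn(3, 4)  # created on CPU by default

found = report_devices(weights=weights, batch=batch)

# Move the straggler to the other tensor's device before the operation.
if batch.device != weights.device:
    batch = batch.to(weights.device)

result = batch @ weights
print(result.device)
```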

RuntimeError: CUDA out of memory

Your GPU doesn't have enough memory. This isn't a device placement error but a size issue: reduce batch size or model size, or use CPU for that operation.

Experienced dev note

A pattern that saves hours of debugging: set device once at the top of your script, then use it consistently. Even better, pass device as a parameter through your training function. One gotcha: for tensors, .to(device) returns a new tensor on the target device and does not modify the original in place, so always capture the return value (tensor = tensor.to(device)). Tensors have no in-place variant; model.to(device), by contrast, moves the module's parameters in place. In performance-critical code, .to(device, non_blocking=True) can let the host continue with other work while the copy runs asynchronously, which generally requires the source tensor to be in pinned memory (advanced, but worth knowing).
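
A short sketch of the return-value point, with illustrative variable names. The non_blocking branch only runs when CUDA is present, and it assumes the source tensor has been placed in pinned memory with pin_memory() so the copy can actually overlap with host work.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch = torch.randn(8, 16)

# Discarding the return value leaves `batch` on its original device.
batch.to(device)
print(batch.device)  # cpu

# Capture the result instead.
batch = batch.to(device)
print(batch.device)

if device.type == "cuda":
    # non_blocking=True also returns a new tensor; it is not in-place.
    pinned = torch.randn(8, 16).pin_memory()
    on_gpu = pinned.to(device, non_blocking=True)
    # Wait for the asynchronous copy before relying on the data from the host.
    torch.cuda.synchronize()
    print(on_gpu.device)
```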

Check your understanding

If you have a model on CUDA and a batch of data on CPU, what happens when you call model(batch)? Why? How would you fix it with minimal code changes?

Show answer hint

A correct answer identifies that the forward pass will fail with a device mismatch error because the model weights are on CUDA but inputs are on CPU. The fix is <code>model(batch.to(device))</code> or moving the batch to device before any operation.

VERSION PyTorch 2.11.x uses the same device API as 2.6.x. The torch.device() and .to() methods are stable. No breaking changes in device handling between these versions.

Next, learn how to move entire models to a device using <code>model.to(device)</code> and verify it worked by checking model parameter locations: this is the production pattern for training setup.
