How to · Beginner · 3 min read

How to run LLMs on GPU with Hugging Face

Quick answer
To run LLMs on GPU with Hugging Face, load the model with from_pretrained and move it to the GPU with model.to('cuda') in PyTorch; TensorFlow models use a visible GPU automatically. Ensure you have a CUDA-enabled GPU with drivers installed and compatible versions of transformers and torch or tensorflow.

Prerequisites

  • Python 3.8+
  • pip install transformers torch (for PyTorch) or tensorflow (for TensorFlow)
  • CUDA-enabled GPU with drivers installed

Setup

Install the required libraries and verify GPU availability. Use pip install transformers torch for PyTorch or pip install transformers tensorflow for TensorFlow. Confirm your GPU is detected by PyTorch or TensorFlow.

bash
pip install transformers torch
# or for TensorFlow
pip install transformers tensorflow

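Before loading any model, it is worth confirming that the framework actually sees your GPU. A quick check with PyTorch:

```python
import torch

# Report whether PyTorch can see a CUDA GPU before loading any model
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected; models will run on CPU")
```

If this prints the CPU fallback message on a machine that has a GPU, fix the CUDA driver installation before going further.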
Step by step

This example shows loading a Hugging Face LLM with PyTorch and running inference on GPU.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Prepare input
input_text = "Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate output (max_new_tokens caps the generated length; max_length is deprecated)
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
output
Hello, how are you? (GPT-2 continues the prompt; the exact text depends on the model and decoding settings)

Common variations

For TensorFlow, load the model with TFAutoModelForCausalLM.from_pretrained, passing from_pt=True if the checkpoint only ships PyTorch weights. TensorFlow places operations on the GPU automatically when one is visible. For multi-GPU or mixed-precision setups in PyTorch, use the accelerate library. Streaming and async inference require additional setup.

python
from transformers import TFAutoModelForCausalLM, AutoTokenizer
import tensorflow as tf

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForCausalLM.from_pretrained(model_name)

input_text = "Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="tf")

# TensorFlow automatically uses GPU if available
outputs = model.generate(**inputs, max_new_tokens=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
output
Hello, how are you? (GPT-2 continues the prompt; the exact text depends on the model and decoding settings)
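For larger models, loading in half precision and letting accelerate place layers can make the difference between fitting on a GPU or not. A minimal sketch, assuming the accelerate package is installed (device_map="auto" requires it):

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "gpt2"
# device_map="auto" (needs the accelerate package) places layers on the
# available GPU(s); float16 roughly halves memory use versus float32
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
print(model.dtype)
```

With device_map="auto" you skip the explicit model.to(device) call; accelerate decides the placement for you, including splitting one model across several GPUs.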

Troubleshooting

  • If torch.cuda.is_available() returns False, check your CUDA installation and GPU drivers.
  • Out of memory errors can be mitigated by using smaller models or enabling mixed precision with accelerate.
  • Ensure your transformers and torch or tensorflow versions are compatible.
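When chasing out-of-memory errors, a quick report of current GPU memory use helps confirm whether the model actually fits. A small helper, report_gpu_memory, introduced here for illustration:

```python
import torch

def report_gpu_memory() -> str:
    """Summarize current GPU memory use, or note that no GPU is present."""
    if not torch.cuda.is_available():
        return "No CUDA GPU available"
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    return f"allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB"

print(report_gpu_memory())
```

Call it before and after from_pretrained to see how much memory the model itself claims, separate from what generation adds.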

Key Takeaways

  • Use model.to('cuda') in PyTorch to run Hugging Face LLMs on GPU.
  • TensorFlow models run on GPU automatically if available and properly installed.
  • Install CUDA drivers and verify GPU availability before running models.
  • Use the accelerate library for advanced GPU setups like multi-GPU or mixed precision.
  • Smaller models or batch sizes help avoid GPU memory errors.
Verified 2026-04 · gpt2