Code Beginner easy · 4 min

Loading a generation model

What you will learn

Use AutoModelForCausalLM to load a pre-trained language model for text generation with proper device and memory configuration.

Why this matters

Generating text is the foundation of most LLM applications (chatbots, summarization, code generation), and loading the model correctly determines whether your code runs on your hardware without crashes or hangs.

Skip if: You're doing classification (use AutoModelForSequenceClassification) or token tagging (use AutoModelForTokenClassification), or you're working with an encoder-only model such as BERT, which is not built for text generation.

Explanation

What it is: AutoModelForCausalLM is a wrapper that automatically detects a model's architecture and loads it configured for text generation: predicting the next token given previous tokens. It's the standard entry point for loading any generative transformer (GPT-2, Llama, Mistral, etc.) from the Hugging Face model hub.

How it works: When you call AutoModelForCausalLM.from_pretrained(), transformers downloads the model's config file and weights, reads the architecture from the config, instantiates the matching class (e.g., GPT2LMHeadModel or LlamaForCausalLM), and loads the weights into it. The device_map='auto' parameter (which requires the accelerate package) places the model on the available hardware, filling GPU memory first and spilling onto CPU when a large model doesn't fit. torch_dtype=torch.bfloat16 stores each weight in 2 bytes instead of float32's 4, halving weight memory, usually without significant quality loss on GPUs with native bfloat16 support.
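The dispatch step described above can be pictured with a toy sketch. The registry here is a hypothetical three-entry stand-in written for illustration, not the library's real lookup table, and resolve_class_name is an invented helper. The only detail taken from the real format is that a Hugging Face config.json names its architecture family in a model_type key.

```python
import json

# Hypothetical, hand-written registry for illustration only.
# The class names are real transformers classes; the table itself is not.
CAUSAL_LM_CLASSES = {
    "gpt2": "GPT2LMHeadModel",
    "llama": "LlamaForCausalLM",
    "mistral": "MistralForCausalLM",
}


def resolve_class_name(config_json: str) -> str:
    """Return the class name registered for the config's model_type."""
    config = json.loads(config_json)
    model_type = config.get("model_type")
    if model_type not in CAUSAL_LM_CLASSES:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return CAUSAL_LM_CLASSES[model_type]


# A trimmed-down stand-in for GPT-2's config.json.
gpt2_config = '{"model_type": "gpt2", "n_layer": 12}'
print(resolve_class_name(gpt2_config))  # GPT2LMHeadModel
```

An unknown model_type raises ValueError in this sketch, which mirrors why an outdated install can fail on a newly released architecture.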

When to use it: Use this whenever you're loading a model from the hub for text generation. It's idiomatic, forward-compatible, and handles model discovery automatically: no need to know the exact class name.

Analogy

It's like using a universal USB adapter: you plug in your device (model name) and it automatically detects whether it's HDMI, DisplayPort, or USB-C underneath, then configures itself correctly. You don't need to know the internal wiring.

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

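model_name = 'gpt2'  # any causal LM repo id from the hub works here

# The tokenizer is loaded separately; it is not used until you generate text.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',           # needs the accelerate package installed
    torch_dtype=torch.bfloat16,  # 2 bytes per weight instead of float32's 4
)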

print(f'Model loaded: {model_name}')
print(f'Model type: {type(model).__name__}')
print(f'Device: {next(model.parameters()).device}')
print(f'Dtype: {next(model.parameters()).dtype}')

Output

Model loaded: gpt2
Model type: GPT2LMHeadModel
Device: cpu
Dtype: torch.bfloat16

What just happened?

The code downloaded GPT-2's weights and config from huggingface.co, inspected the architecture, instantiated a GPT2LMHeadModel, and loaded the weights into it with bfloat16 precision. Because no GPU was available on this machine, device_map='auto' placed the model on the CPU; with a GPU present it would have used that instead. The four print statements confirmed the model name, its internal class type, which device it lives on, and its numeric precision.
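To see what the dtype choice means for memory, here is a back-of-the-envelope sketch. It assumes a round figure of about 124 million parameters for the base gpt2 checkpoint and counts the weights only; weight_megabytes is a hypothetical helper, not a library function.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_megabytes(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in MB (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Assumption: rounded parameter count of the base 'gpt2' checkpoint.
GPT2_PARAMS = 124_000_000

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_megabytes(GPT2_PARAMS, dtype):.0f} MB")
# float32: ~496 MB
# bfloat16: ~248 MB
```

Activations and the generation cache add to these figures, so treat them as a lower bound rather than a total.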

Common gotcha

Without device_map='auto' (or an explicit move to the GPU), the model loads on CPU, which is very slow for generation and can exhaust system RAM with large models. Also, set torch_dtype at load time: you can still cast afterwards with model.to(torch.bfloat16), but by then the weights have already been loaded at the default precision, so the peak-memory saving is lost. A second subtle gotcha: bfloat16 needs native hardware support (NVIDIA Ampere or newer); on older cards you may hit errors or much slower execution, so use torch_dtype=torch.float16 as a fallback there.
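One way to avoid hard-coding a dtype is a small selection helper, sketched below. The policy (float32 on CPU, bfloat16 where supported, float16 otherwise) is an assumption chosen for safety rather than a library default, and both function names are hypothetical. torch is imported lazily so the pure decision logic runs without it.

```python
def pick_dtype_name(cuda_available: bool, bf16_supported: bool) -> str:
    """Pure decision logic: choose a dtype name from hardware capabilities."""
    if not cuda_available:
        return "float32"  # conservative choice for CPU-only machines
    return "bfloat16" if bf16_supported else "float16"


def detect_dtype_name() -> str:
    """Query the local machine; requires torch to be installed."""
    import torch  # lazy import keeps pick_dtype_name usable without torch

    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    return pick_dtype_name(cuda, bf16)


print(pick_dtype_name(cuda_available=False, bf16_supported=False))  # float32
print(pick_dtype_name(cuda_available=True, bf16_supported=False))   # float16
print(pick_dtype_name(cuda_available=True, bf16_supported=True))    # bfloat16
```

The returned name can be turned into a dtype with getattr(torch, detect_dtype_name()) and passed as torch_dtype.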

Error recovery

OSError: Can't load model

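The model name doesn't exist on the hub, or the repo is private or gated and you aren't authenticated. Verify the name and spelling at huggingface.co/models, and check that the repo is public or that you are logged in with access to it.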

RuntimeError: CUDA out of memory

The model is too large for your GPU. Load in half precision with torch_dtype=torch.float16 (or bfloat16 if supported), or quantize to 8-bit by passing quantization_config=BitsAndBytesConfig(load_in_8bit=True) (requires bitsandbytes: pip install bitsandbytes; passing load_in_8bit=True directly is the older, deprecated form). Or load on CPU with device_map='cpu'.

ValueError: Unrecognized model identifier

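The repo is missing a valid config.json, or its config names an architecture your installed transformers version doesn't recognize. Try upgrading transformers first. If the model ships custom code, try AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True), but only for sources you trust, because this executes code from the repo.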

OutOfMemoryError even with float16

The model genuinely doesn't fit your hardware. Use a smaller model (e.g., 'distilgpt2' instead of 'gpt2-xl'), or quantize to 4-bit (for example with bitsandbytes via BitsAndBytesConfig(load_in_4bit=True), or with a GPTQ loader such as AutoGPTQForCausalLM).
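The recovery steps above amount to trying progressively cheaper configurations until one fits. A minimal sketch of that idea follows; load_with_fallback and fake_loader are hypothetical names, and the demo uses Python's built-in MemoryError with a stand-in loader so it runs without a GPU.

```python
from typing import Any, Callable, Iterable


def load_with_fallback(
    loader: Callable[..., Any],
    attempts: Iterable[dict],
    oom_errors: tuple = (MemoryError,),
) -> tuple:
    """Try each kwargs dict in order; return (model, kwargs) for the first that fits."""
    last_error = None
    for kwargs in attempts:
        try:
            return loader(**kwargs), kwargs
        except oom_errors as err:
            last_error = err  # remember the failure and try the next option
    raise RuntimeError("No configuration fit in memory") from last_error


def fake_loader(model_name: str, device_map: str) -> str:
    """Stand-in for a real loader: pretends any GPU placement runs out of memory."""
    if device_map == "auto":
        raise MemoryError("pretend the GPU is full")
    return f"{model_name} on {device_map}"


attempts = [
    {"model_name": "gpt2-xl", "device_map": "auto"},
    {"model_name": "distilgpt2", "device_map": "cpu"},
]
model, used = load_with_fallback(fake_loader, attempts)
print(model)  # distilgpt2 on cpu
```

In real use the loader would wrap AutoModelForCausalLM.from_pretrained, and oom_errors would include torch.cuda.OutOfMemoryError.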

Experienced dev note

In transformers 4.x, the default was CPU + full precision (float32), which made many code examples deceptively slow in production. In 5.x, pair device_map='auto' with torch_dtype=torch.bfloat16 as your baseline on hardware that supports it. Never assume a GPU is available: check torch.cuda.is_available() before choosing float16, because device_map decides placement only and does not pick a dtype for you. Also, from_pretrained() blocks and can be slow on first run, since it may download gigabytes of weights; show a loading indicator, and for reproducible CI/CD point the HF_HOME environment variable at a persistent cache directory (HF_HOME=/path/to/cache).
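The cache advice can also be applied from Python, as sketched below. The helper name configure_hf_cache and the temporary directory are hypothetical; HF_HOME is the environment variable named above. The variable should be set before transformers is imported, because the cache location is read at import time.

```python
import os
import tempfile
from pathlib import Path


def configure_hf_cache(cache_dir: str) -> str:
    """Point HF_HOME at cache_dir unless the environment already sets it."""
    existing = os.environ.get("HF_HOME")
    if existing:
        return existing  # respect a value set by the shell or CI system
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(path)
    return str(path)


# Hypothetical location for the demo; CI would use a persistent, cached path.
cache_root = configure_hf_cache(os.path.join(tempfile.gettempdir(), "hf-cache-demo"))
print(f"HF_HOME={cache_root}")

# Import transformers only after this point so it picks up the setting.
```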

Check your understanding

You load a model with device_map='auto' and torch_dtype=torch.bfloat16, but generation is still very slow. What are two possible causes, and how would you diagnose which one?

Answer hint

A correct answer identifies that slowness could be from (1) the model still being on CPU, or partly offloaded to CPU or disk, despite device_map='auto' (check next(model.parameters()).device and model.hf_device_map), or (2) the GPU lacking native bfloat16 support, so kernels run slowly. Diagnosis: print the device and dtype after loading, and check torch.cuda.is_bf16_supported().
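A model loaded with device_map exposes its placement as model.hf_device_map, a dict from module name to device (a GPU index, 'cpu', or 'disk'). The sketch below summarizes such a dict to spot offloading; summarize_device_map is a hypothetical helper, and example_map is made-up data in that shape rather than output from a real load.

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> dict:
    """Count modules per device and list devices that imply slow offloading."""
    counts = Counter(str(device) for device in device_map.values())
    offloaded = sorted(d for d in counts if d in ("cpu", "disk"))
    return {"counts": dict(counts), "offloaded_to": offloaded}


# Made-up placement in the shape of model.hf_device_map.
example_map = {
    "transformer.wte": 0,
    "transformer.h.0": 0,
    "transformer.h.1": "cpu",
    "lm_head": "disk",
}
report = summarize_device_map(example_map)
print(report)
```

A non-empty offloaded_to list means some layers run from CPU or disk, which is a likely cause of slow generation even when the first parameter reports a GPU device.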

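VERSION: Argument names and defaults shift between major releases, so check the release notes for the version you have installed. Recent releases accept dtype= as the preferred spelling of torch_dtype= and warn on the old name. If you rely on the 4.x device_map strings 'balanced' or 'sequential', confirm in the 5.x release notes that they are still accepted before upgrading. Calling model.to('cuda') after loading still works for a model that fits on one GPU, but setting device_map during from_pretrained() loads the weights directly onto the target device. Note that torch_dtype=torch.float16 does not enable mixed precision: it stores and runs the weights in half precision, while mixed precision is a separate mechanism (torch.autocast).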

Now that you've loaded the model, learn how to tokenize input text and pass it to the model so it can actually generate predictions.
