Code Beginner easy · 5 min

AutoModelForCausalLM.from_pretrained(): text generation

What you will learn
Load a pretrained language model from Hugging Face Hub and generate text with it.

Why this matters

This is the most direct way to start generating text with a pretrained model without training anything yourself, and it is the usual entry point for LLM work in transformers.

Skip if: You are building a custom architecture from scratch, or you only need quick inference through pipeline() and have no memory or speed constraints.

Explanation

AutoModelForCausalLM.from_pretrained() loads a pretrained causal language model (models that predict the next token given previous tokens) from the Hugging Face Hub directly into memory.
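To make "predict the next token given previous tokens" concrete, here is a toy sketch in plain Python. The lookup table NEXT_TOKEN_SCORES and the function greedy_generate are invented for illustration; a real causal LM conditions on the whole prefix and scores every vocabulary entry with a neural network, not a table keyed on the last token.

```python
# Toy stand-in for a causal LM: scores for the next token given the last one.
# The table and all names here are hypothetical, for illustration only.
NEXT_TOKEN_SCORES = {
    "Once": {"upon": 0.9, "more": 0.1},
    "upon": {"a": 0.8, "the": 0.2},
    "a": {"time": 0.7, "hill": 0.3},
    "time": {"<eos>": 1.0},
}

def greedy_generate(prompt_tokens, max_length):
    """Append the highest-scoring next token until <eos> or max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if scores is None:
            break  # the toy table has no entry for this token
        next_token = max(scores, key=scores.get)  # greedy: take the argmax
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(greedy_generate(["Once"], max_length=10))
# ['Once', 'upon', 'a', 'time']
```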

Mechanically: it downloads the model weights and config from the Hub if they are not cached locally, instantiates the model class named by config.json, loads the weights into it, and returns the model in evaluation mode. By default the model stays on CPU; it is placed on a GPU only if you pass device_map (which requires the accelerate package) or move it yourself with .to(device). You pair it with an AutoTokenizer to convert text to tokens, run the model, and decode the output back to text.
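The step where the model class is chosen from config.json can be pictured as a registry lookup. The sketch below is a simplified illustration, not the library's actual implementation: MODEL_REGISTRY, ToyGPT2, and auto_from_config are hypothetical names, and the JSON string is a trimmed-down stand-in for a real config file.

```python
import json

class ToyGPT2:
    """Hypothetical stand-in for a real model class."""
    def __init__(self, config):
        self.n_layer = config["n_layer"]

# Hypothetical registry: maps the config's model_type to a class.
MODEL_REGISTRY = {"gpt2": ToyGPT2}

def auto_from_config(config_json):
    """Pick and build a model class from the model_type field."""
    config = json.loads(config_json)
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type](config)

model = auto_from_config('{"model_type": "gpt2", "n_layer": 12}')
print(type(model).__name__, model.n_layer)
# ToyGPT2 12
```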

Use this when you want a working LLM for inference or fine-tuning: it's the canonical starting point before you optimize or customize.

Analogy

Think of from_pretrained() as checking out a library book. The Hub is the library, the model ID is the book title, and your cache is your bookshelf. The first time you 'check out' a model, it downloads; next time, it comes straight from your shelf.
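The bookshelf behaviour can be sketched with a dictionary acting as the cache. This is a toy model of the idea only: fetch_model, fake_download, and the in-memory cache are hypothetical, whereas the real library stores downloaded files on disk.

```python
download_log = []  # records every simulated network fetch

def fake_download(model_id):
    """Hypothetical stand-in for fetching weights from the Hub."""
    download_log.append(model_id)
    return f"<weights for {model_id}>"

def fetch_model(model_id, cache):
    """Return cached weights if present; otherwise download and store them."""
    if model_id not in cache:
        cache[model_id] = fake_download(model_id)
    return cache[model_id]

cache = {}
fetch_model("gpt2", cache)  # first call: goes to the "library"
fetch_model("gpt2", cache)  # second call: served from the "bookshelf"
print(download_log)
# ['gpt2']
```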

Code

python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
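# device_map="auto" requires the accelerate package; omit it to load on CPU
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float32)

prompt = "Once upon a time"
# Keep the input tensors on the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,  # passes both input_ids and attention_mask
    max_length=50,  # cap on total length: prompt tokens + generated tokens
    num_beams=1,
    do_sample=False,  # greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)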

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Output (illustrative; your text may differ, and real output stops at 50 tokens including the prompt)
Once upon a time, there was a man who lived in a small village. He was a carpenter, and he was very good at his job. One day, a rich man came to him and asked him to build a house for him. The carpenter agreed, and he started to work on the house.

What just happened?

The code loaded GPT-2 from the Hub, tokenized the prompt 'Once upon a time' into token IDs, and passed those IDs to the model's generate() method, which extended the sequence greedily up to 50 tokens in total (prompt included). It then decoded the token IDs back to readable text and printed it.
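Because max_length counts the prompt as well as the continuation, the number of newly generated tokens is the difference between the two. The helper below is a hypothetical illustration of that arithmetic; the prompt length of 4 is an assumed token count for 'Once upon a time', so check it against your tokenizer.

```python
def new_token_budget(prompt_length, max_length):
    """How many tokens generation may add when max_length caps the total."""
    return max(0, max_length - prompt_length)

# Assumption: the prompt tokenizes to 4 tokens.
print(new_token_budget(prompt_length=4, max_length=50))
# 46
```

generate() also accepts max_new_tokens, which caps only the generated part and is usually easier to reason about.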

Common gotcha

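Without device_map (or an explicit .to(device)), the model loads on CPU, where generation is noticeably slower, and the gap widens quickly as models get larger. device_map='auto' needs the accelerate package installed. Whichever route you choose, keep the input tensors on the same device as the model, or generate() can fail with a device mismatch error.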
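One way to make the device choice explicit is a small helper like the one below. pick_device is a hypothetical name, and the sketch only distinguishes CUDA from CPU; it falls back to CPU when PyTorch is not installed or no GPU is visible.

```python
def pick_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

With this approach you would load the model without device_map, call model.to(device), and move the tokenizer output with .to(device) as well.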

Error recovery

OutOfMemoryError
The model is too large for your GPU. Add torch_dtype=torch.float16 or torch_dtype=torch.bfloat16 to reduce memory, or use BitsAndBytesConfig for 4-bit quantization.
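ConnectionError or OSError
The Hub is unreachable or the model ID doesn't exist (a wrong ID usually surfaces as an OSError saying the name is not a valid model identifier). Check your internet connection, verify the model name on huggingface.co/models, and make sure you are authenticated for any private models.
TypeError or AttributeError raised inside generate() (exact message varies by version)
You didn't pass return_tensors='pt' to the tokenizer. The tokenizer always returns a dict-like object, but without return_tensors its values are plain Python lists, and generate() expects tensors.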
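For the OutOfMemoryError case, a rough estimate of weight memory shows why a smaller dtype helps. The helper name, the byte table, and the 7-billion-parameter example are assumptions for illustration; the estimate covers weights only and ignores activations, the KV cache, and quantization overhead.

```python
# Approximate bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "4bit": 0.5}

def weight_memory_gb(num_params, dtype):
    """Estimate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: a hypothetical model with 7 billion parameters.
for dtype in ("float32", "float16", "4bit"):
    print(dtype, weight_memory_gb(7_000_000_000, dtype))
# float32 28.0
# float16 14.0
# 4bit 3.5
```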

Experienced dev note

In early transformers 4.x releases, people placed models on devices manually. In 5.5.x, device_map='auto' (backed by the accelerate package) handles placement for you: it fills the available GPUs first, shards the model across them if needed, and offloads the remaining layers to CPU or disk when the model does not fit. Offloaded layers run much more slowly, so 'auto' ensures the model loads, not that it runs fast. Also pin torch_dtype explicitly so you know which precision the weights are loaded in instead of relying on defaults.

Check your understanding

If you load a model with device_map='auto' and torch_dtype=torch.float16, then call model.generate() without changing anything, will the generated text be exactly the same as if you loaded with torch_dtype=torch.float32? Why or why not?

Answer hint

Not necessarily. float16 has lower precision than float32, so each forward pass rounds differently and produces slightly different logits. With greedy decoding the text only changes when that rounding flips which token scores highest, but once one token differs, everything generated after it diverges too. Short outputs often match; longer ones are more likely to drift apart.
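The rounding effect can be reproduced with the standard library alone. The sketch below rounds two made-up logits to half and single precision using struct's 'e' and 'f' format codes; the logit values are invented to sit closer together than float16 can resolve, and round_to and argmax are hypothetical helper names.

```python
import struct

def round_to(fmt, value):
    """Round a Python float to the precision of a struct format code."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def argmax(values):
    """Index of the largest value; ties go to the earliest index."""
    return max(range(len(values)), key=values.__getitem__)

logits = [10.001, 10.002]  # invented scores for two candidate tokens

as_float32 = [round_to("f", x) for x in logits]
as_float16 = [round_to("e", x) for x in logits]

print(argmax(as_float32))  # 1: float32 still tells the two scores apart
print(as_float16)          # [10.0, 10.0]: both collapse to the same value
print(argmax(as_float16))  # 0: the tie resolves to a different token
```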

VERSION Early transformers 4.x releases did not accept device_map in from_pretrained() (support arrived with the Accelerate integration, around 4.20); you had to use .to(device) manually. In 5.5.x, device_map='auto' is the standard pattern for models that may not fit on one GPU, while .to(device) remains fine for a small model on a single device.
NEXT

Now that you can load and generate text, learn how to control generation quality with temperature, top_k, and top_p sampling instead of greedy decoding.
