Verifying with a pipeline call
Why this matters
Before you write your own code for token IDs, attention masks, and device management, confirm that the model actually loads and produces sensible output. pipeline() handles those details for you, and a basic check takes only three lines: import, create, call.
Explanation
What it is: The pipeline() function is a Hugging Face convenience wrapper that combines tokenization, model loading, and inference into a single call. You pass in raw text and get predictions back, with no tensor manipulation required.
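How it works: When you call pipeline(task_name, model=model_id), Hugging Face loads the tokenizer and the task-appropriate model class from that checkpoint (if you omit model=, it picks a default checkpoint for the task). It then places the model on a device and returns a callable pipeline object. Pass device= to control placement, because the default behavior has varied across transformers versions. When you call that object with a string (or a list of strings), it tokenizes the input, feeds the resulting tensors to the model, and returns post-processed outputs such as label/score dictionaries.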
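A minimal sketch of the list-of-strings form, assuming the same SST-2 checkpoint used in the Code section: the pipeline returns one label/score dictionary per input, in input order. batch_size is a standard argument when calling a pipeline; the value 2 is an arbitrary choice for illustration, and the two review sentences are made up.

```python
from transformers import pipeline

# Same sentiment checkpoint as the main example (downloaded on first use, then cached).
classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

texts = [
    "Great acting and a clever plot.",
    "The sound was awful and the story dragged.",
]

# A list input returns a list of dicts, one per text, in the same order.
# batch_size sets how many texts go through the model per forward pass.
predictions = classifier(texts, batch_size=2)

for text, pred in zip(texts, predictions):
    print(f"{pred['label']:>8}  {pred['score']:.3f}  {text}")
```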
When to use it: Use pipeline() to quickly verify that a checkpoint behaves as its model card advertises, for interactive demos, and for simple inference scripts. Once you need custom tokenization, direct access to logits, or tight control over batching and performance, switch to explicit AutoTokenizer + AutoModel calls.
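For comparison, here is a minimal sketch of the explicit route for this task, written with AutoTokenizer and AutoModelForSequenceClassification. It assumes the same SST-2 checkpoint and converts logits to scores with a softmax, which is what the text-classification pipeline applies for single-label models like this one; other preprocessing defaults inside pipeline() may differ slightly.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and the task-specific model class explicitly.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.eval()  # inference mode: disables dropout

texts = ["This movie is absolutely fantastic!", "I hated every second of it."]

# Tokenize as one padded batch; the attention mask marks real tokens vs padding.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**batch).logits  # shape: (batch_size, num_labels)

# Softmax turns logits into probabilities; max picks the top class and its score.
probs = torch.softmax(logits, dim=-1)
scores, label_ids = probs.max(dim=-1)

results = [
    {"label": model.config.id2label[i.item()], "score": s.item()}
    for i, s in zip(label_ids, scores)
]
print(results)
```

The extra control is visible here: you can inspect the logits, change the padding or truncation strategy, or batch inputs however you like, at the cost of managing devices and post-processing yourself.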
Analogy
Think of pipeline() as a pre-packaged meal: everything you need is assembled and ready. Using an explicit tokenizer and model is like buying the ingredients and cooking yourself: more control, but more work.
Code
```python
import torch
from transformers import pipeline

# Sentiment checkpoint fine-tuned on SST-2 (POSITIVE / NEGATIVE labels)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Build the pipeline once: this downloads (or loads from cache) the tokenizer and model.
classifier = pipeline(
    task="text-classification",
    model=model_name,
    device=0 if torch.cuda.is_available() else -1,  # 0 = first GPU, -1 = CPU
)

# Reuse the same pipeline object for every input.
for text in ["This movie is absolutely fantastic!", "I hated every second of it."]:
    result = classifier(text)
    print(f"Input: '{text}'")
    print(f"Prediction: {result}\n")
```

Output:

```text
Input: 'This movie is absolutely fantastic!'
Prediction: [{'label': 'POSITIVE', 'score': 0.9998761415481567}]

Input: 'I hated every second of it.'
Prediction: [{'label': 'NEGATIVE', 'score': 0.9997040033340454}]
```

What just happened?
You imported the pipeline function, chose a sentiment classification model from the Hugging Face Hub, and created a pipeline object that downloaded the tokenizer and model (or loaded them from the local cache on later runs). You then called it twice with different sentences. Each call tokenized the text, converted the tokens to tensors, ran the model, and returned a list containing a label/score dictionary. The device argument told the pipeline to use the first GPU if one is available (device=0) and the CPU otherwise (device=-1).
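To see the pieces described above, you can inspect the classifier object directly. This sketch assumes the same checkpoint and device logic as the Code section. tokenizer, model, and device are attributes of pipeline objects; the tokenizer call below is essentially the pipeline's first internal step, minus the conversion to PyTorch tensors.

```python
import torch
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1,
)

# The pipeline object exposes the pieces it wired together.
print(classifier.device)                 # cuda:0 if a GPU was found, otherwise cpu
print(classifier.model.config.id2label)  # class index -> label string

# Tokenization step: text -> input IDs plus an attention mask.
encoded = classifier.tokenizer("I hated every second of it.")
print(list(encoded.keys()))              # DistilBERT returns input_ids and attention_mask
tokens = classifier.tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)                            # begins with [CLS] and ends with [SEP]
```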
Common gotcha
Developers often call pipeline() inside a loop, assuming each call is cheap. It is not: every pipeline() call loads the full model into memory, and the first one also downloads it from the Hub. Create the pipeline object once and reuse it. Also, depending on your transformers version, pipeline() may default to the CPU even when a GPU is available, so pass device=0 explicitly to use the first GPU.
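One way to make the create-once rule hard to break is to hide construction behind a cached getter. This is a minimal sketch using functools.lru_cache; get_classifier is a hypothetical helper name, the review strings are made up, and a plain module-level variable would work just as well.

```python
from functools import lru_cache

import torch
from transformers import pipeline


@lru_cache(maxsize=None)
def get_classifier(model_name: str = "distilbert-base-uncased-finetuned-sst-2-english"):
    """Hypothetical helper: build the pipeline on first use, then return the same object."""
    return pipeline(
        task="text-classification",
        model=model_name,
        device=0 if torch.cuda.is_available() else -1,  # explicit device, no reliance on defaults
    )


reviews = ["Loved it.", "Terrible pacing.", "Solid, if a bit long."]

# Anti-pattern: calling pipeline(...) inside this loop would reload the model every time.
# Here every iteration reuses the single cached pipeline object.
results = [get_classifier()(review)[0] for review in reviews]
for review, pred in zip(reviews, results):
    print(f"{pred['label']:>8}  {review}")
```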
Error recovery
OSError: Can't load ...
RuntimeError: CUDA out of memory
TypeError: pipeline() got an unexpected keyword argument 'device'
Experienced dev note
In transformers 5.5.x, always pass device=0 or device_map='auto' explicitly rather than relying on defaults. pipeline() also honors the model's torch dtype, so if you need bfloat16 inference, set the model's dtype before wrapping it in a pipeline. Finally, pipeline objects are not serializable with pickle: on distributed workers, create the pipeline on each worker rather than once globally.
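A minimal sketch of setting the dtype before wrapping, assuming a recent transformers release whose text-classification pipeline accepts bfloat16 logits (older releases may fail when post-processing them) and a PyTorch build with bfloat16 support on your hardware. Casting with .to(torch.bfloat16) is one way to set the dtype; from_pretrained can also load weights directly in a chosen dtype.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the model, then cast its weights to bfloat16 before wrapping it.
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pass the already-configured objects; the pipeline keeps the model's dtype.
classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,  # explicit device, per the note above
)

print(classifier.model.dtype)  # torch.bfloat16
result = classifier("This movie is absolutely fantastic!")
print(result)
```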
Check your understanding
Why does calling the classifier a second time with different text not trigger a re-download of the model, and where is the model kept between calls?
Answer hint
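The key insight is caching. The model is downloaded once, when the pipeline is created (not when it is first called), and its files are stored in the Hugging Face cache ($HF_HOME/hub, typically ~/.cache/huggingface/hub/). Between calls, the loaded model stays in memory inside the classifier object, so a second call costs only inference; creating a new pipeline later reads the files from the disk cache instead of the network. This is why the code example creates one classifier object and calls it twice.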
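To check where the cached files actually live, here is a sketch using the huggingface_hub package, which transformers installs as a dependency. It assumes a recent huggingface_hub release that provides the HF_HUB_CACHE constant and scan_cache_dir(); the repos listed depend on what you have downloaded.

```python
import os

from huggingface_hub import scan_cache_dir
from huggingface_hub.constants import HF_HUB_CACHE

# Resolved from the HF_HUB_CACHE / HF_HOME environment variables;
# typically ~/.cache/huggingface/hub when neither is set.
print("Hub cache directory:", HF_HUB_CACHE)

cached_repo_ids = []
if os.path.isdir(HF_HUB_CACHE):
    # Walk the cache and report each downloaded repo with its size on disk.
    for repo in sorted(scan_cache_dir().repos, key=lambda r: r.repo_id):
        cached_repo_ids.append(repo.repo_id)
        print(f"{repo.repo_type:8s} {repo.repo_id:60s} {repo.size_on_disk / 1e6:8.1f} MB")
else:
    print("Cache directory does not exist yet: nothing has been downloaded.")
```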