pipeline("question-answering")
Why this matters
QA pipelines are production-critical for building chatbots, document search systems, and knowledge retrieval tools. The pipeline API is the fastest path from zero to working QA without training a model, and understanding its internals helps you debug when answers are wrong.
Explanation
What it is: pipeline("question-answering") is a high-level wrapper that combines tokenization, the model's forward pass, and answer span extraction into a single function call. It takes a question and a context, and returns a dictionary with the extracted answer text, its character positions in the context, and a confidence score.
How it works mechanically: The pipeline loads a pre-trained QA model (in this section deepset/roberta-base-squad2, passed explicitly through the model argument), tokenizes the question and context together while keeping track of each token's character offsets, runs inference to score every token as a possible start or end of the answer span, then maps the best-scoring token span back to the original context string to extract the actual text.
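The span-selection and offset-mapping steps described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the tokens, offsets, and logits are made-up toy values, and best_span is a hypothetical helper. The real pipeline does more, such as handling contexts longer than the model's input window.

```python
# Toy illustration only: offsets and logits are invented, not model output.
context = "It takes approximately 365.25 days."

# (start_char, end_char) for each context token, similar in spirit to a
# tokenizer's offset mapping. Tokens: It, takes, approximately, 365.25, days, .
offsets = [(0, 2), (3, 8), (9, 22), (23, 29), (30, 34), (34, 35)]

# One score per token for "answer starts here" and "answer ends here".
start_logits = [0.1, 0.2, 4.0, 1.5, 0.3, 0.0]
end_logits = [0.0, 0.1, 0.5, 1.0, 4.2, 0.2]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start_token, end_token) maximizing start + end score,
    subject to start <= end and a maximum span length (hypothetical helper)."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best[1], best[2]


start_tok, end_tok = best_span(start_logits, end_logits)

# Map token positions back to character positions in the original string.
start_char = offsets[start_tok][0]
end_char = offsets[end_tok][1]
answer = context[start_char:end_char]

print(start_tok, end_tok, start_char, end_char, answer)
```

The key point is the last step: the answer text is sliced from the original context using character offsets, which is why the pipeline reports character positions rather than token positions.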
When to use it: Use this for rapid prototyping, single-document QA, or when you need a working system in minutes. If you're processing large batches or need custom logic (reranking, filtering), batch the pipeline calls or drop to the model layer directly.
Analogy
It's like asking a librarian to find an answer in a book: you give them the question and the book text, and they return the exact passage that contains the answer and how confident they are, without you needing to understand how they search.
Code
import torch
from transformers import pipeline

# Build the QA pipeline; use the first GPU if one is available, otherwise the CPU (-1).
qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0 if torch.cuda.is_available() else -1,
)

# The context is the text the answer will be extracted from.
context = """The Earth orbits the Sun. It takes approximately 365.25 days
for Earth to complete one full orbit. This period is called a year."""
question = "How long does it take Earth to orbit the Sun?"

# The result is a dict with the keys 'score', 'start', 'end', and 'answer'.
result = qa_pipeline(question=question, context=context)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
print(f"Character span: {result['start']} to {result['end']}")
print("\nFull result dictionary:")
print(result)
Output (illustrative; the exact score depends on the model and library versions):
Answer: approximately 365.25 days
Confidence: 0.9876
Character span: 35 to 60

Full result dictionary:
{'score': 0.9876432418823242, 'start': 35, 'end': 60, 'answer': 'approximately 365.25 days'}
What just happened?
The pipeline loaded a RoBERTa model fine-tuned on SQuAD 2.0, tokenized the question and context together as a single sequence with special separators, ran the model to predict token positions [start_idx, end_idx] where the answer begins and ends, then mapped those token positions back to character indices in the original context string and extracted the substring.
Common gotcha
The start and end values are character indices into the context string, not token indices. Developers often try to use them to index tokenized output and get misaligned spans. Also, by default the pipeline returns its best-scoring span even when the context contains no valid answer, so check the score: a low value (below roughly 0.5, as a rule of thumb) often means no real answer was found.
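A quick way to see the difference between character and token indices is to slice the context directly. The sketch below uses a hand-written result dictionary as a stand-in for pipeline output (the score is invented), and whitespace splitting as a rough stand-in for a real subword tokenizer.

```python
context = (
    "The Earth orbits the Sun. It takes approximately 365.25 days\n"
    "for Earth to complete one full orbit. This period is called a year."
)

# Hand-written stand-in for a pipeline result; a real score would differ.
result = {
    "score": 0.98,
    "start": 35,
    "end": 60,
    "answer": "approximately 365.25 days",
}

# Character indices slice the original string directly.
sliced = context[result["start"]:result["end"]]
print(sliced == result["answer"])

# Word positions of the same answer are completely different numbers.
# str.split() is only a rough stand-in for a real tokenizer here.
words = context.split()
token_start = words.index("approximately")
token_end = token_start + len(result["answer"].split())
print((result["start"], result["end"]), "vs", (token_start, token_end))
```

If you need token positions, derive them from the tokenizer's offset information rather than reusing start and end.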
Error recovery
RuntimeError: CUDA out of memory
ValueError: Tokenizer does not have a pad_token
TypeError: pipeline() got an unexpected keyword argument 'device'
AssertionError: Tokens must correspond to the context
Experienced dev note
The pipeline is convenient but opaque: if answers are wrong, you can't easily debug without dropping to the model and tokenizer layer. For production systems, always log the score and implement a threshold (e.g., reject answers with score < 0.3). Also, if you have more than about 10 questions at once, pass them as a single list, qa_pipeline([{"question": q, "context": c} for q, c in pairs]), instead of looping, optionally with a batch_size argument. This can give a 3-5x speedup, though the actual gain depends on hardware, batch size, and sequence lengths, so measure it on your own workload. Finally, deepset/roberta-base-squad2 is a general-purpose checkpoint: if your domain is specialized (medical, legal, code), consider fine-tuning or using a domain-specific checkpoint from the Hugging Face Hub.
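The thresholding and batching advice above might look like the following minimal sketch. The names answer_or_fallback, build_batch, and fake_qa_pipeline are hypothetical; fake_qa_pipeline returns canned results so the example runs without downloading a model, and in real code you would call qa_pipeline on the same list instead.

```python
FALLBACK = "I couldn't find a clear answer."


def answer_or_fallback(result, threshold=0.3):
    """Return the answer text, or a fallback when the score is below threshold."""
    if result["score"] < threshold:
        return FALLBACK
    return result["answer"]


def build_batch(pairs):
    """Turn (question, context) pairs into the list-of-dicts input format."""
    return [{"question": q, "context": c} for q, c in pairs]


def fake_qa_pipeline(inputs):
    """Hypothetical stand-in for qa_pipeline: returns canned, invented results."""
    canned = [
        {"score": 0.91, "start": 12, "end": 23, "answer": "365.25 days"},
        {"score": 0.08, "start": 0, "end": 7, "answer": "Bananas"},
    ]
    return canned[: len(inputs)]


pairs = [
    ("How long is a year?", "Earth takes 365.25 days to orbit the Sun."),
    ("Who wrote Hamlet?", "Bananas are rich in potassium."),
]

batch = build_batch(pairs)
results = fake_qa_pipeline(batch)  # real code: qa_pipeline(batch)

for item, result in zip(batch, results):
    # Log the score alongside the decision so thresholds can be tuned later.
    print(f"{item['question']!r}: score={result['score']:.2f}")

replies = [answer_or_fallback(r) for r in results]
print(replies)
```

The 0.3 threshold is only the example value from the note above; the right cutoff depends on your data and on how costly a wrong answer is.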
Check your understanding
If your model returns score=0.42 for a question and you're using this in a chatbot, should you show that answer to the user, and why? What would you check beyond just the score value?
Answer hint
A correct answer covers: (1) recognizing that 0.42 is borderline and there's no universally correct threshold (context-dependent), (2) understanding that you should also verify the answer makes semantic sense or validate against a gold standard, and (3) knowing that in production you'd implement a fallback response (e.g., 'I couldn't find a clear answer') when confidence is too low.