Extractive vs generative QA
Why this matters
Most developers default to generative QA because it feels more powerful, but extractive QA is faster, more controllable, and safer in production when you have a fixed document set. Understanding the tradeoff keeps you from building the wrong solution.
Explanation
Extractive QA treats question answering as a span-selection problem: given a context document and a question, the model finds the start and end positions of the answer within that context. Generative QA treats it as a sequence-to-sequence problem: the model generates the answer token by token, using the context as conditioning input without being constrained to it.
Mechanically, extractive QA uses encoder models like BERT fine-tuned on SQuAD, with two classification heads: one scores each token as the answer start, the other as the answer end. Generative QA uses encoder-decoder models (like BART or T5) or decoder-only LLMs that output free-form text. Extractive inference is a single forward pass and typically runs in milliseconds; generative inference decodes one token at a time, which is slower, and it can hallucinate. Extractive answers are always exact substrings of the context; generative answers may paraphrase, summarize, or confabulate.
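The two-head mechanism can be sketched without loading a model. The following is a minimal illustration, assuming the start and end heads have already produced one logit per token; the function name `best_span`, the `max_answer_len` limit, and the logit values are made up for the example and do not come from a real model.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens. The score is start logit + end logit.
    """
    best = (0, 0, float("-inf"))
    for s in range(len(start_logits)):
        # Only consider end positions at or after the start, within the limit.
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical logits, one per token (illustrative values only).
tokens = ["Paris", "is", "the", "capital", "of", "France", "."]
start_logits = [6.0, 0.1, 0.2, 0.5, 0.0, 1.5, -1.0]
end_logits = [5.5, 0.0, 0.1, 0.3, 0.0, 2.0, -0.5]

s, e, score = best_span(start_logits, end_logits)
answer_tokens = tokens[s : e + 1]
print(answer_tokens, score)
```

Because the answer is just a slice of the input tokens, the model has no way to emit text that is absent from the context. Production implementations usually keep the top few start and end candidates instead of scanning every pair, but the selection rule is the same idea.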
Use extractive when: answers exist verbatim in your documents, you need speed, or you need provenance (users trust answers more when they can see the source). Use generative when: you need reasoning, multi-document synthesis, or answers that require rephrasing.
Analogy
Extractive QA is like a Find dialog in a document: fast, exact, but limited to what's already there. Generative QA is like asking a colleague to explain it: flexible and conversational, but they might get creative or misremember.
Code
import torch
from transformers import pipeline
# Extractive QA using transformers 5.5.x
context = "Paris is the capital of France. It is located in northern France and is known for the Eiffel Tower."
question = "What is the capital of France?"
# Initialize extractive QA pipeline with pinned model
extraction_pipe = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, else CPU
)
result_extractive = extraction_pipe(question=question, context=context)
print("Extractive Result:")
print(f"Answer: {result_extractive['answer']}")
print(f"Score: {result_extractive['score']:.4f}")
print(f"Start: {result_extractive['start']}, End: {result_extractive['end']}")
print()
# Generative QA using a sequence-to-sequence model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
gen_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small",
    device_map="auto",  # requires the accelerate package to be installed
    torch_dtype=torch.float32,
)
# Format input for generative QA
gen_input = f"question: {question} context: {context}"
inputs = gen_tokenizer(gen_input, return_tensors="pt", max_length=512, truncation=True)
# Move to same device as model
device = next(gen_model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}
output_ids = gen_model.generate(**inputs, max_length=50)
result_generative = gen_tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Generative Result:")
print(f"Answer: {result_generative}")
print()
# Comparison: same QA, different approaches
print("=" * 50)
print("COMPARISON:")
print(f"Extractive found: '{result_extractive['answer']}' (verbatim from context)")
print(f"Generative produced: '{result_generative}' (model-generated)")
print(f"Extractive confidence: {result_extractive['score']:.2%}")
print(f"Extractive can pinpoint location: [{result_extractive['start']}:{result_extractive['end']}]")
Output
Extractive Result:
Answer: Paris
Score: 0.9987
Start: 0, End: 5

Generative Result:
Answer: Paris is the capital of France.

==================================================
COMPARISON:
Extractive found: 'Paris' (verbatim from context)
Generative produced: 'Paris is the capital of France.' (model-generated)
Extractive confidence: 99.87%
Extractive can pinpoint location: [0:5]
What just happened?
The code loaded two different QA systems. The extractive system (RoBERTa fine-tuned on SQuAD 2.0) scored every token in the context as a possible answer start and end, picked the span covering 'Paris', and returned its character offsets with a confidence score. The generative system (FLAN-T5) took the question and context as a prompt and ran the decoder autoregressively to produce 'Paris is the capital of France.', a complete sentence the model constructed rather than found. The extractive answer is guaranteed to be a substring of the context; the generative answer may reformat or synthesize.
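The substring guarantee is easy to check by hand. The sketch below uses a hand-written dict as a stand-in for the pipeline result, with the values copied from the printed output above, so it runs without downloading a model; `span_matches_context` is a hypothetical helper name.

```python
context = (
    "Paris is the capital of France. It is located in northern France "
    "and is known for the Eiffel Tower."
)

# Stand-in for the pipeline result, copied from the printed output above.
result = {"answer": "Paris", "score": 0.9987, "start": 0, "end": 5}


def span_matches_context(result, context):
    """True when the reported character offsets slice out exactly the reported answer."""
    return context[result["start"] : result["end"]] == result["answer"]


print(span_matches_context(result, context))
```

A check like this is cheap enough to run on every response, and the offsets can be used to highlight the source passage for the user.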
Common gotcha
Developers assume generative QA is 'better' because it sounds more human. In reality, generative QA can confidently hallucinate an answer that sounds plausible but doesn't exist in your documents. An extractive model trained with unanswerable questions (such as SQuAD 2.0) can return an empty or low-confidence result when the answer isn't there, which is the honest outcome. For regulated use cases (finance, legal, medical), extractive QA with source attribution is safer. For open-domain QA, generative is often necessary but requires guardrails.
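Two simple guardrails follow from this, sketched below under stated assumptions: a score threshold for extractive results and a verbatim check for generated answers. The helper names `accept_extractive` and `is_verbatim` and the `min_score=0.5` cutoff are illustrative choices, not library APIs or recommended values; a real threshold would need tuning on your own data.

```python
def accept_extractive(result, min_score=0.5):
    """Return the answer, or None when the span is empty or scores below min_score.

    min_score is an arbitrary illustrative cutoff, not a recommended value.
    """
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return None
    return answer


def is_verbatim(answer, context):
    """Case-insensitive, whitespace-normalized check that answer appears in context."""
    def norm(text):
        return " ".join(text.lower().split())

    return bool(answer.strip()) and norm(answer) in norm(context)


context = "Paris is the capital of France. It is known for the Eiffel Tower."

print(accept_extractive({"answer": "Paris", "score": 0.9987}))
print(accept_extractive({"answer": "", "score": 0.91}))
print(is_verbatim("Paris is the capital of France.", context))
print(is_verbatim("Paris has two million residents.", context))
```

The verbatim check is deliberately crude: it also rejects correct paraphrases, so it suits settings where traceability matters more than fluency.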
Error recovery
ValueError: Token indices sequence length is longer than the maximum (512)
RuntimeError: Expected all tensors to be on the same device
AttributeError: 'NoneType' object has no attribute 'to'
Experienced dev note
In transformers 4.x, you could get away with pipeline('question-answering') without pinning a model, and it would download a default. In 5.5.x, this is discouraged: always pass an explicit model name. More importantly, measure extractive QA on your actual documents first (it can be 50-100x faster, depending on model size and hardware). Many teams jump to generative because it 'feels' smarter, spend months on prompt engineering, then realize extractive would have solved it in a week. Extractive answers are also easier to debug: you can log and version the exact span offsets, whereas generative answers are opaque token sequences.
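Measuring on your own documents can start with a small timing harness like the sketch below. It assumes each QA system is wrapped in a callable that takes `(question, context)`; `median_latency_ms` and `dummy_answer_fn` are hypothetical names, and the dummy function exists only so the sketch runs without a model.

```python
import statistics
import time


def median_latency_ms(answer_fn, examples, repeats=5):
    """Median wall-clock time per call to answer_fn, in milliseconds."""
    timings = []
    for _ in range(repeats):
        for question, context in examples:
            t0 = time.perf_counter()
            answer_fn(question, context)
            timings.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(timings)


def dummy_answer_fn(question, context):
    # Placeholder so the sketch runs without a model; swap in a real QA call.
    return context.split(".")[0]


examples = [
    ("What is the capital of France?", "Paris is the capital of France."),
]

latency = median_latency_ms(dummy_answer_fn, examples)
print(f"median latency: {latency:.3f} ms")
```

Running the same harness over both an extractive and a generative wrapper, on a sample of real questions, gives a like-for-like latency comparison; the first call to a real model should be excluded or warmed up separately, since it often includes one-time setup cost.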
Check your understanding
You have a customer-support QA system where answers must always be traceable to your company's knowledge base (for compliance). You notice generative QA sometimes returns answers that sound right but don't appear in your documents. Why is this happening, and which approach should you switch to?
Answer hint
A correct answer explains that generative models produce tokens autoregressively and can hallucinate plausible text that is not grounded in the supplied context. Extractive QA is required here because it constrains answers to the knowledge base by design: it can only select spans that exist, which makes it auditable. The key insight is that 'making sense' and 'being grounded' are different properties.