Code Intermediate medium · 7 min

Extractive vs generative QA

What you will learn

Extractive QA finds answers within the context, while generative QA creates answers from scratch. Choose based on whether your answer must be verbatim from the source.

Why this matters

Most developers default to generative QA (it feels more powerful), but extractive QA is faster, more controllable, and safer in production when you have a fixed document set. Understanding the tradeoff keeps you from building the wrong solution.

Skip if: Skip extractive QA if you need to synthesize information across multiple documents, reason about implicit knowledge, or reformat answers. Skip generative QA if you must guarantee that answers come from your source (legal/compliance) or need sub-100ms latency.

Explanation

Extractive QA treats question answering as a span-selection problem: given a context document and a question, the model finds the exact start and end positions of the answer within that context. Generative QA treats it as a sequence-to-sequence problem: the model generates the answer token by token, using the context as conditioning input without being constrained to it.

Mechanically, extractive QA uses models like BERT fine-tuned on SQuAD, with two classification heads: one scores each token as the start of the answer, the other scores each token as the end. Generative QA uses encoder-decoder models (like BART or T5) or decoder-only LLMs that output free-form text. Extractive inference is a single forward pass and typically runs in milliseconds; generative inference decodes one token at a time and can hallucinate. Extractive answers are always exact substrings of the context; generative answers may paraphrase, summarize, or confabulate.
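The span-selection step described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `best_span`, the `max_answer_len` cap, and the logit values are all assumptions made up for the example. It picks the (start, end) pair with the highest combined score, subject to start <= end.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 15,
) -> Tuple[int, int, float]:
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens. The score is the sum of the two logits.
    """
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Hypothetical logits for the tokens below (made-up numbers, not model output).
tokens = ["Paris", "is", "the", "capital", "of", "France", "."]
start_logits = [6.0, 0.1, 0.2, 0.5, 0.1, 2.0, 0.0]
end_logits = [5.5, 0.2, 0.1, 0.4, 0.1, 2.5, 0.3]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start : end + 1])
print(answer, score)  # Paris 11.5
```

Because the answer is assembled from token positions in the context, it cannot contain text that is absent from the context. That is the property the rest of this lesson relies on.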

Use extractive when: answers exist verbatim in your documents, you need speed, or you need provenance (users trust answers more when they can see the source). Use generative when: you need reasoning, multi-document synthesis, or answers that require rephrasing.

Analogy

Extractive QA is like a Find dialog in a document: fast, exact, but limited to what's already there. Generative QA is like asking a colleague to explain it: flexible and conversational, but they might get creative or misremember.

Code

python

import torch
from transformers import pipeline

# Extractive QA using transformers 5.5.x
context = "Paris is the capital of France. It is located in northern France and is known for the Eiffel Tower."
question = "What is the capital of France?"

# Initialize the extractive QA pipeline with an explicitly named model
extraction_pipe = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0 if torch.cuda.is_available() else -1
)

result_extractive = extraction_pipe(question=question, context=context)
print("Extractive Result:")
print(f"Answer: {result_extractive['answer']}")
print(f"Score: {result_extractive['score']:.4f}")
print(f"Start: {result_extractive['start']}, End: {result_extractive['end']}")
print()

# Generative QA using a sequence-to-sequence model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

gen_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small",
    device_map="auto",
    torch_dtype=torch.float32
)

# Format input for generative QA
gen_input = f"question: {question} context: {context}"
inputs = gen_tokenizer(gen_input, return_tensors="pt", max_length=512, truncation=True)

# Move to same device as model
device = next(gen_model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

# max_new_tokens bounds only the generated answer length
output_ids = gen_model.generate(**inputs, max_new_tokens=50)
result_generative = gen_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("Generative Result:")
print(f"Answer: {result_generative}")
print()

# Comparison: same QA, different approaches
print("=" * 50)
print("COMPARISON:")
print(f"Extractive found: '{result_extractive['answer']}' (verbatim from context)")
print(f"Generative produced: '{result_generative}' (model-generated)")
print(f"Extractive confidence: {result_extractive['score']:.2%}")
print(f"Extractive can pinpoint location: [{result_extractive['start']}:{result_extractive['end']}]")

Output

Extractive Result:
Answer: Paris
Score: 0.9987
Start: 0, End: 5

Generative Result:
Answer: Paris is the capital of France.

==================================================
COMPARISON:
Extractive found: 'Paris' (verbatim from context)
Generative produced: 'Paris is the capital of France.' (model-generated)
Extractive confidence: 99.87%
Extractive can pinpoint location: [0:5]

What just happened?

The code loaded two different QA systems. The extractive system (RoBERTa fine-tuned on SQuAD 2.0) scored every token as a possible start or end of the answer, picked the span for 'Paris', and returned its character offsets in the context along with a confidence score. The generative system (FLAN-T5) took the question and context as a prompt and ran the decoder autoregressively to produce 'Paris is the capital of France.', a complete sentence the model constructed rather than found. The extractive answer is guaranteed to be a substring of the context; the generative answer may reformat or synthesize.
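The substring guarantee is easy to check by hand. The sketch below reuses the context string and the start/end values from the output shown earlier; the `highlight` helper is a hypothetical name introduced only for illustration.

```python
context = (
    "Paris is the capital of France. It is located in northern France "
    "and is known for the Eiffel Tower."
)

# Values copied from the extractive output shown above.
result = {"answer": "Paris", "start": 0, "end": 5}

# Slicing the context with the returned offsets reproduces the answer.
span = context[result["start"] : result["end"]]
print(span == result["answer"])  # True


def highlight(text: str, start: int, end: int) -> str:
    """Wrap the answer span in brackets so a reader can see its source."""
    return text[:start] + "[" + text[start:end] + "]" + text[end:]


print(highlight(context, result["start"], result["end"]))
```

Offsets like these are what make source attribution cheap: a UI can show the surrounding sentence with the answer marked, with no extra model call.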

Common gotcha

Developers assume generative QA is 'better' because it sounds more human. In reality, generative QA can confidently hallucinate an answer that sounds plausible but doesn't exist in your documents. Extractive QA can return an empty or low-confidence result if the answer isn't there, which is the honest outcome. For regulated use cases (finance, legal, medical), extractive QA with source attribution is safer. For open-domain QA, generative is necessary but requires guardrails.
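Two simple guardrails follow from this, sketched below under stated assumptions. The function names, the 0.5 threshold, and the whitespace/case normalization are illustrative choices, not part of any library. The first rejects weak extractive results; the second checks whether a generated answer appears word for word in the context, which is a crude test that will also reject legitimate paraphrases.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def accept_extractive(result: dict, min_score: float = 0.5) -> Optional[str]:
    """Return the answer, or None when it is empty or scores below min_score."""
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return None
    return answer


def is_verbatim(answer: str, context: str) -> bool:
    """True only if the normalized answer appears word for word in the context."""
    needle = normalize(answer)
    return bool(needle) and needle in normalize(context)


context = "Paris is the capital of France."

print(accept_extractive({"answer": "Paris", "score": 0.9987}))  # Paris
print(accept_extractive({"answer": "", "score": 0.02}))         # None
print(is_verbatim("Paris is the capital of France.", context))  # True
print(is_verbatim("Lyon is the capital of France.", context))   # False
```

The right threshold depends on your model and data, so tune it on held-out questions, including some whose answers are deliberately missing from the documents.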

Error recovery

ValueError: Token indices sequence length is longer than the maximum (512)

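Your context is longer than the model's max_position_embeddings. The tokenizer typically reports this as a warning; the failure itself comes later, as an indexing error inside the model. Truncate the input by passing truncation=True together with max_length to tokenizer(), split the context into overlapping chunks, or use a longer-context model such as Longformer.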
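Truncation silently drops text, so the answer may be cut away. A common alternative is to split the context into overlapping windows and run QA on each. The sketch below shows the idea on a list of words; the function name and parameters are assumptions for illustration, and a real system would window over tokens rather than words (Hugging Face tokenizers expose a similar idea through `stride` combined with `return_overflowing_tokens=True`).

```python
from typing import List


def sliding_windows(words: List[str], window: int, overlap: int) -> List[List[str]]:
    """Split words into windows of up to `window` items sharing `overlap` items."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(words[start : start + window])
        # Stop once the current window reaches the end of the input.
        if start + window >= len(words):
            break
        start += step
    return chunks


words = "a b c d e f g h i j".split()
for chunk in sliding_windows(words, window=4, overlap=2):
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['c', 'd', 'e', 'f']
# ['e', 'f', 'g', 'h']
# ['g', 'h', 'i', 'j']
```

The overlap matters because an answer that straddles a window boundary would otherwise be split in two and missed. After running QA on every window, keep the highest-scoring span.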

RuntimeError: Expected all tensors to be on the same device

Your inputs are on the GPU but the model is on the CPU (or vice versa). Move the inputs to the model's device with .to(device), as in the code above. pipeline() handles this automatically, but manually loaded models require it.

AttributeError: 'NoneType' object has no attribute 'to'

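A variable you expected to hold a model or tensor is None. A typo in the model name or a network issue usually surfaces earlier, as an OSError raised by from_pretrained(), so also look for an assignment from a call that returned nothing. Verify that the model exists on the Hugging Face Hub, and use device_map='auto' for transformers 5.5.x to auto-place on available hardware.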

Experienced dev note

In transformers 4.x, you could get away with pipeline('question-answering') without pinning a model, and it would download a default. In 5.5.x this is discouraged, so always pass an explicit model name. More importantly, measure extractive QA on your actual documents first (it can be 50-100x faster, depending on the models and hardware you compare). Many teams jump to generative because it 'feels' smarter, spend months on prompt engineering, then realize extractive would have solved the problem in a week. Extractive also lets you record exactly which span produced each answer (useful for debugging and regression tests); generative answers are black-box token sequences.
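Measuring on your own documents needs a metric. A minimal sketch of SQuAD-style exact match and token-level F1 follows; the normalization (lowercasing, stripping punctuation and English articles) mirrors the common convention but is written from scratch here, so treat the details as assumptions rather than the official scoring script.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Paris"
for prediction in ["Paris", "Paris is the capital of France."]:
    em = exact_match(prediction, reference)
    f1 = token_f1(prediction, reference)
    print(f"{prediction!r}: EM={em:.2f} F1={f1:.2f}")
# 'Paris': EM=1.00 F1=1.00
# 'Paris is the capital of France.': EM=0.00 F1=0.33
```

Note how the full-sentence answer is correct yet scores poorly against a short reference. When comparing the two approaches, decide up front whether your references are spans or sentences, or the metric will favor one system for stylistic reasons.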

Check your understanding

You have a customer-support QA system where answers must always be traceable to your company's knowledge base (for compliance). You notice generative QA sometimes returns answers that sound right but don't appear in your documents. Why is this happening, and which approach should you switch to?

Answer hint

A correct answer explains that generative models construct tokens autoregressively and can hallucinate plausible text outside the training/context distribution. Extractive QA is required here because it constrains answers to the knowledge base by design: it can only select spans that exist, making it auditable. The key insight is that 'making sense' and 'being grounded' are different properties.

VERSION transformers 5.5.x deprecates passing model=None to pipeline() (the 4.x default-download behavior). Always specify model name explicitly. Additionally, device_map='auto' is now the standard for direct model loading (AutoModel), replacing the older device placement patterns. The pipeline() function auto-handles device placement but explicit models require it.

Next, explore how to fine-tune an extractive QA model on your own domain dataset using the Trainer API and custom QA datasets.
