Fine-tuning for structured extraction
Why this matters
Production systems need predictable structured outputs: extracting fields from documents, parsing API responses, or converting unstructured text into database records. Fine-tuning a model for this task reduces the need for elaborate prompt engineering and brittle, error-prone parsing logic, while constrained decoding ensures the output conforms to your schema.
Explanation
What it is: Fine-tuning an LLM to output structured data (JSON, XML, YAML) by training on examples where the model learns both the task logic and the format constraints. Unlike few-shot prompting, the model internalizes schema patterns, reducing hallucination and improving consistency.
How it works mechanically: You prepare a dataset where each example is an input (unstructured text) paired with a target output in your desired schema. During training, the model learns to map inputs to structured outputs via next-token prediction. At inference time, you layer on constrained decoding (e.g., via `outlines` or vLLM's guided decoding), which masks out any token that would violate your schema at each generation step, preventing malformed JSON or missing fields.
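To make the constrained-decoding step concrete, here is a toy sketch in plain Python. It is not how `outlines` or vLLM are implemented: the `steps` lists stand in for a model's ranked token preferences, and `allowed_next` is a hypothetical hand-written grammar for a one-field object. It only illustrates the core idea, which is that at each step the tokens the grammar forbids are masked out and the best remaining token is chosen.

```python
import json

# Toy illustration only; every name below is hypothetical.
DIGITS = set("0123456789")


def allowed_next(prefix: list[str]) -> set[str]:
    """Tokens the toy grammar permits after `prefix`.

    The only valid outputs have the shape {"amount":<digits>}.
    """
    if not prefix:
        return {"{"}
    if prefix == ["{"]:
        return {'"amount"'}
    if prefix == ["{", '"amount"']:
        return {":"}
    if prefix == ["{", '"amount"', ":"]:
        return set(DIGITS)            # at least one digit is required
    if prefix[-1] in DIGITS:
        return DIGITS | {"}"}         # more digits, or close the object
    return set()                      # after "}" the output is complete


def constrained_decode(ranked_candidates: list[list[str]]) -> str:
    """At each step, keep the highest-ranked candidate the grammar allows."""
    output: list[str] = []
    for candidates in ranked_candidates:
        allowed = allowed_next(output)
        if not allowed:
            break                     # grammar says we are done
        # Mask: drop every candidate the grammar forbids.
        valid = [tok for tok in candidates if tok in allowed]
        if not valid:
            raise ValueError(f"no allowed token among {candidates}")
        output.append(valid[0])
    return "".join(output)


# Each inner list is one decoding step, most likely token first.
# The "model" sometimes prefers an invalid token ('"total"', "$"), which is masked.
steps = [
    ["{", "["],
    ['"total"', '"amount"'],
    [":", ","],
    ["$", "1"],
    ["5", "}"],
    ["0", "}"],
    ["0", "}"],
    ["}", "0"],
]
result = constrained_decode(steps)
parsed = json.loads(result)
print(result)
```

Real libraries apply the same masking to the model's logits over its full vocabulary, and derive the grammar from a schema rather than from hand-written rules.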
When to use it: Use this when you have sufficient labeled data (roughly 500 or more examples), your schema is stable, and you need both reliability (schema validity) and efficiency (no post-generation repair or retry logic).
Analogy
It's like training a mail carrier to sort letters not just by learning postal zones, but by physically requiring them to place each letter in a labeled bin: the constrained decoding is the physical bin that prevents them from putting mail in the wrong slot, no matter how distracted they get.
Code
import json
from dataclasses import dataclass
from typing import Optional

from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer


@dataclass
class ExtractionExample:
    text: str
    schema: dict


def create_extraction_dataset():
    """Create a minimal extraction dataset: invoice parsing."""
    examples = [
        {
            "text": "Invoice #12345 dated 2024-01-15 from Acme Corp for $1500 due on 2024-02-15.",
            "output": '{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme Corp", "amount": 1500, "due_date": "2024-02-15"}',
        },
        {
            "text": "Bill #67890 from TechSupply dated March 3rd, 2024, total $2300, pay by April 3rd, 2024.",
            "output": '{"invoice_id": "67890", "date": "2024-03-03", "vendor": "TechSupply", "amount": 2300, "due_date": "2024-04-03"}',
        },
        {
            "text": "Invoice 11111 from GlobalCorp on 2024-02-20 requesting payment of $999 by 2024-03-20.",
            "output": '{"invoice_id": "11111", "date": "2024-02-20", "vendor": "GlobalCorp", "amount": 999, "due_date": "2024-03-20"}',
        },
        {
            "text": "Purchase order PO-55555 issued by MegaTrade on January 10, 2024, value $5000, settlement due January 31, 2024.",
            "output": '{"invoice_id": "55555", "date": "2024-01-10", "vendor": "MegaTrade", "amount": 5000, "due_date": "2024-01-31"}',
        },
        {
            "text": "Invoice #99999 from SmallBiz dated 2024-04-01, amount $450, due 2024-05-01.",
            "output": '{"invoice_id": "99999", "date": "2024-04-01", "vendor": "SmallBiz", "amount": 450, "due_date": "2024-05-01"}',
        },
    ]
    # Each training string is the prompt followed directly by the target JSON.
    formatted_data = []
    for ex in examples:
        prompt = f"Extract invoice details as JSON.\nText: {ex['text']}\nJSON:"
        formatted_data.append({"text": prompt + ex["output"]})
    return Dataset.from_list(formatted_data)


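def fine_tune_for_extraction():
    """Fine-tune a small model for structured extraction using SFT."""
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # GPT-2 ships without a pad token; reuse EOS so batches can be padded.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    dataset = create_extraction_dataset()
    # LoRA trains small adapter matrices instead of the full model weights.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    # Note: argument names vary across TRL releases. Newer versions use
    # `max_length` instead of `max_seq_length` here, and `processing_class`
    # instead of `tokenizer` in SFTTrainer. Adjust to your installed version.
    training_args = SFTConfig(
        output_dir="./extraction_model",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        max_seq_length=256,
        logging_steps=1,
        save_steps=100,
        seed=42,
    )
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset,
        peft_config=lora_config,
    )
    print("Starting fine-tuning for structured extraction...")
    trainer.train()
    # Save the final adapter to output_dir explicitly: training alone only
    # writes periodic checkpoints, and save_steps=100 is far more steps than
    # a dataset this small will run.
    trainer.save_model(training_args.output_dir)
    print("Fine-tuning complete.")
    print(f"Model saved to {training_args.output_dir}")
    return trainer, tokenizer

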
if __name__ == "__main__":
    trainer, tokenizer = fine_tune_for_extraction()
Output
Starting fine-tuning for structured extraction...
[Training progresses through 3 epochs, logging loss at each step]
Fine-tuning complete.
Model saved to ./extraction_model
What just happened?
The code created a small dataset of invoice extraction tasks (unstructured text paired with JSON outputs), then used SFTTrainer with LoRA to fine-tune GPT-2 on these examples. The trainer optimized the model to map invoice text to JSON fields. Five examples are enough to exercise the pipeline, not to train a useful extractor. No constrained decoding was applied in this example: in production, you'd layer that on via the inference pipeline (using `outlines` or vLLM) to guarantee valid JSON output.
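One detail worth making explicit: each training string is just the prompt with the target JSON appended, so the inference prompt should be built from the same template. A minimal sketch, assuming the prompt wording from the example above. The helper names `build_prompt`, `build_training_text`, and `extract_completion` are hypothetical, and the `generated` string is a hand-written stand-in for model output, not a real generation.

```python
import json

# Same wording as the training prompt; keep one template for both phases.
PROMPT_TEMPLATE = "Extract invoice details as JSON.\nText: {text}\nJSON:"


def build_prompt(text: str) -> str:
    """Prompt shown to the model at inference time."""
    return PROMPT_TEMPLATE.format(text=text)


def build_training_text(text: str, target_json: str) -> str:
    """Training string: the inference prompt followed by the target JSON."""
    return build_prompt(text) + target_json


def extract_completion(prompt: str, generated: str) -> dict:
    """Strip the echoed prompt (if present) and parse the remaining JSON."""
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    return json.loads(completion.strip())


text = "Invoice #99999 from SmallBiz dated 2024-04-01, amount $450, due 2024-05-01."
target = (
    '{"invoice_id": "99999", "date": "2024-04-01", "vendor": "SmallBiz", '
    '"amount": 450, "due_date": "2024-05-01"}'
)

prompt = build_prompt(text)
training_text = build_training_text(text, target)

# Stand-in for decoded model output, which typically echoes the prompt.
generated = training_text
fields = extract_completion(prompt, generated)
print(fields["vendor"], fields["amount"])
```

Keeping the template in one place avoids a subtle failure where the model is trained on one prompt format and queried with another.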
Common gotcha
The biggest mistake: assuming fine-tuning alone guarantees valid JSON. A fine-tuned model is probabilistically more likely to output valid structure, but it can still hallucinate invalid JSON (mismatched braces, missing commas, wrong types). You must pair fine-tuning with constrained decoding at inference time. Fine-tuning teaches the model the pattern; constrained decoding prevents it from breaking the pattern.
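As a rough illustration of these failure modes, the sketch below checks a raw model output against the invoice fields used in this lesson. The `EXPECTED_FIELDS` mapping and the `validate_output` helper are assumptions made for the example. A check like this is useful as a safety net and for measuring how often unconstrained output is invalid; it does not replace constrained decoding.

```python
import json

# Field names and types taken from the invoice examples; adjust to your schema.
EXPECTED_FIELDS = {
    "invoice_id": str,
    "date": str,
    "vendor": str,
    "amount": (int, float),
    "due_date": str,
}


def validate_output(raw: str) -> list[str]:
    """Return a list of problems found in `raw`; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}")
    return problems


good = (
    '{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme Corp", '
    '"amount": 1500, "due_date": "2024-02-15"}'
)
truncated = '{"invoice_id": "12345", "date": "2024-01-15"'
wrong_types = (
    '{"invoice_id": 12345, "date": "2024-01-15", "vendor": "Acme Corp", '
    '"amount": "1500", "due_date": "2024-02-15"}'
)

print(validate_output(good))         # []
print(validate_output(truncated))    # one "invalid JSON: ..." entry
print(validate_output(wrong_types))  # ['wrong type for invoice_id', 'wrong type for amount']
```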
Error recovery
ValueError: Trying to instantiate a Trainer from the same configuration file.
RuntimeError: Expected all tensors to be on the same device.
KeyError: 'text' when building the dataset
Experienced dev note
The false economy: fine-tuning on 100 poor-quality examples is worse than zero-shot prompting with a well-engineered prompt. Spend time cleaning and validating your labeled data: structured extraction tasks are brittle to noisy labels. Also, fine-tune with the same setup you'll use at inference: if you plan to serve the model quantized to 4-bit, fine-tune with quantization enabled too. A mismatch between the training and inference configurations is a real production killer for structured tasks.
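A minimal sketch of the kind of label check this implies, assuming each record has a `text` and an `output` field as in the dataset above. The `find_label_issues` helper and its heuristics (targets must parse, keys must match the first record, and the invoice ID and vendor must appear verbatim in the source text) are illustrative assumptions; real checks depend on your schema and on how much normalization your labels apply.

```python
import json


def find_label_issues(records: list[dict]) -> list[tuple[int, str]]:
    """Flag records whose target JSON is malformed, inconsistent, or ungrounded."""
    issues = []
    reference_keys = None
    for i, rec in enumerate(records):
        try:
            target = json.loads(rec["output"])
        except json.JSONDecodeError:
            issues.append((i, "target is not valid JSON"))
            continue
        if not isinstance(target, dict):
            issues.append((i, "target is not a JSON object"))
            continue

        # Schema stability: every record should carry the same set of keys.
        keys = set(target)
        if reference_keys is None:
            reference_keys = keys
        elif keys != reference_keys:
            diff = sorted(keys ^ reference_keys)
            issues.append((i, f"keys differ from first record: {diff}"))

        # Grounding: copied-through fields should appear in the source text.
        for field in ("invoice_id", "vendor"):
            value = str(target.get(field, ""))
            if value and value not in rec["text"]:
                issues.append((i, f"{field} {value!r} not found in source text"))
    return issues


records = [
    {   # clean
        "text": "Invoice #12345 dated 2024-01-15 from Acme Corp for $1500 due on 2024-02-15.",
        "output": '{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme Corp", "amount": 1500, "due_date": "2024-02-15"}',
    },
    {   # vendor label does not match the text
        "text": "Invoice #99999 from SmallBiz dated 2024-04-01, amount $450, due 2024-05-01.",
        "output": '{"invoice_id": "99999", "date": "2024-04-01", "vendor": "SmallBiz Inc", "amount": 450, "due_date": "2024-05-01"}',
    },
    {   # due_date missing
        "text": "Invoice 11111 from GlobalCorp on 2024-02-20 requesting payment of $999 by 2024-03-20.",
        "output": '{"invoice_id": "11111", "date": "2024-02-20", "vendor": "GlobalCorp", "amount": 999}',
    },
    {   # truncated target
        "text": "Bill #67890 from TechSupply dated March 3rd, 2024, total $2300, pay by April 3rd, 2024.",
        "output": '{"invoice_id": "67890", "date": "2024-03-03"',
    },
]

issues = find_label_issues(records)
for index, message in issues:
    print(index, message)
```

Flagged records are candidates for manual review rather than automatic deletion, since a heuristic like verbatim matching will also flag legitimately normalized values.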
Check your understanding
If your fine-tuned model generates `{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme", "amount": 1500.50, "due_date": "2024-02-15", "extra_field": "oops"}` at inference time (note the extra_field), what likely went wrong and how would you fix it in a production system?
Answer hint
A correct answer recognizes: (1) the model hallucinated an extra field not in the schema (fine-tuning doesn't guarantee schema compliance), (2) the fix is to use constrained decoding at inference (grammar-based generation via outlines or vLLM) to restrict the model to your exact schema, and (3) this demonstrates why fine-tuning + constraints are a pair, not a replacement for each other.
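One common way to express "your exact schema" is a JSON Schema that lists the required fields and sets `additionalProperties` to false. Grammar-based decoders commonly accept a schema of this kind, though the exact API differs by library and is not shown here. The schema below is an assumption based on the invoice fields in this lesson, and the small `extra_fields` helper only illustrates what the schema forbids.

```python
import json

# Assumed schema for the invoice example; field types are inferred from the data.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "date": {"type": "string"},
        "vendor": {"type": "string"},
        "amount": {"type": "number"},
        "due_date": {"type": "string"},
    },
    "required": ["invoice_id", "date", "vendor", "amount", "due_date"],
    # This is the setting that rules out keys such as "extra_field".
    "additionalProperties": False,
}


def extra_fields(raw: str, schema: dict) -> list[str]:
    """List keys in the JSON object `raw` that the schema does not declare."""
    data = json.loads(raw)
    return sorted(set(data) - set(schema["properties"]))


output = (
    '{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme", '
    '"amount": 1500.50, "due_date": "2024-02-15", "extra_field": "oops"}'
)
print(extra_fields(output, INVOICE_SCHEMA))  # ['extra_field']
```

With a decoder constrained by this schema, the extra key could not be generated in the first place; the helper is only useful for auditing unconstrained outputs.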