Code Advanced hard · 8 min

Fine-tuning for structured extraction

What you will learn
Teach an LLM to reliably output structured data (JSON, XML) by fine-tuning on input/output examples, then applying constrained generation and schema validation at inference time.

Why this matters

Production systems need predictable structured outputs: extracting fields from documents, parsing API responses, or converting unstructured text into database records. Fine-tuning a model specifically for this task reduces the reliance on elaborate prompt engineering and brittle post-hoc parsing logic, while constrained decoding keeps the output syntactically valid against your schema.

Skip if: (1) you have fewer than 500 high-quality labeled examples, in which case few-shot prompting may be sufficient, (2) your schema changes frequently, in which case a prompt-based system is more maintainable, (3) you need single-request latency below 100 ms and cannot afford the overhead of quantizing and serving a fine-tuned model, or (4) your inference framework does not support constrained decoding for the model you are using.

Explanation

What it is: Fine-tuning an LLM to output structured data (JSON, XML, YAML) by training on examples where the model learns both the task logic and the format constraints. Unlike few-shot prompting, the model internalizes schema patterns, reducing hallucination and improving consistency.

How it works mechanically: You prepare a dataset where each example is an input (unstructured text) paired with a target output in your desired schema. During training, the model learns to map inputs to structured outputs via next-token prediction. At inference time, you layer on constrained decoding (e.g., via outlines or vLLM grammar modes) to force the model to generate only valid tokens that respect your schema, preventing malformed JSON or missing fields.
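
To make the constrained-decoding step concrete, here is a toy sketch in plain Python. It does not use outlines or vLLM; the grammar table, the `fake_logits` scorer, and the token vocabulary are all hypothetical stand-ins, chosen only to show how masking disallowed tokens forces a valid result even when the model prefers an invalid one.

```python
import json
import math

# Hypothetical toy grammar as a state machine:
# state -> {allowed token: next state}. Real libraries compile a
# JSON Schema or regex into a much larger automaton over model tokens.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"amount"': "colon"},
    "colon": {":": "value"},
    "value": {"1500": "close", "999": "close"},
    "close": {"}": "done"},
}

def fake_logits(tokens_so_far):
    """Stand-in for a model forward pass (an assumption, not a real model).

    It deliberately scores the invalid token "oops" highest at every step.
    """
    scores = {tok: 1.0 for tok in ["{", '"amount"', ":", "1500", "999", "}"]}
    scores["1500"] = 2.0  # the fake model mildly prefers 1500 over 999
    scores["oops"] = 5.0  # ...but prefers an invalid token most of all
    return scores

def pick_token(logits, allowed):
    """Greedy choice after masking every token the grammar disallows."""
    masked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}
    return max(masked, key=masked.get)

def constrained_decode():
    """Walk the grammar, choosing the best allowed token at each state."""
    state, tokens = "start", []
    while state != "done":
        allowed = GRAMMAR[state]
        token = pick_token(fake_logits(tokens), allowed)
        tokens.append(token)
        state = allowed[token]
    return "".join(tokens)

def unconstrained_first_token():
    """Without the mask, greedy decoding picks the invalid token."""
    logits = fake_logits([])
    return max(logits, key=logits.get)

result = constrained_decode()
parsed = json.loads(result)  # the constrained result is parseable JSON
```

The mask changes which token wins, not what the model has learned, which is why fine-tuning still matters for getting the right values into the valid structure.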

When to use it: Use this when you have sufficient labeled data (500+), your schema is stable, and you need both reliability (schema validity) and efficiency (no retries or repair logic for malformed output).

Analogy

It's like training a mail carrier to sort letters not just by learning postal zones, but by physically requiring them to place each letter in a labeled bin: the constrained decoding is the physical bin that prevents them from putting mail in the wrong slot, no matter how distracted they get.

Code

Illustrative only - requires the datasets, peft, trl, and transformers packages and downloads the gpt2 weights on first run; no API key is needed
python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

def create_extraction_dataset():
    """Create a minimal extraction dataset: invoice parsing."""
    examples = [
        {
            "text": "Invoice #12345 dated 2024-01-15 from Acme Corp for $1500 due on 2024-02-15.",
            "output": '{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme Corp", "amount": 1500, "due_date": "2024-02-15"}'
        },
        {
            "text": "Bill #67890 from TechSupply dated March 3rd, 2024, total $2300, pay by April 3rd, 2024.",
            "output": '{"invoice_id": "67890", "date": "2024-03-03", "vendor": "TechSupply", "amount": 2300, "due_date": "2024-04-03"}'
        },
        {
            "text": "Invoice 11111 from GlobalCorp on 2024-02-20 requesting payment of $999 by 2024-03-20.",
            "output": '{"invoice_id": "11111", "date": "2024-02-20", "vendor": "GlobalCorp", "amount": 999, "due_date": "2024-03-20"}'
        },
        {
            "text": "Purchase order PO-55555 issued by MegaTrade on January 10, 2024, value $5000, settlement due January 31, 2024.",
            "output": '{"invoice_id": "55555", "date": "2024-01-10", "vendor": "MegaTrade", "amount": 5000, "due_date": "2024-01-31"}'
        },
        {
            "text": "Invoice #99999 from SmallBiz dated 2024-04-01, amount $450, due 2024-05-01.",
            "output": '{"invoice_id": "99999", "date": "2024-04-01", "vendor": "SmallBiz", "amount": 450, "due_date": "2024-05-01"}'
        },
    ]
    
    formatted_data = []
    for ex in examples:
        prompt = f"""Extract invoice details as JSON.
Text: {ex['text']}
JSON:"""
        formatted_data.append({
            "text": prompt + ex["output"]
        })
    
    return Dataset.from_dict({"text": [d["text"] for d in formatted_data]})

def fine_tune_for_extraction():
    """Fine-tune a small model for structured extraction using SFT."""
    
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    dataset = create_extraction_dataset()
    
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    training_args = SFTConfig(
        output_dir="./extraction_model",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        max_seq_length=256,
        logging_steps=1,
        save_steps=100,
        seed=42,
    )
    
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset,
        peft_config=lora_config,
    )
    
    print("Starting fine-tuning for structured extraction...")
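    trainer.train()
    print("Fine-tuning complete.")
    # Save explicitly: this run has far fewer than save_steps=100 steps,
    # so do not rely on periodic checkpointing to write the adapter.
    trainer.save_model(training_args.output_dir)
    print(f"Model saved to {training_args.output_dir}")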
    
    return trainer, tokenizer

if __name__ == "__main__":
    trainer, tokenizer = fine_tune_for_extraction()

Output
Starting fine-tuning for structured extraction...
[Training progresses through 3 epochs, logging loss at each step]
Fine-tuning complete.
Model saved to ./extraction_model

What just happened?

The code created a small dataset of invoice extraction tasks (unstructured text paired with JSON outputs), then used SFTTrainer with LoRA to fine-tune GPT-2 on these examples. The trainer optimized the model to map invoice text to JSON fields. No constrained decoding was applied in this example: in production, you'd layer that on via the inference pipeline (using `outlines` or vLLM) to guarantee valid JSON output.
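
At inference time the prompt has to match the training format exactly, and a small model like GPT-2 often keeps generating after the closing brace. The sketch below shows one way to handle both with the standard library. `build_prompt` and `extract_first_json` are hypothetical helper names, and the `completion` string is a hand-written stand-in for model output, not something the model above produced.

```python
import json

def build_prompt(text: str) -> str:
    """Rebuild the prompt template used in create_extraction_dataset()."""
    return f"Extract invoice details as JSON.\nText: {text}\nJSON:"

def extract_first_json(completion: str) -> dict:
    """Parse the first JSON object in a completion and ignore trailing text."""
    start = completion.find("{")
    if start == -1:
        raise ValueError("no JSON object found in completion")
    # raw_decode stops at the end of the first complete JSON value and
    # raises json.JSONDecodeError (a ValueError) if that value is malformed.
    obj, _end = json.JSONDecoder().raw_decode(completion, start)
    return obj

prompt = build_prompt("Invoice #42 from Acme Corp for $10.")

# Hand-written example of a completion that runs past the JSON object.
completion = ' {"invoice_id": "42", "vendor": "Acme Corp", "amount": 10}\nText: Invoice'
record = extract_first_json(completion)
```

This only trims trailing text; it does not repair malformed JSON, which is the job of constrained decoding.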

Common gotcha

The biggest mistake: assuming fine-tuning alone guarantees valid JSON. A fine-tuned model is probabilistically more likely to output valid structure, but it can still hallucinate invalid JSON (mismatched braces, missing commas, wrong types). You must pair fine-tuning with constrained decoding at inference time. Fine-tuning teaches the model the pattern; constrained decoding prevents it from breaking the pattern.
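
Even with constrained decoding in place, a cheap validation pass is a useful safety net. A minimal sketch using only the standard library follows; the `REQUIRED_FIELDS` mapping is an assumption inferred from the invoice examples above, and `validate_invoice` is a hypothetical helper, not part of any library.

```python
import json

# Assumed schema, inferred from the invoice examples in this lesson.
REQUIRED_FIELDS = {
    "invoice_id": str,
    "date": str,
    "vendor": str,
    "amount": (int, float),
    "due_date": str,
}

def validate_invoice(raw: str) -> list:
    """Return a list of problems; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            problems.append(f"wrong type for field: {name}")
    for name in sorted(set(data) - set(REQUIRED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems
```

Calling `validate_invoice` on each model response before writing to a database lets you log or retry failures instead of storing malformed records.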

Error recovery

ValueError: Trying to instantiate a Trainer from the same configuration file.
You're likely running the script twice and the output_dir already has a trainer state file. Either delete ./extraction_model or change output_dir to a new path.
RuntimeError: Expected all tensors to be on the same device.
The model and the training tensors are on different devices (CPU vs GPU). Confirm that CUDA is available, or pin the model to a single device by passing `device_map='cpu'` to AutoModelForCausalLM.from_pretrained().
KeyError: 'text' when building the dataset
The dataset dictionary must have a 'text' key. The SFTTrainer expects a 'text' column by default unless you specify `dataset_text_field='your_column_name'` in SFTConfig.

Experienced dev note

The false economy: fine-tuning on 100 poor-quality examples is worse than zero-shot prompting with good prompt engineering. Spend time cleaning and validating your labeled data: structured extraction tasks are brittle to noisy labels. Also, fine-tune under the same setup you will use for inference: if you plan to serve the model quantized to 4-bit, fine-tune with quantization enabled too. A mismatch between the training and inference setups is a real production killer for structured tasks.
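
One inexpensive way to catch noisy labels before training is to check that each labeled value actually appears in its source text. The sketch below is a heuristic under stated assumptions: it only checks `invoice_id`, `vendor`, and `amount` (dates are skipped because the labels normalize them), and `find_ungrounded_fields` is a hypothetical helper name.

```python
import json

def find_ungrounded_fields(text: str, output: str,
                           fields=("invoice_id", "vendor", "amount")) -> list:
    """Return labeled fields whose value does not occur verbatim in the text."""
    label = json.loads(output)
    missing = []
    for name in fields:
        value = label.get(name)
        if value is None or str(value) not in text:
            missing.append(name)
    return missing

examples = [
    {
        "text": "Invoice #12345 dated 2024-01-15 from Acme Corp for $1500 due on 2024-02-15.",
        "output": '{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme Corp", "amount": 1500, "due_date": "2024-02-15"}',
    },
    {
        # Deliberately mislabeled: vendor and amount do not match the text.
        "text": "Invoice #99999 from SmallBiz dated 2024-04-01, amount $450, due 2024-05-01.",
        "output": '{"invoice_id": "99999", "date": "2024-04-01", "vendor": "SmallBiz Inc", "amount": 540, "due_date": "2024-05-01"}',
    },
]

# Keep only examples where every checked field is grounded in the text.
clean = [ex for ex in examples
         if not find_ungrounded_fields(ex["text"], ex["output"])]
```

A verbatim substring check will flag legitimate labels when the text formats values differently (for example "$1,500" against a label of 1500), so treat the result as a review queue rather than an automatic filter.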

Check your understanding

If your fine-tuned model generates `{"invoice_id": "12345", "date": "2024-01-15", "vendor": "Acme", "amount": 1500.50, "due_date": "2024-02-15", "extra_field": "oops"}` at inference time (note the extra_field), what likely went wrong and how would you fix it in a production system?

Answer hint

A correct answer recognizes: (1) the model hallucinated an extra field not in the schema (fine-tuning doesn't guarantee schema compliance), (2) the fix is to use constrained decoding at inference (grammar-based generation via outlines or vLLM) to restrict the model to your exact schema, and (3) this demonstrates why fine-tuning and constraints work as a pair, not as replacements for each other.

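VERSION

SFTConfig and SFTTrainer are stable in trl >= 0.8.0. If using trl < 0.8.0, use the older SFT trainer API. Newer trl releases rename SFTTrainer's `tokenizer` argument to `processing_class` and SFTConfig's `max_seq_length` to `max_length`; if the code above raises a TypeError on either argument, check the release notes for your installed version. peft >= 0.4.0 is required for LoraConfig on causal LMs. transformers >= 4.36.0 is recommended for best compatibility with constrained decoding libraries.
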
NEXT

Next, explore quantization-aware fine-tuning to keep structured extraction models small and fast without sacrificing schema compliance during inference.
