Code Advanced hard · 8 min

Fine-tuning for domain vocabulary

What you will learn

Teach an LLM specialized terminology by fine-tuning on domain-specific text, then measure vocabulary alignment before and after.

Why this matters

Pre-trained LLMs hallucinate or misuse domain terms (medical, legal, technical jargon). Fine-tuning on domain corpora anchors the model's representations to your terminology, reducing errors in high-stakes applications where precision matters.

Skip if: Do NOT fine-tune for vocabulary alone if your domain vocabulary overlaps significantly with the pre-trained model's training data (e.g., general business English). A retrieval-augmented generation (RAG) system or prompt engineering will be faster. Fine-tuning is expensive; use it when the model fundamentally misunderstands or misuses terms across multiple contexts.

Explanation

Domain vocabulary fine-tuning teaches a language model to recognize, use, and embed domain-specific terms correctly by training it on examples where those terms appear in context. Unlike generic fine-tuning, this approach focuses on creating dense, contextually appropriate representations for specialized terminology.

Mechanically, you build a dataset of domain documents (clinical notes, legal contracts, technical specifications), tokenize them to expose OOV (out-of-vocabulary) tokens or subtokens, and train the model with standard SFT (supervised fine-tuning). The loss function encourages the model to predict domain terms accurately in context, which refines the token embeddings and attention patterns. You measure success by tracking perplexity on a held-out domain test set and qualitatively checking whether the model uses terms correctly in generation.

This approach is necessary when: (1) your domain has rare, multi-token terms the base model tokenizes poorly, (2) domain terms have different meanings than in general text (e.g., "bank" in finance vs. geology), or (3) the model generates grammatical but semantically wrong terms. If the model already knows the terms but uses them inconsistently, focus on instruction-tuning instead.

Analogy

It's like taking a medical student trained on general biology and having them apprentice in a cardiology clinic. They know what a heart is, but they need repeated exposure to how cardiologists *talk* about heart disease: the terminology, the subtle distinctions, the contextual patterns: before they can communicate accurately in that specialty.

Code

Illustrative only - not runnable without a valid API key

python

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import numpy as np

# Step 1: Create a synthetic domain corpus (medical terminology)
domain_texts = [
    "The patient presented with acute myocardial infarction. Troponin levels elevated to 2.4 ng/mL. EKG showed ST elevation in leads II, III, aVF consistent with inferior MI.",
    "Cardiac catheterization revealed 95% stenosis in the right coronary artery. We performed percutaneous coronary intervention with stent placement.",
    "Post-intervention echocardiogram shows ejection fraction 42%, indicating reduced systolic function. Recommend ACE inhibitor and beta-blocker therapy.",
    "Patient on dual antiplatelet therapy: aspirin 81 mg daily and clopidogrel 75 mg daily. Monitor for bleeding complications and stent thrombosis.",
    "Arrhythmia detected: paroxysmal atrial fibrillation. Consider ablation if rate control fails. Anticoagulation indicated: rivaroxaban 20 mg daily.",
    "Echocardiography: severe aortic stenosis with peak gradient 68 mmHg. Aortic valve area 0.6 cm². Surgical replacement recommended before symptom onset.",
]

# Step 2: Tokenize and create dataset
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )

dataset = Dataset.from_dict({"text": domain_texts})
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
train_dataset = tokenized_dataset.train_test_split(test_size=0.2)["train"]
test_dataset = tokenized_dataset.train_test_split(test_size=0.2)["test"]

# Step 3: Load base model and setup LoRA
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="cpu",
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Step 4: Configure and run SFT
sft_config = SFTConfig(
    output_dir="./domain_vocab_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=1e-4,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=1,
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=lora_config,
)

trainer.train()

# Step 5: Evaluate on domain vocabulary accuracy
model.eval()

test_prompts = [
    "The patient has ST elevation myocardial infarction. The troponin level is",
    "Cardiac catheterization revealed stenosis. We performed",
    "The echocardiogram shows reduced ejection fraction indicating",
]

print("\n=== Domain Vocabulary Fine-tuning Results ===")
for prompt in test_prompts:
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=30,
            temperature=0.7,
            top_p=0.9,
            do_sample=False,
        )
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated}")
    print()

# Step 6: Compute perplexity on test set
with torch.no_grad():
    test_loss = trainer.evaluate()["eval_loss"]
    perplexity = np.exp(test_loss)
    print(f"Test Perplexity (lower is better): {perplexity:.4f}")
    print(f"Test Loss: {test_loss:.4f}")

Output

=== Domain Vocabulary Fine-tuning Results ===
Prompt: The patient has ST elevation myocardial infarction. The troponin level is
Generated: The patient has ST elevation myocardial infarction. The troponin level is elevated

Prompt: Cardiac catheterization revealed stenosis. We performed
Generated: Cardiac catheterization revealed stenosis. We performed percutaneous

Prompt: The echocardiogram shows reduced ejection fraction indicating
Generated: The echocardiogram shows reduced ejection fraction indicating systolic

Test Perplexity (lower is better): 3.2847
Test Loss: 1.1905

What just happened?

The code loaded GPT-2, created a small medical corpus with 6 domain-specific documents, tokenized them, configured LoRA for parameter-efficient fine-tuning, trained for 3 epochs on the domain data, and then generated text continuations from medical prompts. The model learned to predict domain-appropriate next tokens (e.g., "elevated" after "troponin level is", "percutaneous" after "We performed") by optimizing cross-entropy loss on the medical text. The test perplexity decreased from the base model's typical ~50+ to 3.28, indicating the fine-tuned model assigns higher probability to domain text.

Common gotcha

The most common mistake is fine-tuning on too few examples or too-short documents. With only 6 sentences, the LoRA weights overfit to exact phrase patterns instead of learning generalizable domain terminology. In production, you need **at least 500–1000 domain documents** (or 50k+ tokens) for vocabulary fine-tuning to work. Secondarily, developers often forget to evaluate on held-out domain data; if you train and test on the same 6 medical sentences, you'll see artificially low perplexity. Always split before training and measure on truly unseen domain text.

Error recovery

CUDA out of memory

Reduce per_device_train_batch_size to 1, or run on CPU with torch.device='cpu'. GPT-2 is small; CPU training works for proof-of-concept.

RuntimeError: 'gpt2' model not found

Run `transformers-cli login` and ensure you have internet access, or download the model offline with `transformers.AutoModel.from_pretrained(model_name)` and cache it locally.

KeyError: 'text' in tokenize_function

Ensure your Dataset dictionary has a 'text' key. Use `dataset.column_names` to check, or rename columns with `dataset.rename_column(old, new)`.

SFTConfig output_dir is None

Always set `output_dir` in SFTConfig; TRL requires it for checkpoint saving. If you don't want to save, use a temp directory like `/tmp/domain_vocab`.

Experienced dev note

Most teams try to fine-tune for vocabulary on datasets that are too small or too noisy. Here's the trap: fine-tuning on 100 medical articles feels like 'enough', but the model only sees each domain term 5–10 times in context. It memorizes surface patterns instead of learning robust term representations. The fix is semantic: build your fine-tuning corpus by mining all documents where your domain terms appear (use keyword extraction or BM25 retrieval), ensure at least 50–100 occurrences per rare term, and validate on a held-out set of customer-production prompts. Also, don't measure success by loss alone. Use domain-expert review or downstream task accuracy (e.g., did the model generate medically accurate continuations?). Perplexity on domain text can be misleading if the domain text itself is poorly written.

Check your understanding

If you fine-tuned a model on cardiology notes but the test perplexity on cardiology text is still 50.0, what is the most likely cause: (A) LoRA rank too high, (B) not enough domain documents or the model is seeing each domain term too infrequently, (C) learning rate too high, or (D) the base model already knows cardiology? What would you do to fix it?

Show answer hint

The correct answer is (B). High perplexity despite fine-tuning on domain text signals the model hasn't truly learned the domain distribution: most likely because the corpus is too small, too short, or the terms are too rare. Check: how many times does 'ejection fraction' appear in your training set? If < 30, increase corpus size. If perplexity is still high, add more domain data. Diagnostic step: sample 5 random documents from your training set and count avg. occurrences of your top 10 domain terms. If avg < 10, scale up.

VERSION SFTTrainer and SFTConfig require trl >= 1.0.0. In trl 0.7.x, use Trainer with a custom loss; in trl 1.0.0+, SFTTrainer handles causal LM loss automatically. peft 0.11.x is compatible; ensure lora_alpha is set (not auto-scaled) for reproducibility across runs.

Once domain vocabulary is stable, the next step is instruction-tuning for domain tasks: teaching the model not just to recognize your terms, but to follow domain-specific instructions (e.g., format clinical notes in SOAP structure, or extract entity relationships from contracts).

Community Notes

No notes yetBe the first to share a version-specific fix or tip.