Custom NER model loading
Why this matters
Most real-world NER tasks require models trained on your domain-specific data. You need to know how to load these models, handle model weights efficiently, and configure them for production inference without wasting GPU memory or taking hours to download weights unnecessarily.
Explanation
What it is: Custom NER model loading means retrieving a fine-tuned token classification model from the Hugging Face Hub (or local storage) and preparing it for inference. Unlike base pre-trained checkpoints, these models have been trained on labeled entity data for a specific use case: medical records, legal documents, social media, and so on.
How it works mechanically: When you call AutoModelForTokenClassification.from_pretrained(), transformers downloads the model weights and config from the Hub (or reads them from the local cache), reconstructs the model architecture, and loads the weights into it. If you pass device_map='auto' (which requires the accelerate package), layers are placed across the available devices, which matters when the model exceeds a single GPU's memory. The tokenizer must match the one used during training: mismatched vocabularies produce garbage predictions. For inference, you either use the pipeline() API (which handles tokenization and post-processing automatically) or manually tokenize, run the model, and decode the per-token predictions back to entity spans using the config's id2label mapping.
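The last step of the manual path, decoding per-token predictions into entity spans, can be sketched in plain Python. This is a minimal sketch that assumes the argmax label IDs and the tokenizer's character offsets are already available; the function name `decode_entities`, the toy `id2label` table, and the hand-written predictions are illustrative assumptions, not output from a real model.

```python
from typing import Dict, List, Optional, Tuple


def decode_entities(
    text: str,
    pred_ids: List[int],
    offsets: List[Tuple[int, int]],
    id2label: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Group per-token BIO predictions into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: Optional[list] = None  # [entity_type, start_char, end_char]

    def flush() -> None:
        # Close the open entity (if any) and record its text span.
        nonlocal current
        if current is not None:
            ent_type, start, end = current
            entities.append((text[start:end], ent_type))
            current = None

    for pred_id, (start, end) in zip(pred_ids, offsets):
        if start == end:
            # Special tokens such as [CLS] and [SEP] carry empty offsets; skip them.
            continue
        label = id2label[pred_id]
        if label == "O":
            flush()
            continue
        prefix, ent_type = label.split("-", 1)
        if prefix == "I" and current is not None and current[0] == ent_type:
            current[2] = end  # continuation: extend the open entity
        else:
            flush()  # a B- tag (or a stray I- tag) starts a new entity
            current = [ent_type, start, end]
    flush()
    return entities


# Toy inputs (assumed, not produced by a real model or tokenizer).
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC", 4: "I-LOC"}
text = "Thomas works in New York."
offsets = [(0, 0), (0, 6), (7, 12), (13, 15), (16, 19), (20, 24), (24, 25), (0, 0)]
pred_ids = [0, 1, 0, 0, 3, 4, 0, 0]

entities = decode_entities(text, pred_ids, offsets, id2label)
print(entities)
```

With a real model, `id2label` would come from `model.config.id2label`, and the offsets from a fast tokenizer called with `return_offsets_mapping=True`.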
When to use it: Use this when you have domain-specific NER tasks where generic models fail: financial entity extraction, biomedical named entities, or multilingual documents. This is the production-grade pattern for any NER work beyond toy examples.
Analogy
Loading a custom NER model is like importing a specialized contractor who's trained on your specific building type. A general contractor (base model) can handle many jobs, but a specialist who's worked on 100 hospitals (fine-tuned model) will recognize architectural patterns your generic contractor misses. You need to hire the right person (load correct model) and brief them with your project specs (provide correct tokenizer).
Code
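import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load tokenizer and model from the same checkpoint so the vocabularies match
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    device_map='auto',           # requires the accelerate package
    torch_dtype=torch.bfloat16,  # load weights in bfloat16 to halve weight memory
)

# No device argument here: device_map='auto' has already placed the model,
# and passing device as well can conflict with that placement
ner_pipeline = pipeline(
    'token-classification',
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy='simple',  # merge adjacent tokens of the same entity
)

text = "Hugging Face is a company based in New York. Thomas Scialom works there."
results = ner_pipeline(text)
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']}")

# Sample output; exact spans vary with model and library versions, and 'simple'
# aggregation usually merges neighbours such as "New York" into one span:
# Face: ORG
# New: LOC
# York: LOC
# Thomas: PER
# Scialom: PER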
What just happened?
The code loaded a BERT-based NER model (dslim/bert-base-NER, already fine-tuned on the CoNLL-2003 dataset) with its weights in bfloat16, which halves weight memory compared with float32. It then created a token-classification pipeline that tokenizes the input text, passes it through the model, and merges adjacent tokens predicted as the same entity into a single span before returning each span with its type (ORG, LOC, PER). The output lists each recognized entity and its classification.
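To make the bfloat16 saving concrete, here is a back-of-the-envelope calculation. It assumes roughly 108 million parameters for a BERT-base-sized model (an approximate figure, not read from the checkpoint) and counts weights only, ignoring activations and framework overhead.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


approx_params = 108_000_000  # assumption: approximate size of a BERT-base model

fp32_mib = weight_memory_mib(approx_params, "float32")
bf16_mib = weight_memory_mib(approx_params, "bfloat16")
print(f"float32: {fp32_mib:.0f} MiB, bfloat16: {bf16_mib:.0f} MiB")
```

For a model this small the saving is modest in absolute terms; the same ratio becomes significant once a model has billions of parameters.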
Common gotcha
The most common mistake: using a tokenizer from one model with a different model's weights. For example, loading `dslim/bert-base-NER` but accidentally instantiating `AutoTokenizer.from_pretrained('roberta-base')`. This can run without any error, because the tokenizer will happily tokenize your text, but it produces nonsense predictions because the vocabulary indices are misaligned. (If the tokenizer emits an ID larger than the model's embedding table, you get an index error instead, which is at least visible.) In production code, always load the tokenizer and the model from the same checkpoint name.
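The silent failure can be reproduced with toy data and no model at all. In this sketch the two vocabularies and the `embedding_meaning` table are made-up stand-ins (assumptions for illustration) for a real tokenizer vocabulary and a real embedding matrix.

```python
# Vocabulary the (hypothetical) model was trained with: token -> ID.
model_vocab = {"[UNK]": 0, "the": 1, "patient": 2, "has": 3, "diabetes": 4}
# Vocabulary of a different (hypothetical) tokenizer: same tokens, other IDs.
wrong_vocab = {"[UNK]": 0, "has": 1, "diabetes": 2, "the": 3, "patient": 4}

# Stand-in for the embedding table: row i holds what the model learned for ID i.
embedding_meaning = {idx: tok for tok, idx in model_vocab.items()}


def encode(words, vocab):
    """Map words to IDs, falling back to the unknown-token ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in words]


def what_the_model_sees(ids):
    """Look up each ID in the model's table, as an embedding layer would."""
    return [embedding_meaning[i] for i in ids]


words = ["the", "patient", "has", "diabetes"]
right = what_the_model_sees(encode(words, model_vocab))
wrong = what_the_model_sees(encode(words, wrong_vocab))
print(right)
print(wrong)
```

Both calls succeed because every ID is a valid row index; only the second one feeds the model a scrambled sentence.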
Error recovery
RuntimeError: CUDA out of memory
ValueError: Vocabulary size mismatch
FileNotFoundError: Model not found on Hub
AssertionError: target_size != input_size in reshape
Experienced dev note
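In transformers 4.x, you'd often see code that manually placed models on a device with `.to(device)`. In 5.5.x, prefer `device_map='auto'` when a model may not fit on a single device: it lets accelerate decide where each layer goes. For a small model such as BERT-base, `.to(device)` or the pipeline's `device` argument still works; just don't combine `device` with a model that was loaded using `device_map`. Also: the token-classification pipeline's default `aggregation_strategy` is `'none'`, which returns one prediction per token. `'simple'` merges adjacent tokens with the same entity type but can split a single word when its subwords receive different labels; use `'first'`, `'average'`, or `'max'` if you need word-level consistency for downstream processing. One more thing: NER models are tokenizer-dependent. If you fine-tune on custom data, you must load that exact fine-tuned checkpoint; the base model weights alone won't capture your domain entities.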
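A toy illustration of the difference between token-level and word-level aggregation. It does not call the transformers pipeline: `group_simple` and `group_first` are hypothetical helpers that only mimic the idea on hand-written WordPiece tokens, where a `##` prefix marks a continuation of the previous word. B-/I- prefixes are left out to keep the sketch short.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (wordpiece, predicted entity type or "O")


def join(left: str, piece: str) -> str:
    """Glue a continuation piece directly; otherwise separate with a space."""
    return left + piece[2:] if piece.startswith("##") else left + " " + piece


def merge_adjacent(items: List[Token]) -> List[Token]:
    """Merge neighbouring items that share an entity type; drop "O" items."""
    merged: List[Token] = []
    prev_type = "O"
    for piece, ent_type in items:
        if ent_type != "O" and ent_type == prev_type:
            merged[-1] = (join(merged[-1][0], piece), ent_type)
        elif ent_type != "O":
            merged.append((piece, ent_type))
        prev_type = ent_type
    return merged


def group_simple(tokens: List[Token]) -> List[Token]:
    """Token-level grouping: a word can be split if its pieces disagree."""
    return merge_adjacent(tokens)


def group_first(tokens: List[Token]) -> List[Token]:
    """Word-level grouping: each word takes the label of its first piece."""
    words: List[Token] = []
    for piece, ent_type in tokens:
        if piece.startswith("##") and words:
            words[-1] = (words[-1][0] + piece[2:], words[-1][1])
        else:
            words.append((piece, ent_type))
    return merge_adjacent(words)


# Hand-written predictions in which the two pieces of "Microsoft" disagree.
tokens = [("Micro", "ORG"), ("##soft", "PER"), ("hired", "O"),
          ("Sara", "PER"), ("Lee", "PER")]

simple_result = group_simple(tokens)
first_result = group_first(tokens)
print(simple_result)
print(first_result)
```

The token-level version leaves a dangling `##soft` fragment, which is the kind of span that breaks downstream offset handling; the word-level version keeps `Microsoft` whole.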
Check your understanding
If you load a model fine-tuned on medical entity recognition, but accidentally use the tokenizer from the base BERT model (different vocabulary), why would the predictions still run without error but be incorrect? What's actually misaligned?
Answer hint
A correct answer explains that the tokenizer converts words to token IDs, and the model's embedding layer expects IDs that correspond to its specific vocabulary. If the tokenizer produces ID 1042 for 'disease' but the model was trained with a different tokenizer where ID 1042 means something else, the embedding layer looks up the wrong learned representation. The mismatch is silent as long as the IDs fall within the model's embedding table, because both tokenizers produce valid integers; the semantic mismatch is invisible until you see bad predictions.