Code Intermediate medium · 6 min

Custom NER model loading

What you will learn

Load a fine-tuned Named Entity Recognition model from Hugging Face and use it for inference on custom text.

Why this matters

Most real-world NER tasks require models trained on your domain-specific data. You need to know how to load these models, handle model weights efficiently, and configure them for production inference without wasting GPU memory or taking hours to download weights unnecessarily.

Skip if: (1) you're building a quick prototype and a general-purpose NER checkpoint is sufficient (a bare base model such as `distilbert-base-uncased` has no trained NER head, so it is not an option by itself), (2) you need real-time streaming predictions and must use ONNX or TensorRT instead, (3) you're running inference on CPUs with <4GB RAM and need to quantize to 8-bit first.

Explanation

What it is: Custom NER model loading means retrieving a fine-tuned token classification model from Hugging Face Hub (or local storage) and preparing it for inference. Unlike general-purpose pre-trained checkpoints, these models have been trained on labeled entity data for your specific use case: medical records, legal documents, social media, etc.

How it works mechanically: When you call AutoModelForTokenClassification.from_pretrained(), transformers downloads the model weights and config from the Hub (or reads them from the local cache), reconstructs the model architecture, and loads the weights into it. If you pass device_map='auto' (which requires the accelerate package), layers are distributed across the available devices when the model exceeds a single device's memory. The tokenizer must match the one used in training; mismatched vocabularies produce garbage predictions. For inference, you either use the pipeline() API (which handles tokenization and post-processing automatically) or manually tokenize, pass the inputs through the model, and decode the predicted class IDs back to entity spans using the id2label mapping in the model config.
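The decoding step of the manual path can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes one predicted class ID per word (real models predict per subword token), BIO-style labels, and a hypothetical `ID2LABEL` dictionary standing in for the `id2label` mapping stored in a model's config.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical label map, written in the style of a CoNLL-2003 model config.
ID2LABEL: Dict[int, str] = {
    0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC",
}

def decode_entities(
    words: List[str], pred_ids: List[int], id2label: Dict[int, str]
) -> List[Tuple[str, str]]:
    """Turn one predicted class ID per word into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, pred_id in zip(words, pred_ids):
        label = id2label[pred_id]
        if label == "O":
            prefix, ent_type = "O", None
        else:
            prefix, ent_type = label.split("-", 1)
        # An I- tag extends the open span only when the entity type matches.
        if prefix == "I" and ent_type == current_type:
            current_words.append(word)
            continue
        # Any other tag closes the open span, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
        # B- tags (and stray I- tags) start a new span; O leaves none open.
        current_words = [word] if ent_type is not None else []
        current_type = ent_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

words = ["Thomas", "Scialom", "works", "in", "New", "York"]
pred_ids = [1, 2, 0, 0, 5, 6]  # made-up predictions for illustration
print(decode_entities(words, pred_ids, ID2LABEL))
# [('Thomas Scialom', 'PER'), ('New York', 'LOC')]
```

The pipeline() API does this grouping for you, and also handles subword tokens and character offsets, which this sketch leaves out.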

When to use it: Use this when you have domain-specific NER tasks where generic models fail: financial entity extraction, biomedical named entities, or multilingual documents. This is the production-grade pattern for any NER work beyond toy examples.

Analogy

Loading a custom NER model is like importing a specialized contractor who's trained on your specific building type. A general contractor (base model) can handle many jobs, but a specialist who's worked on 100 hospitals (fine-tuned model) will recognize architectural patterns your generic contractor misses. You need to hire the right person (load correct model) and brief them with your project specs (provide correct tokenizer).

Code

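```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load tokenizer and model from the same checkpoint so the vocabularies match.
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    device_map='auto',           # requires the accelerate package
    torch_dtype=torch.bfloat16   # half the memory of float32 weights
)

# No device= argument here: device_map above has already placed the model,
# and a model loaded that way should not be moved again.
ner_pipeline = pipeline(
    'token-classification',
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy='simple'  # group adjacent tokens of the same entity type
)

text = "Hugging Face is a company based in New York. Thomas Scialom works there."
results = ner_pipeline(text)

# Each result is a dict with entity_group, score, word, start, and end.
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']}")
```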

Output

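Hugging Face: ORG
New York: LOC
Thomas Scialom: PER

(Representative output: with aggregation_strategy='simple', adjacent tokens of the same entity type are merged into one span. Exact spans can differ slightly between model and library versions.)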

What just happened?

The code loaded a BERT-based NER model (dslim/bert-base-NER, already fine-tuned on the CoNLL-2003 dataset) with its weights in bfloat16 to reduce memory. It created a token-classification pipeline that tokenizes your input text, passes it through the model, and groups adjacent tokens that share an entity type into a single span before returning entity predictions with their types (ORG, LOC, PER). The output shows each recognized entity and its classification.

Common gotcha

The most common mistake is using a tokenizer from one model with a different model's weights: for example, loading `dslim/bert-base-NER` but accidentally instantiating `AutoTokenizer.from_pretrained('roberta-base')`. Loading succeeds and the tokenizer will happily tokenize your text, but the token IDs no longer refer to the vocabulary entries the model was trained on, so predictions are nonsense (or inference crashes if an ID falls outside the model's embedding table). In production code, always load the tokenizer and the model from the same `model_name`.
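One way to catch this early is a small pairing check before inference. The helper below is a hypothetical sketch, not part of transformers: it takes plain values so it runs without downloading anything. With real objects you would typically pass `tokenizer.name_or_path`, `model.config.name_or_path`, `len(tokenizer)`, and `model.config.vocab_size`.

```python
from typing import List

def check_pairing(
    tokenizer_source: str,
    model_source: str,
    tokenizer_size: int,
    model_vocab_size: int,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the pair looks consistent."""
    problems: List[str] = []
    if tokenizer_source != model_source:
        problems.append(
            f"tokenizer came from {tokenizer_source!r} but model from {model_source!r}"
        )
    # IDs at or above the embedding table size would crash the lookup at inference time.
    if tokenizer_size > model_vocab_size:
        problems.append(
            f"tokenizer has {tokenizer_size} tokens but the model embeds only {model_vocab_size}"
        )
    return problems

# roberta-base has 50,265 tokens; BERT cased checkpoints have 28,996.
problems = check_pairing("roberta-base", "dslim/bert-base-NER", 50265, 28996)
for problem in problems:
    print("WARNING:", problem)
```

A matching vocabulary size alone does not prove the pairing is right, since two tokenizers can have the same size and different contents. That is why the sketch also compares the source names.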

Error recovery

RuntimeError: CUDA out of memory

Your model weights (plus the activations for the current batch) exceed GPU VRAM. Fix: pass `device_map='cpu'` to from_pretrained() to load on CPU instead, or quantize at load time: create `bnb_config = BitsAndBytesConfig(load_in_8bit=True)` (imported from transformers; requires the bitsandbytes package) and pass `quantization_config=bnb_config` to from_pretrained().

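IndexError: index out of range in self

The tokenizer produced a token ID larger than the model's embedding table, which happens when the tokenizer and model come from different checkpoints (for example, a RoBERTa tokenizer with 50,265 tokens feeding a BERT model with 28,996). On GPU the same problem usually surfaces as a CUDA device-side assert. Note that no error is raised when the mismatched IDs happen to fit inside the table; predictions are simply wrong. Fix: Always use `AutoTokenizer.from_pretrained(model_name)` with the exact same `model_name` string as your model.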

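OSError: ... is not a local folder and is not a valid model identifier

The model name doesn't exist on Hugging Face Hub, the repository is private or gated and you are not authenticated, or you're offline. Fix: Verify the model exists at huggingface.co/models, or download it first with `git clone https://huggingface.co/model_name` (git-lfs must be installed to fetch the weight files) and pass the local path to from_pretrained().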

RuntimeError: The size of tensor a (...) must match the size of tensor b (512) at non-singleton dimension 1

The tokenized sequence is longer than the model's max_position_embeddings (512 for BERT). Fix: Truncate input: `tokenizer(text, truncation=True, max_length=512)`. Or switch to an architecture with a longer context, such as 'allenai/longformer-base-4096'; that is a base checkpoint, so it needs NER fine-tuning before use.

Experienced dev note

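In transformers 4.x, you'd often see code that manually placed models on a device with `.to(device)`. That still works for a model that fits on one device; prefer `device_map='auto'` (which requires the accelerate package) when a model has to be sharded across several GPUs or offloaded. Do not combine the two: a model loaded with `device_map` should not be moved again with `.to()` or a pipeline `device` argument. Also: `aggregation_strategy` defaults to `'none'`, which returns raw per-token predictions. `'simple'` groups adjacent tokens with the same entity type but can split a word whose subwords received different tags; use `'first'`, `'average'`, or `'max'` (these need a fast tokenizer) to resolve the label at word level when you need precise span positions for downstream processing. One more thing: NER models are tokenizer-dependent. If you fine-tune on custom data, you must load that exact fine-tuned checkpoint; the base model weights alone won't capture your domain entities.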
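To make the `'first'` strategy concrete, here is a simplified sketch of the idea: each word takes the label predicted for its first subword, so a word is never split across entity types. It is an illustration only. It relies on the WordPiece `##` continuation prefix, whereas the actual pipeline works from tokenizer offsets and scores, and the token pieces and labels below are made up.

```python
from typing import List, Tuple

def first_subword_labels(pieces: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Rebuild words from WordPiece tokens, giving each word the label of its first piece."""
    words: List[Tuple[str, str]] = []
    for piece, label in zip(pieces, labels):
        if piece.startswith("##") and words:
            # Continuation piece: extend the word, keep the first piece's label.
            text, first_label = words[-1]
            words[-1] = (text + piece[2:], first_label)
        else:
            # A new word starts here.
            words.append((piece, label))
    return words

# Hypothetical per-token predictions where the model disagrees on the last piece.
pieces = ["Sc", "##ial", "##om", "works"]
labels = ["B-PER", "I-PER", "O", "O"]
print(first_subword_labels(pieces, labels))
# [('Scialom', 'B-PER'), ('works', 'O')]
```

With token-level grouping, the stray `O` on the last piece could cut the name short. Resolving the label per word avoids that.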

Check your understanding

If you load a model fine-tuned on medical entity recognition, but accidentally use the tokenizer from the base BERT model (different vocabulary), why would the predictions still run without error but be incorrect? What's actually misaligned?

Answer hint

A correct answer explains that the tokenizer converts words to token IDs, and the model's embedding layer expects IDs that correspond to its specific vocabulary. If the tokenizer produces ID 1042 for 'disease' but the model was trained with a different tokenizer where ID 1042 means something else, the embedding layer looks up the wrong learned representation. The mismatch is silent because both tokenizers produce valid integers, and as long as those integers fall inside the model's embedding table nothing fails; the semantic mismatch is invisible until you see bad predictions.
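A toy version of this misalignment, using two made-up four-word vocabularies in place of real tokenizers, shows why nothing crashes: every ID is valid, it just points at the wrong learned entry.

```python
from typing import Dict, List

# Hypothetical vocabularies: same words, different ID assignments.
training_vocab: Dict[str, int] = {"[UNK]": 0, "patient": 1, "disease": 2, "aspirin": 3}
wrong_vocab: Dict[str, int] = {"[UNK]": 0, "aspirin": 1, "patient": 2, "disease": 3}

# What the model learned: each ID is tied to the word that held it during training.
learned_meaning: Dict[int, str] = {idx: word for word, idx in training_vocab.items()}

def what_the_model_sees(text: str, vocab: Dict[str, int]) -> List[str]:
    """Tokenize with `vocab`, then look each ID up in the model's learned table."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.split()]
    return [learned_meaning[i] for i in ids]

text = "patient disease"
print(what_the_model_sees(text, training_vocab))  # ['patient', 'disease']
print(what_the_model_sees(text, wrong_vocab))     # ['disease', 'aspirin']
```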

VERSION In transformers 4.x, from_pretrained() loaded weights in float32 unless you passed torch_dtype, and models were typically placed on a device with a manual `.to(device)` call. Manual placement still works for a model that fits on one device; `device_map='auto'` (which requires the accelerate package) matters once a model is too large for a single GPU. For GPU inference, set torch_dtype=torch.bfloat16 or torch.float16 explicitly to avoid float32 memory overhead.

Next, learn how to fine-tune a transformer for NER on your own labeled dataset and push it to Hugging Face Hub so you can load it the same way.
