Code Advanced medium · 6 min

tokenizer.push_to_hub()

What you will learn

Upload a trained or custom tokenizer to Hugging Face Hub with a single method call, making it instantly shareable and reproducible across teams.

Why this matters

Custom tokenizers trained on domain-specific data (legal text, code, medical records) need to be versioned and shared. Without <code>push_to_hub()</code>, you end up emailing tokenizer files or managing separate storage, which costs you reproducibility and makes collaboration fragile. A tokenizer hosted in the same Hub repo as its model loads from the same repo ID, which helps prevent tokenizer-model mismatches in production.

Skip if: Don't use <code>push_to_hub()</code> if your tokenizer is a standard pretrained one (e.g., GPT-2, BERT) already on the Hub: you don't need to re-upload it. Also skip it in isolated research environments where you never collaborate or deploy; local files are fine there. Never use it to upload tokenizers containing proprietary training data without legal review.

Explanation

What it is: push_to_hub() is a method on transformers tokenizer objects that uploads your serialized tokenizer to the Hugging Face Hub (the same registry where models live), under a repo you own or have write access to. It uploads the tokenizer config, vocab files, and merge files (for BPE) in a single call.

How it works mechanically: The method needs Hub credentials (for example via huggingface_hub.login()) and a repo ID in the form username/repo-name. It serializes the tokenizer in the same format .save_pretrained() writes locally, then uploads those files to the Hub. Each call adds a new commit to the repo's history; pass commit_message to describe the change. Don't count on a useful model card being written for you: plan to add a README that explains how the tokenizer was built. Once pushed, the tokenizer is loadable via AutoTokenizer.from_pretrained("username/repo-name") from any machine that can access the repo.
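Because the upload uses the same serialized files that .save_pretrained() writes, one way to preview what a push would send is to save into a temporary directory and list the result. The sketch below is only an illustration: preview_upload_files is a hypothetical helper name, and it assumes nothing about the tokenizer beyond a save_pretrained(directory) method.

```python
import tempfile
from pathlib import Path


def preview_upload_files(tokenizer) -> list[str]:
    """Return the names of the files save_pretrained() writes for this tokenizer.

    Hypothetical helper: works with any object exposing save_pretrained(dir).
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write the tokenizer exactly as it would be serialized locally
        tokenizer.save_pretrained(tmp_dir)
        # Collect the file names before the temporary directory is deleted
        return sorted(path.name for path in Path(tmp_dir).iterdir())
```

Called with a real tokenizer, the list typically includes tokenizer_config.json plus the vocabulary files, though the exact set depends on the tokenizer class and library version.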

When to use it: After training a custom tokenizer on domain data, or after modifying a pretrained one's config (e.g., adding special tokens). Use it whenever you want other developers (or your future self on another machine) to load your exact tokenizer without re-training or file copying.

Analogy

Imagine you've tuned your restaurant's secret sauce recipe. <code>push_to_hub()</code> is publishing it to a global recipe registry with versioning: any chef anywhere can now use exactly your recipe, and if you refine it next month, they can access v2 without confusion.

Code

Illustrative only - not runnable without a valid API key

python

import os

from transformers import AutoTokenizer
from huggingface_hub import login

# Authenticate with the Hub; read the token from the environment instead of hardcoding it
login(token=os.environ["HF_TOKEN"])

# Start from a pretrained tokenizer and extend its vocabulary
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["<DOMAIN>", "<ENTITY>"], special_tokens=False)

repo_id = "your-username/custom-domain-tokenizer"

# Upload the tokenizer files; the repo is created if it does not exist yet
tokenizer.push_to_hub(
    repo_id=repo_id,
    commit_message="Add domain-specific tokens to BERT tokenizer",
    private=False
)

print(f"Tokenizer pushed to: https://huggingface.co/{repo_id}")

# Reload from the Hub to verify the round trip
# (return_tensors="pt" requires PyTorch to be installed)
loaded_tokenizer = AutoTokenizer.from_pretrained(repo_id)
text = "This is a <DOMAIN> example with <ENTITY> token."
encoded = loaded_tokenizer(text, return_tensors="pt")
print(f"Token IDs: {encoded['input_ids']}")
print(f"Tokens: {loaded_tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])}")

Output

Tokenizer pushed to: https://huggingface.co/your-username/custom-domain-tokenizer
Token IDs: tensor([[  101,  2023,  2003,  1037, 30522,  2742,  2007, 30523, 19204,  1012,   102]])
Tokens: ['[CLS]', 'this', 'is', 'a', '<DOMAIN>', 'example', 'with', '<ENTITY>', 'token', '.', '[SEP]']

What just happened?

The code loaded a pretrained BERT tokenizer, added two custom tokens to its vocabulary, uploaded the serialized tokenizer to the Hugging Face Hub under a named repo, then re-loaded it from the Hub to verify the round trip. The reloaded tokenizer kept <DOMAIN> and <ENTITY> as single tokens and assigned them IDs 30522 and 30523, the first indices past the original bert-base-uncased vocabulary of 30,522 entries.

Common gotcha

The most common mistake is forgetting to authenticate (login() or HF_TOKEN) before push_to_hub(): you'll get an authentication error such as OSError: "Token is required" (the exact exception and wording vary by version). The second gotcha: pushing to a repo you don't have write access to. The third: not realizing that push_to_hub() creates the repo if it doesn't exist, without asking for confirmation, so a typo in the repo ID quietly creates a new repo; check the Hub after the call to confirm. Finally, pushing again does not overwrite history: each push adds a new commit, and omitting commit_message only records a generic default message, so use meaningful messages to track changes.

Error recovery

HfHubHTTPError 403

You don't have write permission to that repo. Solution: use a repo under your own username, or ask the repo owner to add you as a collaborator. Check your <code>repo_id</code> spelling.

OSError: Token is required

You didn't authenticate. Solution: call <code>huggingface_hub.login()</code> before <code>push_to_hub()</code>, or set env var <code>HF_TOKEN="your_token"</code>.
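To fail early with a clearer message, a script can check for the token before attempting the push. This is a minimal sketch: require_hf_token is a hypothetical helper, and it only covers the HF_TOKEN environment variable path (a token saved by login() is stored elsewhere and is not detected here).

```python
import os


def require_hf_token(env=None) -> str:
    """Return the Hub token from HF_TOKEN, or raise with an actionable message.

    Hypothetical helper; `env` defaults to os.environ and exists for testing.
    """
    env = os.environ if env is None else env
    # Treat a missing or whitespace-only value as "no token"
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise RuntimeError(
            "No Hub token found: set HF_TOKEN or call huggingface_hub.login() first."
        )
    return token
```

The returned value can then be passed to login(token=...), keeping the token out of source code.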

ValueError: Tokenizer class mismatch

Rare: happens if the tokenizer was saved with one tokenizer class but is loaded with a different one. Solution: load it with AutoTokenizer or the matching explicit class (e.g., <code>BertTokenizer</code>), and check the <code>tokenizer_class</code> entry in <code>tokenizer_config.json</code> for consistency.

HTTPError 422: Unprocessable Entity

Repo name is invalid (for example, it contains spaces or unsupported special characters). Solution: use only letters, digits, hyphens, underscores, and dots in the repo name, and don't start or end it with a hyphen or dot (e.g., <code>my-domain-tokenizer</code>, not <code>my domain tokenizer!</code>).
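A quick structural check before pushing can catch malformed repo IDs locally. The sketch below is an approximation written for illustration, not the Hub's own validator: repo_id_problems is a hypothetical helper, and the pattern and the 96-character limit are assumptions about the naming rules that may differ between huggingface_hub versions.

```python
import re

# Assumed approximation of the naming rules; the authoritative check lives
# in huggingface_hub and may differ between versions.
_SEGMENT = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?$")


def repo_id_problems(repo_id: str) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    parts = repo_id.split("/")
    if len(parts) != 2:
        return ["expected exactly one '/' as in 'username/repo-name'"]
    problems = []
    for label, segment in zip(("username", "repo name"), parts):
        # Reject empty segments, unsupported characters, and bad first/last characters
        if not _SEGMENT.match(segment):
            problems.append(f"{label} {segment!r} has invalid characters or edges")
        if "--" in segment or ".." in segment:
            problems.append(f"{label} {segment!r} contains '--' or '..'")
    if len(parts[1]) > 96:
        problems.append("repo name is longer than 96 characters")
    return problems
```

Returning a list of problems rather than raising lets a script print every issue at once before any network call is made.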

Experienced dev note

The silent footgun: your team trains a custom tokenizer, pushes it to Hub, then 6 months later someone loads it and gets different results because the tokenizer was pushed without the exact same special tokens config it was trained with. Always version your tokenizer: use semantic versioning in the repo name (my-tokenizer-v2), or tags and commit hashes that consumers pin with revision= when loading. Better: document in the repo's model card (README.md) exactly what data it was trained on and with which settings. Also, private only takes effect when the repo is first created and is ignored for an existing repo. Free accounts can create private repos too, though storage limits may apply, so check your account's current quotas.

Check your understanding

If you push a custom tokenizer to the Hub, then six months later a colleague loads it with AutoTokenizer.from_pretrained(), modifies the tokenizer locally (adds a special token), and pushes again without changing the repo ID: what happens to the original version on the Hub, and how would you preserve it?

Show answer hint

A correct answer explains that the new push does not erase the original: Hub repos are git repositories, so the push adds a new commit on the default branch and the earlier version stays in the commit history. What changes is what <code>from_pretrained()</code> returns by default, since it loads the latest revision of the default branch. To keep the original easy to reach, you'd either pin it when loading with <code>revision=</code> (a commit hash, tag, or branch), tag it before the new push (via the UI or <code>huggingface_hub.create_tag()</code>), or publish the modified tokenizer under a different repo ID (e.g., add <code>-v2</code>). The key insight: the Hub keeps the history for you, but naming and documenting versions is manual and requires discipline.
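One way to give each pushed version a stable name is to tag the repo right after pushing. The sketch below is a hedged illustration: push_and_tag and its tag_fn parameter are hypothetical, and it assumes the tokenizer exposes push_to_hub(repo_id, commit_message=...) and that huggingface_hub.create_tag(repo_id, tag=..., tag_message=...) is available.

```python
def push_and_tag(tokenizer, repo_id: str, tag: str, message: str, tag_fn=None):
    """Push the tokenizer, then tag the repo so this state has a stable name.

    Hypothetical helper; `tag_fn` is injectable so the flow can be tested offline.
    """
    if tag_fn is None:
        # Imported lazily so the helper can be exercised without the library
        from huggingface_hub import create_tag as tag_fn
    # Upload the current tokenizer state as a new commit
    tokenizer.push_to_hub(repo_id, commit_message=message)
    # Tag the repo's latest revision with a human-readable version name
    tag_fn(repo_id, tag=tag, tag_message=message)
    return f"{repo_id}@{tag}"
```

A consumer can then load that exact version with AutoTokenizer.from_pretrained(repo_id, revision=tag), regardless of later pushes.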

VERSION push_to_hub() has been available as a method on tokenizer objects since the transformers 4.x series, so calling it needs no separate huggingface_hub import (only authentication does). Parameter details can differ between releases, so check the documentation for your installed version and your Hub account's current limits.

Learn how to add custom special tokens with <code>add_special_tokens()</code> and ensure they're preserved when pushing: this prevents mismatches between your training tokenizer and the Hub version.
