How-to · Beginner · 3 min read

How to use serverless inference on Hugging Face

Quick answer
Use the Hugging Face Inference API for serverless inference by sending HTTP requests directly or by using the official Python client, huggingface_hub. Authenticate with your API token and call hosted models through the client to run inference without managing any servers.

Prerequisites

  • Python 3.8+
  • Hugging Face account with API token
  • pip install "huggingface_hub>=0.14.1" (quote the requirement so the shell does not treat > as a redirect)

Setup

Install the official Hugging Face Python client and set your API token as an environment variable for authentication.

bash
pip install "huggingface_hub>=0.14.1"
export HF_API_TOKEN=hf_your_token_here  # replace with your actual token

Step by step

This example demonstrates serverless inference with the huggingface_hub Python client, calling a text-generation model hosted behind Hugging Face's Inference API. (Newer releases of huggingface_hub deprecate the InferenceApi class shown here in favor of InferenceClient, but the overall flow is the same.)

python
import os
from huggingface_hub import InferenceApi

# Read your Hugging Face API token from the environment
api_token = os.environ["HF_API_TOKEN"]

# Initialize the Inference API client for a model (e.g., GPT-2)
inference = InferenceApi(repo_id="gpt2", token=api_token)

# Input prompt for generation
input_text = "The future of AI is"

# Call the model serverlessly
output = inference(inputs=input_text)

print("Model output:", output)
output
Model output: [{'generated_text': 'The future of AI is very promising and will continue to evolve rapidly.'}]
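
The client is a thin wrapper over HTTP: it POSTs a JSON payload to the model's endpoint and decodes the JSON response. Here is a minimal standard-library sketch of the same call; the endpoint follows the api-inference.huggingface.co/models/<repo_id> pattern, and `build_request`/`query` are illustrative helpers of ours, not part of huggingface_hub.

```python
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/gpt2"

def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Build the POST request the Inference API expects."""
    payload = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

def query(prompt: str, token: str):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(prompt, token)) as resp:
        return json.loads(resp.read())

# Example (requires a valid token and network access):
# print(query("The future of AI is", "<your HF_API_TOKEN>"))
```

This is also a handy fallback in environments where you cannot install huggingface_hub.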

Common variations

  • Use different models by changing repo_id to any Hugging Face model ID.
  • For zero-shot classification or other tasks, pass task-specific inputs.
  • To run models locally instead of serverlessly, load them with the pipeline function from transformers (pass token= for gated or private models; the older use_auth_token argument is deprecated).
python
import os
from huggingface_hub import InferenceApi

# Zero-shot classification example
inference = InferenceApi(repo_id="facebook/bart-large-mnli", token=os.environ["HF_API_TOKEN"])

sequence_to_classify = "I love using Hugging Face models!"
candidate_labels = ["positive", "negative", "neutral"]

# Zero-shot models take the candidate labels as a parameter
result = inference(inputs=sequence_to_classify, params={"candidate_labels": candidate_labels})
print(result)
output
{"sequence": "I love using Hugging Face models!", "labels": ["positive", "neutral", "negative"], "scores": [0.98, 0.01, 0.01]}
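
The response lists labels in descending score order, so the model's best guess always comes first. A small illustrative helper (`top_label` is ours, not part of huggingface_hub) to pull it out:

```python
def top_label(result: dict) -> tuple:
    """Return the highest-scoring (label, score) pair from a
    zero-shot classification response, which orders labels by
    descending score."""
    return result["labels"][0], result["scores"][0]

# Example with the sample response above:
# top_label(result) -> ("positive", 0.98)
```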

Troubleshooting

  • If you get authentication errors, verify your HF_API_TOKEN environment variable is set correctly.
  • For rate limits or timeouts, check your Hugging Face plan and reduce request frequency.
  • Model not found errors mean the repo_id is incorrect or the model is private.
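
Rate limits and cold starts (the endpoint returning an error while a model loads on first use) are usually transient, so wrapping the call in a simple exponential-backoff retry often resolves them. A sketch (`call_with_retries` is an illustrative helper, not part of huggingface_hub):

```python
import time

def call_with_retries(fn, max_retries: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying on failure with exponential backoff.

    Useful when the serverless endpoint is rate-limiting you or
    is still loading the model on a cold start.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage with the client from the steps above:
# output = call_with_retries(lambda: inference(inputs="The future of AI is"))
```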

Key takeaways

  • Use the Hugging Face Inference API for serverless model calls without managing infrastructure.
  • Authenticate with your Hugging Face API token stored in environment variables.
  • Change repo_id to switch models easily for different AI tasks.
  • Handle common errors by verifying tokens, model IDs, and respecting rate limits.
Verified 2026-04 · gpt2, facebook/bart-large-mnli