How to run Phi-3 with Hugging Face in Python
Direct answer
Use Hugging Face's transformers library or the huggingface_hub InferenceClient to load and run the Phi-3 model by specifying its model ID and passing input text for generation.

Setup
Install

```shell
pip install transformers huggingface_hub torch
```

Env vars

HUGGINGFACE_API_TOKEN: your Hugging Face access token (needed for the Inference API variants)

Imports
```python
from huggingface_hub import InferenceClient
# or alternatively, for local inference:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
```

Examples
In: Write a short poem about spring.
Out: Spring breathes life anew, Blossoms dance in morning dew, Warmth paints skies blue.

In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be both 0 and 1 at the same time, allowing it to solve certain problems much faster than regular computers.

In: (empty input)
Out: Error: Input text cannot be empty.
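The empty-input case above can be rejected before any API call is made. A minimal sketch; validate_prompt is a hypothetical helper, not part of huggingface_hub:

```python
def validate_prompt(prompt: str) -> str:
    """Reject empty or whitespace-only prompts before sending them to the model."""
    if not prompt or not prompt.strip():
        raise ValueError("Input text cannot be empty.")
    return prompt.strip()
```

Calling this at the top of your request path turns a confusing server-side error into an immediate, descriptive local one.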
Integration steps
- Install the required libraries with pip.
- Set your Hugging Face API token in the environment variable HUGGINGFACE_API_TOKEN.
- Initialize the Hugging Face InferenceClient with your API token.
- Call the text_generation method with the Phi-3 model ID and your input prompt.
- Extract and print the generated text from the response.
Full code
```python
import os
from huggingface_hub import InferenceClient

# Read the Hugging Face API token from the HUGGINGFACE_API_TOKEN environment variable
api_token = os.environ.get("HUGGINGFACE_API_TOKEN")
if not api_token:
    raise ValueError("Please set the HUGGINGFACE_API_TOKEN environment variable.")

client = InferenceClient(token=api_token)
model_id = "microsoft/Phi-3-mini-4k-instruct"
prompt = "Explain the benefits of renewable energy."

# text_generation takes the prompt as its first positional argument and
# returns the generated text as a plain string
response = client.text_generation(prompt, model=model_id, max_new_tokens=100)
print("Generated text:")
print(response)
```

Output
Generated text: Renewable energy offers numerous benefits including reducing greenhouse gas emissions, decreasing dependence on fossil fuels, and promoting sustainable development.
API trace

Request (POST to the model's Inference API endpoint)

{"inputs": "Explain the benefits of renewable energy.", "parameters": {"max_new_tokens": 100}}

Response

[{"generated_text": "Renewable energy offers numerous benefits including reducing greenhouse gas emissions..."}]

Extract

InferenceClient.text_generation already unwraps the response, so the returned value is the generated string itself.

Variants
Using transformers local model loading ›
Use this variant if you want to run Phi-3 locally without calling the Hugging Face Inference API, assuming you have the model weights downloaded.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" loads the weights in the precision stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Explain the benefits of renewable energy."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Async call with huggingface_hub AsyncInferenceClient ›
Use async variant to integrate Phi-3 calls in asynchronous Python applications for better concurrency.
```python
import os
import asyncio
from huggingface_hub import AsyncInferenceClient

async def main():
    api_token = os.environ.get("HUGGINGFACE_API_TOKEN")
    # AsyncInferenceClient mirrors InferenceClient, but its methods are coroutines
    client = AsyncInferenceClient(token=api_token)
    model_id = "microsoft/Phi-3-mini-4k-instruct"
    prompt = "Explain the benefits of renewable energy."
    response = await client.text_generation(prompt, model=model_id, max_new_tokens=100)
    print(response)

asyncio.run(main())
```

Performance
Latency: ~1-3 seconds per request depending on prompt length and server load
Cost: Check Hugging Face pricing; usage depends on model size and tokens generated
Rate limits: Depends on your Hugging Face subscription; typically 30-60 requests per minute for the free tier
- Limit <code>max_new_tokens</code> to reduce cost and latency.
- Use concise prompts to minimize input tokens.
- Cache frequent prompts and responses to avoid repeated calls.
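The caching tip above can be as simple as memoizing a wrapper around the client call. A sketch, with call_model standing in for the real client.text_generation request:

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how many times the "API" is actually hit

def call_model(prompt: str) -> str:
    # Stand-in for client.text_generation(prompt, model=model_id, ...)
    calls["count"] += 1
    return f"generated text for: {prompt}"

@lru_cache(maxsize=256)
def cached_generate(prompt: str) -> str:
    # Identical prompts are served from the cache instead of re-calling the API
    return call_model(prompt)
```

Note that lru_cache keys on exact argument equality, so this only helps when prompts repeat verbatim.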
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Hugging Face Inference API | ~1-3s | Variable, pay per token | Quick deployment without local resources |
| Local transformers model | Depends on hardware, ~100ms-1s | Free after download | Offline use and customization |
| Async API calls | ~1-3s with concurrency | Variable | High throughput applications |
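When the rate limits above are exceeded, requests fail with an error; retrying with exponential backoff smooths this out in high-throughput settings. A sketch; generate_with_retry is a hypothetical wrapper and call is any function that performs the actual request:

```python
import time

def generate_with_retry(call, prompt, retries=3, base_delay=1.0):
    """Retry a generation call with exponential backoff (base_delay, 2x, 4x, ...)."""
    for attempt in range(retries):
        try:
            return call(prompt)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

In production you would narrow the except clause to the client's HTTP/rate-limit errors rather than all exceptions.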
Quick tip
Always set <code>max_new_tokens</code> to control output length and avoid unexpectedly long generations with Phi-3.
Common mistake
Forgetting to set the <code>HUGGINGFACE_API_TOKEN</code> environment variable causes authentication errors when calling the Inference API.