How to run Phi-3 with Hugging Face in Python
Direct answer
Use Hugging Face's transformers library or the huggingface_hub InferenceClient to load and run the Phi-3 model by specifying its model ID and passing input text for generation.

Setup
Install

```shell
pip install transformers huggingface_hub torch
```

Env vars

HUGGINGFACE_API_TOKEN: your Hugging Face access token (needed for the Inference API variants)

Imports
```python
from huggingface_hub import InferenceClient
# or alternatively, for local inference:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
```

Examples
In: Write a short poem about spring.
Out: Spring breathes life anew, Blossoms dance in morning dew, Warmth paints skies blue.

In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be both 0 and 1 at the same time, allowing it to solve certain problems much faster than regular computers.

In: (empty input)
Out: Error: Input text cannot be empty.
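The empty-input case above can be rejected before any API call is made. A minimal sketch; validate_prompt is a hypothetical helper, not part of huggingface_hub:

```python
def validate_prompt(prompt: str) -> str:
    """Reject empty or whitespace-only prompts before sending them to the model."""
    if not prompt or not prompt.strip():
        raise ValueError("Input text cannot be empty.")
    return prompt.strip()
```

Calling this at the top of your request path turns a confusing server-side error into an immediate, descriptive local one.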
Integration steps
- Install the required libraries with pip.
- Set your Hugging Face API token in the environment variable HUGGINGFACE_API_TOKEN.
- Initialize the Hugging Face InferenceClient with your API token.
- Call the text_generation method with the Phi-3 model ID and your input prompt.
- Extract and print the generated text from the response.
Full code
```python
import os
from huggingface_hub import InferenceClient

# Read the Hugging Face API token from the HUGGINGFACE_API_TOKEN environment variable
api_token = os.environ.get("HUGGINGFACE_API_TOKEN")
if not api_token:
    raise ValueError("Please set the HUGGINGFACE_API_TOKEN environment variable.")

client = InferenceClient(token=api_token)
model_id = "microsoft/Phi-3-mini-4k-instruct"
prompt = "Explain the benefits of renewable energy."

# text_generation takes the prompt as its first positional argument and
# returns the generated text as a plain string
response = client.text_generation(prompt, model=model_id, max_new_tokens=100)
print("Generated text:")
print(response)
```

Output
Generated text: Renewable energy offers numerous benefits including reducing greenhouse gas emissions, decreasing dependence on fossil fuels, and promoting sustainable development.
API trace

Request (POST to the model's Inference API endpoint)

{"inputs": "Explain the benefits of renewable energy.", "parameters": {"max_new_tokens": 100}}

Response

[{"generated_text": "Renewable energy offers numerous benefits including reducing greenhouse gas emissions..."}]

Extract

InferenceClient.text_generation already unwraps the response, so the returned value is the generated string itself.

Variants
Using transformers local model loading ›
Use this variant if you want to run Phi-3 locally without calling the Hugging Face Inference API, assuming you have the model weights downloaded.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" loads the weights in the precision stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Explain the benefits of renewable energy."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Async call with huggingface_hub AsyncInferenceClient ›
Use async variant to integrate Phi-3 calls in asynchronous Python applications for better concurrency.
```python
import os
import asyncio
from huggingface_hub import AsyncInferenceClient

async def main():
    api_token = os.environ.get("HUGGINGFACE_API_TOKEN")
    # AsyncInferenceClient mirrors InferenceClient, but its methods are coroutines
    client = AsyncInferenceClient(token=api_token)
    model_id = "microsoft/Phi-3-mini-4k-instruct"
    prompt = "Explain the benefits of renewable energy."
    response = await client.text_generation(prompt, model=model_id, max_new_tokens=100)
    print(response)

asyncio.run(main())
```

Performance
Latency: ~1-3 seconds per request depending on prompt length and server load
Cost: Check Hugging Face pricing; usage depends on model size and tokens generated
Rate limits: Depends on your Hugging Face subscription; typically 30-60 requests per minute for the free tier
- Limit <code>max_new_tokens</code> to reduce cost and latency.
- Use concise prompts to minimize input tokens.
- Cache frequent prompts and responses to avoid repeated calls.
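The caching tip above can be as simple as memoizing a wrapper around the client call. A sketch, with call_model standing in for the real client.text_generation request:

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how many times the "API" is actually hit

def call_model(prompt: str) -> str:
    # Stand-in for client.text_generation(prompt, model=model_id, ...)
    calls["count"] += 1
    return f"generated text for: {prompt}"

@lru_cache(maxsize=256)
def cached_generate(prompt: str) -> str:
    # Identical prompts are served from the cache instead of re-calling the API
    return call_model(prompt)
```

Note that lru_cache keys on exact argument equality, so this only helps when prompts repeat verbatim.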
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Hugging Face Inference API | ~1-3s | Variable, pay per token | Quick deployment without local resources |
| Local transformers model | Depends on hardware, ~100ms-1s | Free after download | Offline use and customization |
| Async API calls | ~1-3s with concurrency | Variable | High throughput applications |
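When the rate limits above are exceeded, requests fail with an error; retrying with exponential backoff smooths this out in high-throughput settings. A sketch; generate_with_retry is a hypothetical wrapper and call is any function that performs the actual request:

```python
import time

def generate_with_retry(call, prompt, retries=3, base_delay=1.0):
    """Retry a generation call with exponential backoff (base_delay, 2x, 4x, ...)."""
    for attempt in range(retries):
        try:
            return call(prompt)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

In production you would narrow the except clause to the client's HTTP/rate-limit errors rather than all exceptions.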
Quick tip
Always set <code>max_new_tokens</code> to control output length and avoid unexpectedly long generations with Phi-3.
Common mistake
Forgetting to set the <code>HUGGINGFACE_API_TOKEN</code> environment variable causes authentication errors when calling the Inference API.