Code beginner · 3 min read

How to run Llama 3 with Hugging Face in Python

Direct answer
Use the Hugging Face Transformers library with AutoTokenizer and AutoModelForCausalLM to load and run Llama 3 models in Python.

Setup

Install
bash
pip install transformers torch accelerate
Imports
python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

Examples

In: Generate text with Llama 3 model for prompt: 'Hello, world!'
Out: Hello, world! This is a sample output generated by Llama 3 model using Hugging Face.
In: Generate text with Llama 3 model for prompt: 'The future of AI is'
Out: The future of AI is incredibly promising, with advancements in natural language understanding and generation.
In: Generate text with Llama 3 model for prompt: '' (empty prompt)
Out: The model returns a default or minimal continuation depending on configuration.

Integration steps

  1. Install the required libraries: transformers and torch.
  2. Import AutoTokenizer and AutoModelForCausalLM from transformers.
  3. Load the Llama 3 tokenizer and model using their Hugging Face model ID.
  4. Prepare the input prompt and tokenize it.
  5. Generate output tokens using the model's generate method.
  6. Decode the generated tokens back to text and print the result.
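Steps 4-6 can be folded into a small reusable helper. This is a sketch that assumes the model and tokenizer have already been loaded with from_pretrained as in the Setup section; it works for any causal LM, not just Llama 3:

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Tokenize a prompt, run generate(), and decode the result."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move input tensors to the same device as the model (GPU or CPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Call it as `generate_text(model, tokenizer, "Hello, world!")`.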

Full code

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model for Llama 3 (gated model: accept the license on its Hugging Face page first)
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Prepare prompt
prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode and print
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
output
Hello, world! This is a sample output generated by Llama 3 model using Hugging Face.

API trace

Request
json
{"model_name": "meta-llama/Meta-Llama-3-8B", "inputs": {"input_ids": [...], "attention_mask": [...]}, "max_new_tokens": 50}
Response
json
{"generated_token_ids": [...], "sequences": [...]}
Extract: Use tokenizer.decode(generated_token_ids[0], skip_special_tokens=True) to get the generated text.
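Note that generate() returns the prompt tokens followed by the continuation in one sequence; to recover only the newly generated text, slice off the prompt length first. A minimal sketch, independent of any particular model:

```python
def new_token_ids(output_ids, prompt_len):
    """Drop the leading prompt tokens from a generate() output sequence."""
    return output_ids[prompt_len:]
```

With the Full code above, the prompt length is `inputs["input_ids"].shape[1]`, so the continuation alone is `outputs[0][inputs["input_ids"].shape[1]:]`.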

Variants

Streaming generation with TextStreamer

Use streaming to display tokens as they are generated for better user experience in interactive apps.

python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# TextStreamer prints tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, streamer=streamer)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # full text, if also needed
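For application code that needs the text as an iterator rather than printed to stdout, transformers also provides TextIteratorStreamer, which is consumed while generate() runs in a background thread. A sketch of the pattern; the streamer object is passed in so the same function works with any iterable streamer:

```python
from threading import Thread

def stream_text(model, tokenizer, streamer, prompt, **gen_kwargs):
    """Yield decoded text chunks while model.generate() runs in a thread.

    `streamer` is expected to be a transformers TextIteratorStreamer
    (or any object that generate() accepts and that is iterable).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, **gen_kwargs},
    )
    thread.start()
    for chunk in streamer:  # blocks until the next chunk is ready
        yield chunk
    thread.join()
```

Typical usage: `streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)`, then `for chunk in stream_text(model, tokenizer, streamer, prompt, max_new_tokens=50): print(chunk, end="")`.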
Run Llama 3 with CPU only

Use CPU-only mode when GPU is unavailable or for small-scale testing.

python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
Use a smaller Llama variant for faster inference

Use a smaller model from the Llama family, such as Llama 3.2 3B, to reduce memory usage and speed up inference at some quality tradeoff.

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Performance

Latency: ~2-5 seconds per 50 tokens on a single high-end GPU for Llama 3 8B
Cost: Depends on cloud GPU usage; free to run locally but requires powerful hardware
Rate limits: None when running locally; the hosted Hugging Face Inference API has its own limits
  • Use max_new_tokens to limit generation length and reduce latency.
  • Use fp16 precision to reduce memory and speed up inference.
  • Cache the tokenizer and model objects to avoid reload overhead.
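The caching tip can be sketched as a generic memoizing wrapper; the loader function here is a stand-in for whatever expensive call you want to avoid repeating (for these models, from_pretrained):

```python
from functools import lru_cache

def make_cached_loader(load_fn, maxsize=2):
    """Wrap an expensive loader so repeated requests for the same
    model name reuse the already-loaded object instead of reloading."""
    @lru_cache(maxsize=maxsize)
    def cached(model_name):
        return load_fn(model_name)
    return cached
```

For example, `load_model = make_cached_loader(lambda n: AutoModelForCausalLM.from_pretrained(n))` loads a given model once; subsequent calls with the same name return the cached object immediately.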
Approach | Latency | Cost/call | Best for
Standard generation (fp16, GPU) | ~2-5s per 50 tokens | Local hardware cost | High-quality generation with GPU
CPU-only generation | ~10-30s per 50 tokens | Local hardware cost | Testing or low-resource environments
Streaming generation | ~100ms per token | Local hardware cost | Interactive applications

Quick tip

Use device_map="auto" (which requires the accelerate package) with large Llama 3 models to automatically place model layers across the available GPUs, with CPU offload as a fallback.

Common mistake

Loading Llama 3 models without torch_dtype=torch.float16 (or bfloat16) and device_map falls back to float32 on a single device, doubling the memory footprint and often causing out-of-memory errors on GPUs.
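The arithmetic behind that mistake: model weights need roughly parameter count times bytes per parameter, so fp32 doubles the footprint of fp16. A quick back-of-the-envelope helper (the 8B figure is the Llama 3 8B parameter count; activations and KV cache add more on top):

```python
def weight_memory_gib(params_billion, bytes_per_param):
    """Approximate GiB needed just for model weights."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Llama 3 8B: ~29.8 GiB in fp32 (4 bytes) vs ~14.9 GiB in fp16 (2 bytes)
```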

Verified 2026-04 · meta-llama/Meta-Llama-3-8B, meta-llama/Llama-3.2-3B