Code beginner · 3 min read

How to run Llama 3 with Hugging Face in Python

Direct answer
Use the Hugging Face Transformers library with AutoTokenizer and AutoModelForCausalLM to load and run Llama 3 models in Python.

Setup

Install
bash
pip install transformers torch accelerate
Imports
python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

Examples

In: Generate text with Llama 3 model for prompt: 'Hello, world!'
Out: Hello, world! This is a sample output generated by Llama 3 model using Hugging Face.
In: Generate text with Llama 3 model for prompt: 'The future of AI is'
Out: The future of AI is incredibly promising, with advancements in natural language understanding and generation.
In: Generate text with Llama 3 model for prompt: '' (empty prompt)
Out: The model returns a default or minimal continuation depending on configuration.

Integration steps

  1. Install the required libraries: transformers and torch.
  2. Import AutoTokenizer and AutoModelForCausalLM from transformers.
  3. Load the Llama 3 tokenizer and model using their Hugging Face model ID.
  4. Prepare the input prompt and tokenize it.
  5. Generate output tokens using the model's generate method.
  6. Decode the generated tokens back to text and print the result.
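Steps 4-6 can be folded into a small reusable helper. This is a sketch that assumes the model and tokenizer have already been loaded with from_pretrained as in the Setup section; it works for any causal LM, not just Llama 3:

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Tokenize a prompt, run generate(), and decode the result."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move input tensors to the same device as the model (GPU or CPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Call it as `generate_text(model, tokenizer, "Hello, world!")`.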

Full code

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model for Llama 3 (gated model: accept the license on its Hugging Face page first)
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Prepare prompt
prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate output
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode and print
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
output
Hello, world! This is a sample output generated by Llama 3 model using Hugging Face.

API trace

Request
json
{"model_name": "meta-llama/Meta-Llama-3-8B", "inputs": {"input_ids": [...], "attention_mask": [...]}, "max_new_tokens": 50}
Response
json
{"generated_token_ids": [...], "sequences": [...]}
Extract: Use tokenizer.decode(generated_token_ids[0], skip_special_tokens=True) to get the generated text.
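Note that generate() returns the prompt tokens followed by the continuation in one sequence; to recover only the newly generated text, slice off the prompt length first. A minimal sketch, independent of any particular model:

```python
def new_token_ids(output_ids, prompt_len):
    """Drop the leading prompt tokens from a generate() output sequence."""
    return output_ids[prompt_len:]
```

With the Full code above, the prompt length is `inputs["input_ids"].shape[1]`, so the continuation alone is `outputs[0][inputs["input_ids"].shape[1]:]`.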

Variants

Streaming generation with TextStreamer

Use streaming to display tokens as they are generated for better user experience in interactive apps.

python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# TextStreamer prints tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, streamer=streamer)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # full text, if also needed
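For application code that needs the text as an iterator rather than printed to stdout, transformers also provides TextIteratorStreamer, which is consumed while generate() runs in a background thread. A sketch of the pattern; the streamer object is passed in so the same function works with any iterable streamer:

```python
from threading import Thread

def stream_text(model, tokenizer, streamer, prompt, **gen_kwargs):
    """Yield decoded text chunks while model.generate() runs in a thread.

    `streamer` is expected to be a transformers TextIteratorStreamer
    (or any object that generate() accepts and that is iterable).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, **gen_kwargs},
    )
    thread.start()
    for chunk in streamer:  # blocks until the next chunk is ready
        yield chunk
    thread.join()
```

Typical usage: `streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)`, then `for chunk in stream_text(model, tokenizer, streamer, prompt, max_new_tokens=50): print(chunk, end="")`.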
Run Llama 3 with CPU only

Use CPU-only mode when GPU is unavailable or for small-scale testing.

python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
Use a smaller Llama variant for faster inference

Use a smaller model from the Llama family, such as Llama 3.2 3B, to reduce memory usage and speed up inference at some quality tradeoff.

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Hello, world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Performance

Latency: ~2-5 seconds per 50 tokens on a single high-end GPU for Llama 3 8B
Cost: Depends on cloud GPU usage; free to run locally but requires powerful hardware
Rate limits: None when running locally; the hosted Hugging Face Inference API has its own limits
  • Use max_new_tokens to limit generation length and reduce latency.
  • Use fp16 precision to reduce memory and speed up inference.
  • Cache the tokenizer and model objects to avoid reload overhead.
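The caching tip can be sketched as a generic memoizing wrapper; the loader function here is a stand-in for whatever expensive call you want to avoid repeating (for these models, from_pretrained):

```python
from functools import lru_cache

def make_cached_loader(load_fn, maxsize=2):
    """Wrap an expensive loader so repeated requests for the same
    model name reuse the already-loaded object instead of reloading."""
    @lru_cache(maxsize=maxsize)
    def cached(model_name):
        return load_fn(model_name)
    return cached
```

For example, `load_model = make_cached_loader(lambda n: AutoModelForCausalLM.from_pretrained(n))` loads a given model once; subsequent calls with the same name return the cached object immediately.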
Approach | Latency | Cost/call | Best for
Standard generation (fp16, GPU) | ~2-5s per 50 tokens | Local hardware cost | High-quality generation with GPU
CPU-only generation | ~10-30s per 50 tokens | Local hardware cost | Testing or low-resource environments
Streaming generation | ~100ms per token | Local hardware cost | Interactive applications

Quick tip

Use device_map="auto" (which requires the accelerate package) with large Llama 3 models to automatically place model layers across the available GPUs, with CPU offload as a fallback.

Common mistake

Loading Llama 3 models without torch_dtype=torch.float16 (or bfloat16) and device_map falls back to float32 on a single device, doubling the memory footprint and often causing out-of-memory errors on GPUs.
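The arithmetic behind that mistake: model weights need roughly parameter count times bytes per parameter, so fp32 doubles the footprint of fp16. A quick back-of-the-envelope helper (the 8B figure is the Llama 3 8B parameter count; activations and KV cache add more on top):

```python
def weight_memory_gib(params_billion, bytes_per_param):
    """Approximate GiB needed just for model weights."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# Llama 3 8B: ~29.8 GiB in fp32 (4 bytes) vs ~14.9 GiB in fp16 (2 bytes)
```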

Verified 2026-04 · meta-llama/Meta-Llama-3-8B, meta-llama/Llama-3.2-3B