How-to · Intermediate · 4 min read

Fine-tuning Llama with Hugging Face

Quick answer
Use Hugging Face Transformers with peft and bitsandbytes to fine-tune Llama models efficiently. Load the pretrained Llama model with 4-bit quantization, apply LoRA adapters, and train on your dataset using Trainer from transformers.

PREREQUISITES

  • Python 3.8+
  • pip install transformers peft bitsandbytes datasets accelerate torch
  • Access to a GPU with CUDA (recommended)
  • Meta Llama model weights (local or Hugging Face Hub access)

Setup

Install the required Python packages for fine-tuning Llama models with Hugging Face:

  • transformers for model and training utilities
  • peft for parameter-efficient fine-tuning (LoRA)
  • bitsandbytes for 4-bit quantization support
  • datasets for loading and processing datasets
  • accelerate for distributed training support
  • torch for PyTorch backend

Use the following command to install all dependencies:

bash
pip install transformers peft bitsandbytes datasets accelerate torch

Step by step

This example demonstrates fine-tuning a Llama 3 model with LoRA adapters using Hugging Face Transformers and PEFT. It uses 4-bit quantization to reduce memory usage and trains on a sample dataset.

python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# Load tokenizer
model_name = "meta-llama/Meta-Llama-3-8B"  # Replace with your model path or HF repo

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Load pretrained model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Prepare the quantized model for training, then apply LoRA
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Load a sample dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-lora-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=100,
    eval_strategy="no",
    save_total_limit=1,
    fp16=True,
    optim="paged_adamw_32bit",
    report_to="none"
)

# Initialize Trainer with a collator that builds causal-LM labels from input_ids
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

# Start training
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./llama-lora-finetuned")
tokenizer.save_pretrained("./llama-lora-finetuned")

print("Fine-tuning complete and model saved.")
output
***** Running training *****
  Num examples = 288
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 72
...
Fine-tuning complete and model saved.

Common variations

  • Distributed training: Use the accelerate launch CLI for multi-GPU or mixed-precision training.
  • Different model sizes: Replace model_name with meta-llama/Meta-Llama-3-70B or another variant.
  • Full fine-tuning: Omit peft and load model without quantization for full parameter updates.
  • Streaming datasets: Use Hugging Face datasets streaming mode for large corpora.

Troubleshooting

  • If you see CUDA out of memory, reduce the per-device batch size, add gradient accumulation, or enable gradient checkpointing.
  • If tokenizer errors occur, upgrade transformers to a recent release; the Llama 3 tokenizer requires the fast (Rust-based) implementation.
  • For bitsandbytes installation issues, verify CUDA toolkit compatibility.
  • If training is slow, confirm mixed precision is enabled with fp16=True (or bf16=True on Ampere or newer GPUs) in TrainingArguments.

Key Takeaways

  • Use 4-bit quantization with bitsandbytes to fine-tune large Llama models on limited GPU memory.
  • Apply LoRA adapters via peft to efficiently fine-tune only a small subset of parameters.
  • Leverage Hugging Face Trainer and datasets for streamlined training and data processing.
Verified 2026-04 · meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-70B