QLoRA with Hugging Face Trainer

Quick answer

QLoRA fine-tunes large language models efficiently by training small LoRA adapters on top of a frozen, 4-bit quantized base model. With Hugging Face, you load the quantized base model using BitsAndBytesConfig, wrap it with the peft library's LoraConfig, and pass the result to Trainer for parameter-efficient tuning.

Prerequisites

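Python 3.8+
pip install "transformers>=4.30.0" peft bitsandbytes accelerate datasets
A CUDA-capable GPU (at least 8 GB of VRAM recommended)
Access to the gated meta-llama/Llama-2-7b-chat-hf repository on the Hugging Face Hub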
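
The 8 GB figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes a nominal 7 billion parameters and counts only the base weights. Activations, LoRA gradients, optimizer state, and quantization constants are ignored, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: "7B" taken at face value

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # half-precision baseline
nf4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the fp16 footprint, which is what leaves room for training on a single consumer GPU.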

Setup

Install the required libraries: transformers for the model and training loop, peft for LoRA support, bitsandbytes for 4-bit quantization, accelerate for device placement with device_map="auto", and datasets for data loading.

bash

pip install "transformers>=4.30.0" peft bitsandbytes accelerate datasets
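
After installing, it can help to confirm which versions are actually present before debugging anything else. The following is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and including accelerate in the list is an assumption based on the use of device_map="auto" later on.

```python
from importlib.metadata import PackageNotFoundError, version

# accelerate is included as an assumption: device_map="auto" relies on it.
REQUIRED = ["transformers", "peft", "bitsandbytes", "accelerate", "datasets"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```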

Step by step

This example shows how to load a 4-bit quantized LLaMA model, apply QLoRA with peft, and fine-tune it using the Hugging Face Trainer.

python

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load tokenizer; LLaMA tokenizers ship without a pad token, so reuse EOS for padding
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the base weights and apply peft's standard setup for k-bit training
model = prepare_model_for_kbit_training(model)

# Setup LoRA config for QLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model and report how many parameters will be trained
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset (example: wikitext)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Pad each batch and copy input_ids into labels for causal language modeling;
# without labels the model returns no loss and Trainer cannot train
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments (evaluation is off by default, so no evaluation setting is needed)
training_args = TrainingArguments(
    output_dir="./qlora-llama2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    report_to="none",
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the LoRA adapter weights (the base model is not duplicated) and the tokenizer
model.save_pretrained("./qlora-llama2")
tokenizer.save_pretrained("./qlora-llama2")

output

***** Running training *****
  Num examples = 36718
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 3441
...
Saving model checkpoint to ./qlora-llama2
Training completed.
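
The batch size and step count in this log follow from the training arguments. The sketch below reproduces them, assuming a single GPU and a rounding rule (batches per epoch floor-divided by the accumulation steps) chosen because it matches the 3441 shown above. Other Trainer versions may round differently. `planned_steps` is a hypothetical helper.

```python
import math


def planned_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, total optimizer steps) for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    # Number of micro-batches the dataloader yields per epoch
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer step per `grad_accum` micro-batches (assumed floor rounding)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, updates_per_epoch * epochs


effective_batch, total_steps = planned_steps(
    num_examples=36718, per_device_batch=4, grad_accum=8, epochs=3
)
print(effective_batch, total_steps)  # 32 3441
```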

Common variations

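For seq2seq models, load the model with AutoModelForSeq2SeqLM, set task_type="SEQ_2_SEQ_LM" in LoraConfig, and use DataCollatorForSeq2Seq with Trainer.
Adjust r and lora_alpha in lora_config: a larger r adds trainable parameters and adapter capacity, while lora_alpha / r sets how strongly the adapter update is scaled.
To use bitsandbytes 8-bit quantization instead, set load_in_8bit=True in place of load_in_4bit=True; the bnb_4bit_* options then no longer apply.
For multi-GPU or distributed training, integrate with accelerate or deepspeed.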
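
To make the r trade-off concrete, the sketch below counts the parameters LoRA adds when only q_proj and v_proj are adapted. It assumes the Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square 4096 x 4096 projections) and the lora_alpha=32 used in this guide. `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(r, in_features, out_features, num_adapted_modules):
    """Each adapted linear layer gains A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features) * num_adapted_modules


HIDDEN = 4096          # assumption: Llama-2-7B hidden size
LAYERS = 32            # assumption: Llama-2-7B decoder layers
MODULES_PER_LAYER = 2  # q_proj and v_proj
LORA_ALPHA = 32        # value used in this guide

for r in (8, 16, 64):
    n = lora_trainable_params(r, HIDDEN, HIDDEN, LAYERS * MODULES_PER_LAYER)
    scale = LORA_ALPHA / r  # standard LoRA scaling applied to the adapter update
    print(f"r={r:>2}: {n:,} trainable params, scaling={scale}")
```

With r=16 this comes to about 8.4 million trainable parameters, a small fraction of the base model, and doubling r doubles that count while halving the scaling factor unless lora_alpha is raised with it.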

Troubleshooting

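If you get CUDA out-of-memory errors, lower per_device_train_batch_size and raise gradient_accumulation_steps to keep the same effective batch size, or shorten max_length in the tokenizer call.
Ensure bitsandbytes is installed correctly and compatible with your CUDA version.
Check that target_modules in LoraConfig matches your model architecture; q_proj and v_proj are LLaMA-style names, and other architectures name their attention projections differently.
Use device_map="auto" to automatically place model layers on the available devices (this requires accelerate).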
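
When target_modules does not match, listing the linear layers of the loaded model shows which names are available. The sketch below is a heuristic that matches submodules by class name (Linear for torch.nn, Linear4bit and Linear8bitLt for bitsandbytes), so it runs without importing either library. `candidate_target_modules` and the stand-in classes are hypothetical and exist only for illustration.

```python
LINEAR_CLASS_NAMES = {"Linear", "Linear4bit", "Linear8bitLt"}


def candidate_target_modules(named_modules):
    """Collect the trailing name component of every linear-like submodule."""
    names = set()
    for full_name, module in named_modules:
        if type(module).__name__ in LINEAR_CLASS_NAMES:
            names.add(full_name.split(".")[-1])
    return sorted(names)


# Stand-in classes so the sketch runs without loading a real model.
class Linear4bit:
    pass


class Linear:
    pass


class RMSNorm:
    pass


demo_modules = [
    ("model.layers.0.self_attn.q_proj", Linear4bit()),
    ("model.layers.0.self_attn.v_proj", Linear4bit()),
    ("model.layers.0.input_layernorm", RMSNorm()),
    ("lm_head", Linear()),
]
print(candidate_target_modules(demo_modules))  # ['lm_head', 'q_proj', 'v_proj']
```

With a real model, pass model.named_modules() instead of the demo list. The output head (lm_head here) is usually left out of target_modules.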

Key Takeaways

QLoRA combines 4-bit quantization with LoRA adapters for efficient fine-tuning of large models.
Load the base model with BitsAndBytesConfig, wrap it with peft's LoraConfig, and train it with Hugging Face's Trainer.
Adjust batch size and gradient accumulation to fit GPU memory constraints during training.

Verified 2026-04 · meta-llama/Llama-2-7b-chat-hf
