QLoRA with Hugging Face Trainer
Quick answer
Use QLoRA to fine-tune large language models efficiently by combining LoRA adapters with 4-bit quantization. With the Hugging Face Trainer, load a quantized base model using BitsAndBytesConfig, then apply LoraConfig from the peft library for parameter-efficient tuning.
Prerequisites
- Python 3.8+
- pip install "transformers>=4.30.0" peft bitsandbytes datasets
- Access to a GPU (at least 8GB VRAM recommended)
Setup
Install the required libraries: transformers for model and training, peft for LoRA support, bitsandbytes for 4-bit quantization, and datasets for data loading.
pip install transformers peft bitsandbytes datasets
Step by step
This example shows how to load a 4-bit quantized LLaMA model, apply QLoRA with peft, and fine-tune it using the Hugging Face Trainer.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load tokenizer; LLaMA defines no pad token, so reuse EOS for padding
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure 4-bit quantization (NF4 with double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

# Load base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training
# (freezes the base weights and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Setup LoRA config for QLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model and report how many parameters will be trained
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset (example: wikitext)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Pad each batch and copy input_ids into labels so the model returns a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments (evaluation is off by default when no eval dataset is given)
training_args = TrainingArguments(
    output_dir="./qlora-llama2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    report_to="none",
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the LoRA adapter weights and the tokenizer
model.save_pretrained("./qlora-llama2")
tokenizer.save_pretrained("./qlora-llama2")
Output
***** Running training *****
  Num examples = 36718
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 3441
...
Saving model checkpoint to ./qlora-llama2
Training completed.
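The batch size and step count in this log can be reproduced by hand. The following is a minimal sketch, assuming a single GPU and assuming the Trainer rounds the number of optimizer updates per epoch down (other transformers versions may round differently). The helper name total_optimization_steps is hypothetical and is not part of the Trainer API.

```python
import math

def total_optimization_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate optimizer steps for a run (hypothetical helper, not a Trainer API)."""
    # Batches the dataloader yields per epoch (the last batch may be partial)
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer update per `grad_accum` batches, rounded down (assumption)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

effective_batch = 4 * 8 * 1  # per-device batch x accumulation steps x devices
steps = total_optimization_steps(36718, 4, 8, 3)
print(effective_batch, steps)
```

With the values used above, this gives an effective batch size of 32 and 3441 steps, matching the log.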
Common variations
- Use Trainer with DataCollatorForSeq2Seq for seq2seq models.
- Adjust lora_config parameters like r and lora_alpha for different trade-offs: a higher r adds trainable parameters and adapter capacity, and the adapter update is scaled by lora_alpha / r.
- Use bitsandbytes 8-bit quantization by setting load_in_4bit=False and load_in_8bit=True.
- For distributed training, integrate with accelerate or deepspeed.
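To make the r trade-off concrete, the sketch below counts the trainable parameters that LoRA adds. It assumes Llama-2-7B's shapes (32 decoder layers, with 4096x4096 q_proj and v_proj matrices) and bias="none". The helper name lora_trainable_params is hypothetical.

```python
def lora_trainable_params(r, d_in, d_out, modules_per_layer, num_layers):
    """Count LoRA parameters: each adapted matrix gets A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * num_layers

# Assumed Llama-2-7B shapes: hidden size 4096, 32 layers, q_proj and v_proj adapted
for r, lora_alpha in [(8, 16), (16, 32), (64, 128)]:
    count = lora_trainable_params(r, 4096, 4096, modules_per_layer=2, num_layers=32)
    scaling = lora_alpha / r  # factor applied to the adapter output
    print(f"r={r:>2}  trainable={count:,}  scaling={scaling}")
```

Under these assumptions, r=16 yields 8,388,608 trainable parameters, a small fraction of the roughly 7 billion frozen base weights. Doubling r doubles the count, and keeping lora_alpha at 2 x r holds the scaling factor constant.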
Troubleshooting
- If you get CUDA out-of-memory errors, reduce per_device_train_batch_size and raise gradient_accumulation_steps to keep the same effective batch size.
- Ensure bitsandbytes is installed correctly and compatible with your CUDA version.
- Check that target_modules in LoraConfig matches the layer names of your model architecture.
- Use device_map="auto" to automatically place model layers on the available GPUs.
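A quick first check when diagnosing installation problems is to confirm which package versions are present. The sketch below uses only the standard library. The helper name installed_versions is hypothetical, and the check confirms only that a package is installed, not that bitsandbytes matches your CUDA version.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

report = installed_versions(["transformers", "peft", "bitsandbytes", "datasets"])
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed with the pip command from the Setup section.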
Key Takeaways
- QLoRA combines 4-bit quantization with LoRA adapters for efficient fine-tuning of large models.
- Use Hugging Face's Trainer with BitsAndBytesConfig and peft's LoraConfig for seamless integration.
- Adjust batch size and gradient accumulation to fit GPU memory constraints during training.
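As a rough illustration of why 4-bit quantization matters for memory, the sketch below estimates the size of the base model weights alone at different precisions. It assumes a round 7 billion parameters and ignores quantization constants, activations, optimizer state, and the LoRA adapters, so real usage will be higher. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # assumed round figure for a 7B model
for label, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights take about 14 GB in fp16 and about 3.5 GB in 4-bit, which leaves room for activations and adapter training on a much smaller GPU.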