How to use QLoRA with Llama
Quick answer
Use `BitsAndBytesConfig` with `load_in_4bit=True` and `LoraConfig` from `peft` to apply QLoRA to Llama models. Load the base Llama model with quantization, then wrap it with `get_peft_model` to enable efficient fine-tuning with low-rank adapters.
Prerequisites
- Python 3.8+
- `pip install transformers>=4.30.0`
- `pip install peft`
- `pip install bitsandbytes`
- `pip install torch` (with CUDA support recommended)
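Before installing, you can check which of these prerequisites are already importable in your environment. This is a small stdlib-only sketch; the package names are exactly the ones listed above:

```python
import importlib.util

def check_packages(packages):
    """Return a dict mapping each package name to whether it is importable."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

# The QLoRA stack described above
status = check_packages(["transformers", "peft", "bitsandbytes", "torch"])
for pkg, installed in status.items():
    print(f"{pkg}: {'installed' if installed else 'MISSING'}")
```

Any package reported as MISSING can be installed with the pip commands above.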
Setup
Install the required Python packages for Llama model loading, quantization, and QLoRA fine-tuning.
```
pip install transformers peft bitsandbytes torch
```
Step by step
This example shows how to load a Llama model with 4-bit quantization and apply QLoRA using peft. It includes loading the model, configuring LoRA, and preparing for fine-tuning.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch

# Load tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure 4-bit quantization (NF4 with double quantization, as in the QLoRA paper)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Define LoRA configuration for QLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Prepare the quantized model for training, then wrap it with LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Example input
inputs = tokenizer("Hello, how can I use QLoRA with Llama?", return_tensors="pt").to(model.device)

# Forward pass
outputs = model(**inputs)
logits = outputs.logits
print("Logits shape:", logits.shape)
```
Output
```
Logits shape: torch.Size([1, 11, 128256])
```
(The last dimension is the Llama 3.1 vocabulary size, 128256; the sequence length depends on how the prompt tokenizes.)
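To see why this setup is parameter-efficient, you can estimate how many trainable parameters the `LoraConfig` above adds. The sketch assumes Llama-3.1-8B-class shapes (hidden size 4096, 32 decoder layers, `q_proj` 4096→4096 and `v_proj` 4096→1024 under grouped-query attention); the figures are illustrative:

```python
# LoRA adds two low-rank matrices per targeted weight: A (r x in) and B (out x r),
# so the extra parameters per module are r * (in_features + out_features).
r = 16            # LoRA rank from the LoraConfig above
num_layers = 32   # assumed: Llama-3.1-8B has 32 decoder layers

# Assumed module shapes (in_features, out_features) for Llama-3.1-8B
target_shapes = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),  # 8 KV heads x head_dim 128 under GQA
}

per_layer = sum(r * (i + o) for i, o in target_shapes.values())
total = per_layer * num_layers
print(f"Trainable LoRA parameters: {total:,}")  # roughly 6.8M vs ~8B base parameters
```

Increasing `r` or adding more entries to `target_modules` (for example `k_proj`, `o_proj`) scales this count linearly, which is the main knob for the memory/quality trade-off.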
Common variations
- Use `load_in_8bit=True` (or `BitsAndBytesConfig(load_in_8bit=True)`) for 8-bit quantization instead of 4-bit.
- Adjust `LoraConfig` parameters like `r` and `lora_alpha` for different fine-tuning trade-offs.
- Use `device_map="cpu"` for CPU-only environments (slower).
- For async or distributed training, integrate with `accelerate` or `torch.distributed`.
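For the 8-bit variation, recent `transformers` versions prefer passing a `BitsAndBytesConfig` over the standalone `from_pretrained` flag. A sketch (same model name as above):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization config (the standalone load_in_8bit kwarg is deprecated
# in newer transformers releases in favor of this config object)
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_8bit,
    device_map="auto",
)
```

8-bit roughly doubles the weight memory relative to 4-bit but can be slightly more accurate; the rest of the LoRA setup is unchanged.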
Troubleshooting
- If you see `RuntimeError: CUDA out of memory`, reduce batch size or use a smaller `r` in `LoraConfig`.
- Ensure `bitsandbytes` is installed with CUDA support matching your GPU.
- If tokenizer loading fails, verify the model name and internet connection.
- For `load_in_4bit` errors, update `transformers` and `bitsandbytes` to the latest versions.
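As a rough sanity check for the out-of-memory case, you can estimate the weight memory before loading anything. This is back-of-envelope arithmetic only (it ignores activations, gradients, and CUDA overhead, which add several GB in practice); ~8B parameters is the Llama-3.1-8B assumption:

```python
# Back-of-envelope VRAM estimate for model weights at different precisions
params = 8e9  # assumed: ~8B parameters for Llama-3.1-8B

bytes_per_param = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}
for fmt, b in bytes_per_param.items():
    gb = params * b / 1024**3
    print(f"{fmt}: ~{gb:.1f} GB of weights")
```

If the 4-bit weight estimate alone approaches your GPU's VRAM, fine-tuning will not fit regardless of batch size, and a smaller model or CPU offload is needed.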
Key Takeaways
- Use `BitsAndBytesConfig` or `load_in_4bit=True` to enable 4-bit quantization for Llama models.
- Configure `LoraConfig` with appropriate parameters to apply QLoRA for efficient fine-tuning.
- Wrap the quantized model with `get_peft_model` to enable low-rank adaptation training.
- Adjust quantization and LoRA parameters based on hardware constraints and fine-tuning needs.
- Keep `transformers`, `peft`, and `bitsandbytes` updated to avoid compatibility issues.