How to use QLoRA with Llama
Quick answer
Use `BitsAndBytesConfig` with `load_in_4bit=True` and `LoraConfig` from `peft` to apply QLoRA to Llama models. Load the base Llama model with quantization, then wrap it with `get_peft_model` to enable efficient fine-tuning with low-rank adapters.
Prerequisites
- Python 3.8+
- `pip install transformers>=4.30.0`
- `pip install peft`
- `pip install bitsandbytes`
- `pip install torch` (with CUDA support recommended)
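Before installing, you can check which of these prerequisites are already importable in your environment. This is a small stdlib-only sketch; the package names are exactly the ones listed above:

```python
import importlib.util

def check_packages(packages):
    """Return a dict mapping each package name to whether it is importable."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

# The QLoRA stack described above
status = check_packages(["transformers", "peft", "bitsandbytes", "torch"])
for pkg, installed in status.items():
    print(f"{pkg}: {'installed' if installed else 'MISSING'}")
```

Any package reported as MISSING can be installed with the pip commands above.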
Setup
Install the required Python packages for Llama model loading, quantization, and QLoRA fine-tuning.
```
pip install transformers peft bitsandbytes torch
```
Step by step
This example shows how to load a Llama model with 4-bit quantization and apply QLoRA using peft. It includes loading the model, configuring LoRA, and preparing for fine-tuning.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch

# Load tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure 4-bit quantization (NF4 with double quantization, as in the QLoRA paper)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Define LoRA configuration for QLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Prepare the quantized model for training, then wrap it with LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Example input
inputs = tokenizer("Hello, how can I use QLoRA with Llama?", return_tensors="pt").to(model.device)

# Forward pass
outputs = model(**inputs)
logits = outputs.logits
print("Logits shape:", logits.shape)
```
Output
```
Logits shape: torch.Size([1, 11, 128256])
```
(The last dimension is the Llama 3.1 vocabulary size, 128256; the sequence length depends on how the prompt tokenizes.)
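To see why this setup is parameter-efficient, you can estimate how many trainable parameters the `LoraConfig` above adds. The sketch assumes Llama-3.1-8B-class shapes (hidden size 4096, 32 decoder layers, `q_proj` 4096→4096 and `v_proj` 4096→1024 under grouped-query attention); the figures are illustrative:

```python
# LoRA adds two low-rank matrices per targeted weight: A (r x in) and B (out x r),
# so the extra parameters per module are r * (in_features + out_features).
r = 16            # LoRA rank from the LoraConfig above
num_layers = 32   # assumed: Llama-3.1-8B has 32 decoder layers

# Assumed module shapes (in_features, out_features) for Llama-3.1-8B
target_shapes = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),  # 8 KV heads x head_dim 128 under GQA
}

per_layer = sum(r * (i + o) for i, o in target_shapes.values())
total = per_layer * num_layers
print(f"Trainable LoRA parameters: {total:,}")  # roughly 6.8M vs ~8B base parameters
```

Increasing `r` or adding more entries to `target_modules` (for example `k_proj`, `o_proj`) scales this count linearly, which is the main knob for the memory/quality trade-off.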
Common variations
- Use `load_in_8bit=True` (or `BitsAndBytesConfig(load_in_8bit=True)`) for 8-bit quantization instead of 4-bit.
- Adjust `LoraConfig` parameters like `r` and `lora_alpha` for different fine-tuning trade-offs.
- Use `device_map="cpu"` for CPU-only environments (slower).
- For async or distributed training, integrate with `accelerate` or `torch.distributed`.
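For the 8-bit variation, recent `transformers` versions prefer passing a `BitsAndBytesConfig` over the standalone `from_pretrained` flag. A sketch (same model name as above):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization config (the standalone load_in_8bit kwarg is deprecated
# in newer transformers releases in favor of this config object)
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_8bit,
    device_map="auto",
)
```

8-bit roughly doubles the weight memory relative to 4-bit but can be slightly more accurate; the rest of the LoRA setup is unchanged.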
Troubleshooting
- If you see `RuntimeError: CUDA out of memory`, reduce batch size or use a smaller `r` in `LoraConfig`.
- Ensure `bitsandbytes` is installed with CUDA support matching your GPU.
- If tokenizer loading fails, verify the model name and internet connection.
- For `load_in_4bit` errors, update `transformers` and `bitsandbytes` to the latest versions.
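As a rough sanity check for the out-of-memory case, you can estimate the weight memory before loading anything. This is back-of-envelope arithmetic only (it ignores activations, gradients, and CUDA overhead, which add several GB in practice); ~8B parameters is the Llama-3.1-8B assumption:

```python
# Back-of-envelope VRAM estimate for model weights at different precisions
params = 8e9  # assumed: ~8B parameters for Llama-3.1-8B

bytes_per_param = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}
for fmt, b in bytes_per_param.items():
    gb = params * b / 1024**3
    print(f"{fmt}: ~{gb:.1f} GB of weights")
```

If the 4-bit weight estimate alone approaches your GPU's VRAM, fine-tuning will not fit regardless of batch size, and a smaller model or CPU offload is needed.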
Key Takeaways
- Use `BitsAndBytesConfig` or `load_in_4bit=True` to enable 4-bit quantization for Llama models.
- Configure `LoraConfig` with appropriate parameters to apply QLoRA for efficient fine-tuning.
- Wrap the quantized model with `get_peft_model` to enable low-rank adaptation training.
- Adjust quantization and LoRA parameters based on hardware constraints and fine-tuning needs.
- Keep `transformers`, `peft`, and `bitsandbytes` updated to avoid compatibility issues.