How-to · Intermediate · 3 min read

How to use QLoRA with BitsAndBytes

Quick answer
Use BitsAndBytesConfig to load your model in 4-bit precision and combine it with LoraConfig from the peft library to apply QLoRA fine-tuning. This setup reduces memory usage while enabling efficient training of large language models.

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0" (quote the version spec so the shell doesn't interpret >)
  • pip install peft
  • pip install bitsandbytes
  • pip install torch (with CUDA support recommended)

Setup

Install the required libraries for QLoRA and BitsAndBytes. Ensure you have a compatible GPU and CUDA installed for best performance.

bash
pip install transformers peft bitsandbytes torch

Step by step

This example shows how to load a large language model with 4-bit quantization using BitsAndBytesConfig and apply QLoRA with LoraConfig for fine-tuning.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load tokenizer and model with quantization
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare the quantized model for training (casts layer norms to full
# precision and enables input gradients for the frozen base weights)
model = prepare_model_for_kbit_training(model)

# Define LoRA configuration for QLoRA fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA to the quantized model
model = get_peft_model(model, lora_config)

# Example input
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# Forward pass
outputs = model(**inputs)
logits = outputs.logits

print(f"Logits shape: {logits.shape}")
output
Logits shape: torch.Size([1, 7, 128256])
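To see why this setup is so memory-efficient, a back-of-the-envelope count of the trainable LoRA parameters helps. The sketch below assumes Llama-3.1-8B's published dimensions (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads of dim 128, so v_proj projects 4096 → 1024); each adapted linear layer gains only two small low-rank matrices.

```python
# Rough count of trainable LoRA parameters for the r=16 config above,
# assuming Llama-3.1-8B dimensions: hidden size 4096, 32 layers,
# grouped-query attention with 8 KV heads of dim 128 (v_proj out = 1024).
r = 16
hidden = 4096
kv_out = 8 * 128  # 1024
num_layers = 32

def lora_params(in_features, out_features, r):
    # Each adapted linear gains two low-rank matrices: A (r x in) and B (out x r)
    return r * in_features + out_features * r

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_out, r)  # q_proj + v_proj
total = num_layers * per_layer
print(f"Trainable LoRA parameters: {total:,}")
```

That comes to roughly 6.8M trainable parameters against 8B frozen base weights, well under 0.1% of the model.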

Common variations

  • Use load_in_8bit=True in BitsAndBytesConfig for 8-bit quantization instead of 4-bit.
  • Adjust r and lora_alpha in LoraConfig to control LoRA rank and scaling.
  • Use different target modules depending on the model architecture (e.g., ["query_key_value"] for some models).
  • For distributed or large-scale training, integrate with frameworks like accelerate or deepspeed.

Troubleshooting

  • If you get CUDA out-of-memory errors, reduce batch size or use gradient checkpointing.
  • Ensure your transformers and bitsandbytes versions are compatible.
  • If bnb_4bit_quant_type is unsupported, try "fp4" or omit the parameter.
  • Verify your GPU supports 4-bit quantization (NVIDIA Ampere or newer recommended).

Key Takeaways

  • Combine BitsAndBytesConfig with LoraConfig to enable memory-efficient QLoRA fine-tuning.
  • Use 4-bit quantization to drastically reduce GPU memory usage while maintaining model performance.
  • Adjust LoRA parameters and target modules based on your model and task for optimal results.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct