How-to · Intermediate · 3 min read

How to use QLoRA with BitsAndBytes

Quick answer
Use BitsAndBytesConfig to load your model in 4-bit precision and combine it with LoraConfig from the peft library to apply QLoRA fine-tuning. This setup reduces memory usage while enabling efficient training of large language models.

PREREQUISITES

  • Python 3.8+
  • pip install "transformers>=4.30.0" (quote the version spec so the shell doesn't interpret >)
  • pip install peft
  • pip install bitsandbytes
  • pip install torch (with CUDA support recommended)

Setup

Install the required libraries for QLoRA and BitsAndBytes. Ensure you have a compatible GPU and CUDA installed for best performance.

bash
pip install transformers peft bitsandbytes torch

Step by step

This example shows how to load a large language model with 4-bit quantization using BitsAndBytesConfig and apply QLoRA with LoraConfig for fine-tuning.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch

# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load tokenizer and model with quantization
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare the quantized model for training (casts layer norms to full
# precision and enables input gradients for the frozen base weights)
model = prepare_model_for_kbit_training(model)

# Define LoRA configuration for QLoRA fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA to the quantized model
model = get_peft_model(model, lora_config)

# Example input
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# Forward pass
outputs = model(**inputs)
logits = outputs.logits

print(f"Logits shape: {logits.shape}")
output
Logits shape: torch.Size([1, 7, 128256])
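To see why this setup is so memory-efficient, a back-of-the-envelope count of the trainable LoRA parameters helps. The sketch below assumes Llama-3.1-8B's published dimensions (hidden size 4096, 32 layers, grouped-query attention with 8 KV heads of dim 128, so v_proj projects 4096 → 1024); each adapted linear layer gains only two small low-rank matrices.

```python
# Rough count of trainable LoRA parameters for the r=16 config above,
# assuming Llama-3.1-8B dimensions: hidden size 4096, 32 layers,
# grouped-query attention with 8 KV heads of dim 128 (v_proj out = 1024).
r = 16
hidden = 4096
kv_out = 8 * 128  # 1024
num_layers = 32

def lora_params(in_features, out_features, r):
    # Each adapted linear gains two low-rank matrices: A (r x in) and B (out x r)
    return r * in_features + out_features * r

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_out, r)  # q_proj + v_proj
total = num_layers * per_layer
print(f"Trainable LoRA parameters: {total:,}")
```

That comes to roughly 6.8M trainable parameters against 8B frozen base weights, well under 0.1% of the model.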

Common variations

  • Use load_in_8bit=True in BitsAndBytesConfig for 8-bit quantization instead of 4-bit.
  • Adjust r and lora_alpha in LoraConfig to control LoRA rank and scaling.
  • Use different target modules depending on the model architecture (e.g., ["query_key_value"] for some models).
  • For distributed or large-scale training, integrate with frameworks like accelerate or deepspeed.

Troubleshooting

  • If you get CUDA out-of-memory errors, reduce batch size or use gradient checkpointing.
  • Ensure your transformers and bitsandbytes versions are compatible.
  • If bnb_4bit_quant_type is unsupported, try "fp4" or omit the parameter.
  • Verify your GPU supports 4-bit quantization (NVIDIA Ampere or newer recommended).

Key Takeaways

  • Combine BitsAndBytesConfig with LoraConfig to enable memory-efficient QLoRA fine-tuning.
  • Use 4-bit quantization to drastically reduce GPU memory usage while maintaining model performance.
  • Adjust LoRA parameters and target modules based on your model and task for optimal results.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct