How-to · Intermediate · 3 min read

LoRA training on a single GPU

Quick answer
Use PEFT with Hugging Face transformers and bitsandbytes for 4-bit quantization to train LoRA adapters on a single GPU efficiently. This approach reduces memory usage, enabling fine-tuning of large models without multi-GPU setups.
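To see what a LoRA adapter actually is before wiring up the libraries, here is a minimal sketch in plain PyTorch (no PEFT; all sizes and names are illustrative). A frozen weight `W` gets a low-rank update `B @ A` scaled by `alpha / r`; because `B` is zero-initialized, the adapted layer starts out identical to the original, and only the two small matrices are trained.

```python
import torch

torch.manual_seed(0)

d, r, alpha = 64, 16, 32      # hidden size, LoRA rank, scaling factor
W = torch.randn(d, d)         # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # trainable down-projection
B = torch.zeros(d, r)         # trainable up-projection, zero-initialized

x = torch.randn(1, d)

base = x @ W.T
lora = x @ (W + (alpha / r) * (B @ A)).T

# B is zero at init, so the adapted layer matches the base layer exactly
print(torch.allclose(base, lora))  # True

# Trainable parameters: 2*d*r for the adapter vs d*d for the full weight
print(A.numel() + B.numel(), "vs", W.numel())  # 2048 vs 4096
```

This is why LoRA fits on one GPU: only `2*d*r` parameters need gradients and optimizer state, while the full weight stays frozen (and, with bitsandbytes, quantized).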

PREREQUISITES

  • Python 3.8+
  • pip install transformers peft bitsandbytes accelerate torch
  • A CUDA-enabled GPU (roughly 10–12 GB of VRAM for an 8B model in 4-bit)
  • Basic knowledge of PyTorch and Hugging Face Transformers
  • Access to the gated meta-llama/Llama-3.1-8B-Instruct repo (accept the license on Hugging Face, then run huggingface-cli login)

Setup environment

Install the required Python packages for LoRA training with quantization and acceleration support.

bash
pip install transformers peft bitsandbytes accelerate torch

Step-by-step LoRA training

This example shows how to load a pretrained model with 4-bit quantization, apply LoRA adapters, and fine-tune on a single GPU.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load tokenizer and model with 4-bit quantization
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Passing load_in_4bit directly to from_pretrained is deprecated;
# use a BitsAndBytesConfig instead
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Cast layer norms/embeddings to full precision and enable gradient
# checkpointing so the quantized model trains stably
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Prepare dummy input
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# Forward pass example
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
print(f"Initial loss: {loss.item():.4f}")

# Typical training loop setup omitted for brevity
# Use accelerate or torch training loop with optimizer and scheduler
output
Initial loss: 5.4321
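The omitted training loop follows the standard PyTorch pattern. The sketch below uses a tiny stand-in model (one frozen layer plus a trainable adapter, both names illustrative) so it runs without downloading weights; with the real PEFT model you would swap in your model, a DataLoader, and a learning-rate scheduler, but the loop structure is identical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the PEFT model: a frozen "base" layer plus a small trainable
# adapter, so the loop structure runs without downloading any weights
base = nn.Linear(16, 16)
for p in base.parameters():
    p.requires_grad = False          # frozen, like the quantized base model
adapter = nn.Linear(16, 16)          # trainable, like the LoRA matrices

def forward(x):
    return base(x) + adapter(x)

# The optimizer sees only trainable parameters -- on a PEFT model, the same
# effect comes from filtering model.parameters() on p.requires_grad
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-2)

x = torch.randn(8, 16)
target = torch.randn(8, 16)

losses = []
for step in range(50):
    loss = nn.functional.mse_loss(forward(x), target)
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Because only the adapter receives gradients, optimizer state stays small; this is the same property that keeps full-size LoRA runs inside single-GPU VRAM.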

Common variations

  • Use accelerate for mixed precision and efficient single-GPU training.
  • Switch from 4-bit to 8-bit quantization (load_in_8bit) if 4-bit training is unstable; note that 8-bit uses roughly twice the VRAM for the weights.
  • Change target_modules in LoraConfig to adapt to different model architectures.
  • Use transformers.Trainer or custom PyTorch loops depending on your preference.
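On the target_modules variation: the projection-layer names differ by model family. The table below lists conventional choices (treat them as starting points, not guarantees; the authoritative source is the module names of your actual checkpoint, which you can list with `model.named_modules()`).

```python
# Conventional attention-projection names by model family; verify against
# your checkpoint with: print([n for n, _ in model.named_modules()])
TARGET_MODULES = {
    "llama":   ["q_proj", "k_proj", "v_proj", "o_proj"],
    "mistral": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "gpt2":    ["c_attn"],            # fused QKV projection
    "t5":      ["q", "k", "v", "o"],
}

def modules_for(architecture):
    """Return a preset target_modules list, or fail loudly for unknown models."""
    try:
        return TARGET_MODULES[architecture]
    except KeyError:
        raise ValueError(f"No preset for {architecture!r}; inspect the model")

print(modules_for("llama"))  # ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```

Targeting more projections (e.g. all four attention projections instead of just q_proj and v_proj) generally improves quality at the cost of more trainable parameters.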

Troubleshooting tips

  • If you get CUDA out-of-memory errors, reduce batch size or switch to 8-bit quantization.
  • Ensure your GPU drivers and CUDA toolkit are up to date.
  • Check that bitsandbytes is installed correctly for your CUDA version.
  • Verify that transformers and peft versions are compatible.
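A quick way to check the version-compatibility points above is to print what is actually installed. This helper (illustrative, stdlib-only) reports each package's version or flags it as missing:

```python
from importlib import metadata

def pkg_version(name):
    """Return the installed version of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

for pkg in ("transformers", "peft", "bitsandbytes", "accelerate", "torch"):
    print(f"{pkg}: {pkg_version(pkg) or 'NOT INSTALLED'}")
```

For bitsandbytes specifically, running `python -m bitsandbytes` prints a diagnostic that includes the detected CUDA setup, which is the fastest way to spot a CUDA-version mismatch.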

Key Takeaways

  • Use 4-bit quantization with bitsandbytes to reduce memory footprint for LoRA training on a single GPU.
  • Configure LoraConfig carefully to target appropriate model modules for efficient fine-tuning.
  • Leverage accelerate or mixed precision to optimize training speed and stability on limited hardware.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct