Intermediate · 3 min read

How to merge LoRA weights into base model

Quick answer
To merge LoRA weights into a base model, load the base model with transformers, attach the trained adapter with peft's PeftModel.from_pretrained(base_model, adapter_path), then call model = model.merge_and_unload(). This folds the LoRA weights into the base model permanently, so inference runs without any adapter overhead.

PREREQUISITES

  • Python 3.8+
  • pip install transformers peft torch
  • Basic knowledge of PyTorch and Hugging Face Transformers

Setup

Install the required Python packages transformers, peft, and torch to work with base models and LoRA adapters.

bash
pip install transformers peft torch

Step by step

This example shows how to load a base model and a LoRA adapter, then merge the LoRA weights into the base model for inference.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.1-8B-Instruct"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map="auto")

# Load LoRA adapter
lora_model_path = "./lora_adapter"
lora_model = PeftModel.from_pretrained(base_model, lora_model_path)

# Merge LoRA weights into base model
merged_model = lora_model.merge_and_unload()

# Save merged model for deployment
merged_model.save_pretrained("./merged_model")

# Load tokenizer and save it alongside the merged model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.save_pretrained("./merged_model")

# Test merged model inference (use the model's own device rather than hard-coding "cuda")
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(merged_model.device)
outputs = merged_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello, how are you? I am doing well, thank you.
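Under the hood, merging folds the low-rank update into the frozen weight: W_merged = W + (alpha/r) · B · A. The toy sketch below demonstrates that arithmetic with plain torch tensors (the shapes and scaling factor are illustrative, not tied to any real checkpoint):

```python
import torch

d, k, r, alpha = 8, 8, 2, 4  # toy dimensions and LoRA scaling

W = torch.randn(d, k)        # frozen base weight
A = torch.randn(r, k)        # LoRA down-projection
B = torch.randn(d, r)        # LoRA up-projection

# Adapted inference computes the base path plus the scaled low-rank path;
# merging precomputes the sum so inference is a single matmul.
W_merged = W + (alpha / r) * (B @ A)

x = torch.randn(1, k)
y_adapter = x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # base + adapter path
y_merged = x @ W_merged.T                            # merged path

print(torch.allclose(y_adapter, y_merged, atol=1e-5))  # True
```

This is exactly why the merged model has no extra latency: the adapter's contribution is baked into the weight matrix before inference.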

Common variations

  • Quantized bases (loaded via BitsAndBytesConfig with load_in_4bit=True) save memory at inference time, but merging into a 4-bit base is lossy or unsupported; load the base in fp16/bf16 before calling merge_and_unload().
  • Merge LoRA weights after fine-tuning to reduce inference overhead.
  • Use get_peft_model() with a LoraConfig to attach a fresh adapter for training; use PeftModel.from_pretrained() to load an already-trained adapter for merging.

Troubleshooting

  • If you hit CUDA out-of-memory errors, load the model in half precision (torch_dtype=torch.float16) with device_map="auto", and keep generation batch sizes small.
  • If merge_and_unload() is not available, ensure you have the latest peft library installed.
  • Check that the LoRA adapter matches the base model architecture exactly.
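One way to catch a base-model mismatch before merging is to inspect the adapter's saved config: peft records the base model id in adapter_config.json inside the adapter directory. A small helper (the adapter path and expected id below are illustrative):

```python
import json
from pathlib import Path

def adapter_base_model(adapter_dir: str) -> str:
    """Return the base model id the LoRA adapter was trained against."""
    cfg = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    return cfg["base_model_name_or_path"]

# Example check before merging (paths are illustrative):
# assert adapter_base_model("./lora_adapter") == "meta-llama/Llama-3.1-8B-Instruct"
```

If the recorded id differs from the model you loaded, the adapter's target module names and shapes may not line up, which is the usual cause of shape-mismatch errors at merge time.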

Key Takeaways

  • Use merge_and_unload() from peft to permanently merge LoRA weights into the base model.
  • Merging LoRA weights reduces inference latency by removing adapter overhead.
  • Always verify LoRA adapter compatibility with the base model architecture before merging.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct