How-to · Intermediate · 3 min read

How to use LoRA merge for deployment

Quick answer
Use LoRA merge to fold low-rank adaptation weights into the base model weights, producing a single deployable model with no adapter overhead at inference time. Load the base model and the LoRA weights, merge them (e.g., with the peft library's merge_and_unload()), and save the merged model for deployment.
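Conceptually, merging just folds the adapter's low-rank update into the frozen weights: W_merged = W + (alpha / r) · B · A. A minimal pure-Python sketch of that arithmetic with toy matrices (not the peft implementation):

```python
# A LoRA adapter stores two small matrices A (r x k) and B (d x r);
# merging adds their scaled product to the frozen base weight W (d x k).

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update (alpha / r) * (B @ A) into W."""
    delta = matmul(B, A)  # d x k update with rank <= r
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: d = 2, k = 2, r = 1
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]           # r x k
B = [[0.5], [0.25]]        # d x r
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.5, 2.0]]
```

After this addition the adapter matrices are no longer needed, which is why a merged model has no runtime LoRA overhead.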

Prerequisites

  • Python 3.8+
  • pip install torch transformers peft
  • Access to base model and LoRA fine-tuned weights

Setup

Install necessary libraries for loading and merging LoRA weights with the base model.

bash
pip install torch transformers peft

Step by step

This example shows how to load a base Hugging Face model and LoRA weights, merge them into a single model, and save it for deployment.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Load LoRA fine-tuned model
lora_model_path = "./lora_finetuned"
lora_model = PeftModel.from_pretrained(base_model, lora_model_path)

# Merge LoRA weights into base model
merged_model = lora_model.merge_and_unload()

# Save merged model for deployment
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Test merged model inference
inputs = tokenizer("Hello, LoRA merged model!", return_tensors="pt")
outputs = merged_model.generate(**inputs, max_new_tokens=12)  # max_new_tokens is preferred over max_length
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello, LoRA merged model! <generated continuation varies with the fine-tuned weights>

Common variations

  • Use different base models such as GPT-Neo or LLaMA, provided the LoRA weights were trained against that same base.
  • Perform the merge on GPU for faster execution by moving the models to a CUDA device first.
  • Export the merged model to ONNX or TorchScript to run it under optimized inference runtimes.

Troubleshooting

  • If you get shape mismatch errors, verify the base model and LoRA weights are compatible versions.
  • If merge_and_unload() is not available, update the peft library to the latest version.
  • If you run out of GPU memory during merging, move the models to CPU or use a machine with more GPU memory.

Key Takeaways

  • Merge LoRA weights into the base model to simplify deployment and reduce runtime overhead.
  • Use the peft library's merge_and_unload() method for an easy merge process.
  • Always verify model and LoRA weight compatibility to avoid shape mismatch errors.
Verified 2026-04 · gpt2, peft