Code Intermediate medium · 5 min

Merging LoRA weights back into the base model

What you will learn

Convert a LoRA adapter back into a single, standalone model file by merging learned weights into the base model's parameters.

Why this matters

LoRA adapters are efficient for training but require both the base model and adapter at inference. Merging produces a single deployable model that doesn't need special loading code, simplifies versioning, and reduces operational complexity in production.

Skip if: You need to A/B test or serve multiple fine-tuned variants alongside the original base model, especially in a single inference service; keep them as separate adapters instead. Merging is one-way: the adapter cannot be extracted from the merged weights, so you can only go back if you kept the original adapter files and the base model.

Explanation

What it is: Merging takes the learned LoRA weights (the small delta matrices) and fuses them permanently into the base model's parameters. The result is a standard model file with no adapter artifacts.

How it works mechanically: LoRA training learns two low-rank matrices, A and B, whose product is the weight update: ΔW = B * A. For every LoRA layer, merging computes W_merged = W_original + (B * A * scaling_factor), where scaling_factor is lora_alpha / r in the standard LoRA formulation. After merging, the adapter layers are removed and you have a single model checkpoint that matches base+adapter at inference (up to floating-point rounding) but loads like any standard model.
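The merge formula can be checked on a toy layer. The sketch below uses NumPy rather than PEFT, and the layer sizes, rank, and lora_alpha value are assumptions chosen only for illustration. It compares the adapter path (base output plus scaled low-rank correction) with the merged path (one matmul against the fused weight).

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 6, 4, 2      # toy layer sizes and LoRA rank (illustrative)
lora_alpha = 4                # assumed hyperparameter value
scaling = lora_alpha / r      # standard LoRA scaling factor

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # LoRA down-projection
B = rng.normal(size=(d_out, r))      # LoRA up-projection
x = rng.normal(size=(3, d_in))       # batch of 3 inputs

# Adapter path: base output plus the scaled low-rank correction.
y_adapter = x @ W.T + scaling * (x @ A.T @ B.T)

# Merged path: fold the correction into the weight once, then one matmul.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))
```

The merged weight has the same shape as the base weight, which is why the merged checkpoint loads like an ordinary model.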

When to use it: Merge when you're ready to ship: after validation is complete and you've confirmed the merged model's outputs match the adapter version within floating-point tolerance. Keep adapters during experimentation; merge for production deployment.

Analogy

It's like baking a cake: during development, you keep the frosting (LoRA) separate so you can scrape it off and try a different flavor. Once you're happy, you frost the cake permanently. You can't remove that frosting without destroying the cake, so you only do it when you're certain it's the final version.

Code

python

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-hf"
adapter_model_name = "./my-lora-adapter"  # Path to fine-tuned LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

print(f"Before merge - Model type: {type(model)}")
print(f"Number of LoRA modules: {sum(1 for n, m in model.named_modules() if 'lora' in n.lower())}")

merged_model = model.merge_and_unload()

print(f"After merge - Model type: {type(merged_model)}")
print(f"Number of LoRA modules: {sum(1 for n, m in merged_model.named_modules() if 'lora' in n.lower())}")

output_dir = "./merged-llama-2-7b"
merged_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"\nMerged model saved to {output_dir}")

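test_input = tokenizer("The future of AI is", return_tensors="pt").to(merged_model.device)
with torch.no_grad():
    # max_new_tokens bounds only the continuation, not prompt + continuation
    output = merged_model.generate(**test_input, max_new_tokens=20)
# skip_special_tokens drops markers such as the leading <s> from the printed text
print(f"\nGenerated text: {tokenizer.decode(output[0], skip_special_tokens=True)}")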

Output

Before merge - Model type: <class 'peft.peft_model.PeftModelForCausalLM'>
Number of LoRA modules: 16
After merge - Model type: <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
Number of LoRA modules: 0

Merged model saved to ./merged-llama-2-7b

Generated text: The future of AI is shaped by our ability to solve complex problems efficiently. We need systems that can learn, adapt

What just happened?

The code loaded a LoRA-adapted model from disk using <code>AutoPeftModelForCausalLM</code> and confirmed that LoRA modules were present. The exact count depends on the adapter's target modules (for example q_proj and v_proj) and on how PEFT names its submodules, because the check counts every submodule with "lora" in its name; treat the 16 shown above as illustrative. Then <code>merge_and_unload()</code> permanently fused the LoRA weights into the base model parameters and removed the <code>PeftModel</code> wrapper, returning a standard <code>LlamaForCausalLM</code> object with zero LoRA modules. The merged model was saved as a standalone checkpoint that no longer requires PEFT to load.

Common gotcha

Developers often call merge_and_unload() and discard its return value. The method returns the unwrapped base model with the LoRA weights folded in, while the old variable still points at the PeftModel wrapper, which should not be used afterward. Always reassign: model = model.merge_and_unload(). Also, merge_adapter() is not a substitute: it folds the LoRA weights into the base weights but leaves the LoRA layers and the PEFT wrapper in place, so the object is still a PeftModel and is not a standalone checkpoint.

Error recovery

AttributeError: 'LlamaForCausalLM' object has no attribute 'merge_and_unload'

You're calling merge_and_unload() on a base model, not a PEFT model. Load using AutoPeftModelForCausalLM, not AutoModelForCausalLM. Check that your checkpoint path contains the PEFT files: adapter_config.json plus the adapter weights (adapter_model.safetensors in recent PEFT versions, adapter_model.bin in older ones).
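A quick pre-flight check can catch a wrong path before any weights are loaded. The helper below is a sketch using only the standard library; looks_like_peft_adapter is a hypothetical name, and the file names it looks for are the usual PEFT defaults rather than an exhaustive list.

```python
from pathlib import Path

ADAPTER_CONFIG = "adapter_config.json"
# Recent PEFT versions save safetensors; older ones save a .bin file.
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")


def looks_like_peft_adapter(path):
    """Return True if `path` holds an adapter config and a weights file."""
    folder = Path(path)
    has_config = (folder / ADAPTER_CONFIG).is_file()
    has_weights = any((folder / name).is_file() for name in ADAPTER_WEIGHTS)
    return has_config and has_weights


# Example: check the adapter directory used earlier in this lesson.
print(looks_like_peft_adapter("./my-lora-adapter"))
```

If this prints False, the directory is either a full model checkpoint or an incomplete adapter save.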

RuntimeError: CUDA out of memory

Merging runs on whatever device holds the weights, so a model loaded onto the GPU is merged in GPU memory. Either load with device_map='cpu', or move the model to CPU first with model.to('cpu'). For large models, CPU merging is fine: it's a one-time offline operation.

ValueError: Attempting to merge a model that has not been properly saved

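The adapter was trained but never saved, or the path doesn't exist. The exact message varies by PEFT version; a missing adapter usually surfaces at load time as a complaint that adapter_config.json cannot be found at the given path. Ensure you saved the adapter after training with trainer.model.save_pretrained(output_dir) and that output_dir contains adapter_config.json and the adapter weights (adapter_model.safetensors or adapter_model.bin).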

Experienced dev note

In production, keep the adapter separate until final validation. Many teams merge too early and then discover they need to adjust hyperparameters or try a different seed. A merged checkpoint can't be un-merged, so rolling back means returning to the saved adapter and base model, or retraining if the adapter was not kept. Build your release pipeline to: (1) validate the adapter in an A/B test, (2) merge only after sign-off, (3) version the merged checkpoint separately from the base model. This adds a little time to each release but saves far more when you catch a bug pre-merge. Also: merging a float16 model in float16 is fast (roughly 10 seconds for 7B, depending on hardware), but some teams accidentally load in float32 and run out of VRAM, so explicitly specify torch_dtype when loading.

Check your understanding

If you merge a LoRA adapter trained on a base model, then someone loads the merged model without any PEFT code, will the model produce identical outputs to loading base + adapter together, and why or why not?

Show answer hint

A correct answer must mention that outputs will match within floating-point precision, because merging computes W_original + (LoRA_B @ LoRA_A * scaling), which is mathematically the same as the base output plus the scaled low-rank output that PEFT adds at inference. The key insight is that merging doesn't change the computation: it just bakes it into permanent weights. Small differences can still appear because the merged weight is rounded once into the stored dtype.
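The "within floating-point precision" caveat can be made concrete. The sketch below, again in NumPy with assumed toy sizes and weight scales, stores the merged weight in float16 and compares its output against a float64 base+adapter reference. The tolerance of 1e-3 is an illustrative choice, not a recommended threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, scaling = 64, 4, 2.0   # toy sizes; scaling stands for lora_alpha / r

W = rng.normal(scale=0.02, size=(d, d))   # base weight
A = rng.normal(scale=0.02, size=(r, d))   # LoRA down-projection
B = rng.normal(scale=0.02, size=(d, r))   # LoRA up-projection
x = rng.normal(size=(8, d))

# Reference: base + adapter evaluated in float64.
y_ref = x @ W.T + scaling * (x @ A.T @ B.T)

# Merge, then round the merged weight to float16 as a saved checkpoint would.
W_merged_fp16 = (W + scaling * (B @ A)).astype(np.float16)
y_fp16 = x @ W_merged_fp16.astype(np.float64).T

max_abs_diff = float(np.max(np.abs(y_ref - y_fp16)))
print(f"max abs difference: {max_abs_diff:.2e}")
print(np.allclose(y_ref, y_fp16, atol=1e-3))
```

The same pattern, comparing outputs under an explicit tolerance rather than expecting bit-for-bit equality, applies when validating a real merged model against its adapter version.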

VERSION merge_and_unload() merges the LoRA weights and removes the LoRA layers in a single call; no separate unload step is needed afterward. Its optional arguments (for example safe_merge and adapter_names) were added over successive PEFT releases, so check the documentation for the peft version you have installed and prefer a recent stable release.

Next, learn how to load and test the merged model to confirm inference quality matches the LoRA adapter, ensuring no numerical drift occurred during the merge operation.
