How-to · Intermediate · 3 min read

How to use LoRA adapters with vLLM

Quick answer
Load your base model, merge your trained LoRA adapter into it with the peft library, save the merged checkpoint, and point vLLM at that checkpoint for inference. LoRA injects small low-rank matrices into the model weights, so you get fine-tuned behavior without retraining or storing a full copy of the model.
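To make the "low-rank adapter" idea concrete, here is a minimal NumPy sketch (illustrative dimensions only, not tied to any real checkpoint): LoRA adds a low-rank update B @ A, scaled by alpha/r, to a frozen weight matrix. Merging the adapter, as peft's merge_and_unload does, just folds that update into the weight once.

```python
import numpy as np

# LoRA replaces a frozen weight W with W + (alpha / r) * B @ A,
# where A (r x d_in) and B (d_out x r) are the small trained matrices.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 8

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # LoRA down-projection
B = np.zeros((d_out, r))                 # LoRA up-projection (zero-initialized)

W_merged = W + (alpha / r) * B @ A       # merging folds the update into W

x = rng.standard_normal(d_in)
# With B still at its zero init, the merged weight equals the base weight
assert np.allclose(W_merged @ x, W @ x)
```

After training, B is no longer zero, and the merged weight differs from the base weight by exactly the scaled low-rank product.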

PREREQUISITES

  • Python 3.8+
  • pip install vllm peft transformers torch
  • Access to a compatible base model checkpoint

Setup

Install the required Python packages: vllm, peft, transformers, and torch.

bash
pip install vllm peft transformers torch

Step by step

Load your base model with transformers, merge your LoRA adapter into it using peft, save the merged checkpoint, and point vLLM at that directory for inference.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from vllm import LLM, SamplingParams
import torch

# Load base model and tokenizer
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16, device_map="auto")

# Load LoRA adapter
lora_adapter_path = "./lora_adapter"
adapted_model = PeftModel.from_pretrained(base_model, lora_adapter_path)
adapted_model.eval()

# Merge the adapter into the base weights and save the result;
# vLLM loads a standard checkpoint directory, not a raw PEFT adapter
merged_model = adapted_model.merge_and_unload()
merged_model.save_pretrained("./adapted_model")
tokenizer.save_pretrained("./adapted_model")

# Initialize vLLM with the merged checkpoint
llm = LLM(model="./adapted_model")

# Prepare prompt
prompt = "Translate English to French: 'Hello, how are you?'"

# Generate output
outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=50))

# Print generated text
print(outputs[0].outputs[0].text)
output
Bonjour, comment ça va ?

Common variations

  • Use different base models compatible with LoRA and vLLM.
  • Adjust SamplingParams for temperature, max tokens, or top-p sampling.
  • Run inference asynchronously by integrating with async event loops.
  • Load LoRA adapters from Hugging Face Hub or local paths.
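The last variation deserves a sketch of its own: recent vLLM versions can serve a LoRA adapter directly, with no merge step, via built-in LoRA support. This is a sketch under assumptions (a GPU is available, and "./lora_adapter" is a placeholder path to your trained adapter); check your vLLM version's documentation for the exact options it supports.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable vLLM's built-in LoRA support on the unmodified base model
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_lora=True)

# Attach the adapter per request: (adapter name, unique int id, adapter path)
outputs = llm.generate(
    ["Translate English to French: 'Hello, how are you?'"],
    SamplingParams(temperature=0.7, max_tokens=50),
    lora_request=LoRARequest("my_adapter", 1, "./lora_adapter"),
)
print(outputs[0].outputs[0].text)
```

Because the adapter is applied per request, this approach lets one server multiplex several adapters over a single copy of the base weights, which the merge-and-save workflow cannot do.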

Troubleshooting

  • If you see CUDA out-of-memory errors, reduce batch size or use smaller models.
  • Ensure peft and transformers versions are compatible.
  • Verify LoRA adapter path is correct and contains necessary files.
  • Use torch_dtype=torch.float16 and device_map="auto" to optimize GPU memory usage.
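For the version-compatibility bullet above, a quick standard-library sketch prints the installed version of each package (or flags it as missing), which is usually the first thing to check against the compatibility notes in the vllm and peft release pages:

```python
from importlib.metadata import version, PackageNotFoundError

# Report the installed version of each required package
for pkg in ("vllm", "peft", "transformers", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```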

Key Takeaways

  • Use the peft library to load a LoRA adapter and merge it into your base model before inference with vLLM.
  • Save the merged model locally and point vLLM at it to run fast inference over your fine-tuned weights.
  • Adjust SamplingParams in vLLM to control generation behavior like temperature and max tokens.
  • Ensure compatible versions of transformers, peft, and vLLM to avoid runtime errors.
  • Optimize GPU memory by using half precision and automatic device mapping when loading models.
Verified 2026-04 · meta-llama/Llama-2-7b-chat-hf