How to serve a fine-tuned model with vLLM
Quick answer
To serve a fine-tuned model with
vLLM, first export your fine-tuned weights in a compatible format (e.g., Hugging Face Transformers). Then, load the fine-tuned model into vLLM using its Python API or CLI, specifying the model path. Finally, run the vLLM server to handle inference requests efficiently with low latency.
Prerequisites
- Python 3.8+
- pip install vllm (version 0.1.0 or newer)
- Fine-tuned model weights in Hugging Face Transformers format
- Basic knowledge of the command line and Python
Setup
Install vLLM via pip and prepare your fine-tuned model in Hugging Face format. Ensure Python 3.8 or higher is installed.
pip install vllm
Step by step
Load your fine-tuned model with vLLM and start the server to serve inference requests.
from vllm import LLM, SamplingParams
# Path to your fine-tuned model directory
model_path = "/path/to/fine-tuned-model"
# Initialize the vLLM model
llm = LLM(model=model_path)
# Example inference: generate() takes a list of prompts and a SamplingParams object
outputs = llm.generate(["What is the capital of France?"], SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
# To serve via CLI, run:
# vllm serve /path/to/fine-tuned-model --host 0.0.0.0 --port 8000
Output:
Paris
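The CLI server exposes an OpenAI-compatible REST API, so any HTTP client can query it. A sketch of building a request body for the /v1/completions endpoint; the completion_request helper is illustrative, not part of vLLM:

```python
import json

def completion_request(model: str, prompt: str, max_tokens: int = 10) -> str:
    """Build a JSON body for vLLM's OpenAI-compatible /v1/completions endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = completion_request("/path/to/fine-tuned-model", "What is the capital of France?")
```

POST the body to http://localhost:8000/v1/completions with Content-Type: application/json, or point an OpenAI client library at the server's base URL.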
Common variations
- Use the vLLM CLI for serving: vllm serve /path/to/model
- Serve different fine-tuned models by changing the model path
- Use the async engine for concurrent inference requests
import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def main():
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="/path/to/fine-tuned-model"))
    params = SamplingParams(max_tokens=32)
    # generate() streams partial outputs per request_id; keep the last one as final
    async def run(prompt, request_id):
        async for out in engine.generate(prompt, params, request_id):
            final = out
        return final.outputs[0].text
    texts = await asyncio.gather(run("Hello, world!", "0"), run("What is AI?", "1"))
    for text in texts:
        print(text)

asyncio.run(main())
Output:
Hello, world! Artificial intelligence (AI) is...
Troubleshooting
- If you see model loading errors, verify the model path and format are correct.
- For performance issues, check GPU availability and batch size settings.
- Ensure your vLLM version supports the architecture of your fine-tuned model.
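For the compatibility check above, it helps to confirm which vLLM version is actually installed. A small sketch using the standard library; the installed_version helper is our own:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_version(package: str) -> Optional[str]:
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

For example, installed_version("vllm") returns a version string like "0.4.2" when vLLM is installed, and None otherwise.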
Key Takeaways
- Export fine-tuned models in Hugging Face Transformers format for compatibility with vLLM.
- Use vLLM's Python API or CLI to load and serve your fine-tuned model efficiently.
- Async API enables batch inference for higher throughput.
- Verify model paths and environment setup to avoid loading errors.
- Adjust batch size and hardware settings to optimize serving performance.