How to serve a fine-tuned model with vLLM
Quick answer
To serve a fine-tuned model with
vLLM, first export your fine-tuned weights in a compatible format (e.g., Hugging Face Transformers). Then, load the fine-tuned model into vLLM using its Python API or CLI, specifying the model path. Finally, run the vLLM server to handle inference requests efficiently with low latency.
Prerequisites
- Python 3.8+
- pip install vllm (version 0.1.0 or newer)
- Fine-tuned model weights in Hugging Face Transformers format
- Basic knowledge of the command line and Python
Setup
Install vLLM via pip and prepare your fine-tuned model in Hugging Face format. Ensure Python 3.8 or higher is installed.
pip install vllm
Step by step
Load your fine-tuned model with vLLM and start the server to serve inference requests.
from vllm import LLM, SamplingParams
# Path to your fine-tuned model directory
model_path = "/path/to/fine-tuned-model"
# Initialize the vLLM model
llm = LLM(model=model_path)
# Example inference: generate() takes a list of prompts and a SamplingParams object
outputs = llm.generate(["What is the capital of France?"], SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
# To serve via CLI, run:
# vllm serve /path/to/fine-tuned-model --host 0.0.0.0 --port 8000
Output:
Paris
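The CLI server exposes an OpenAI-compatible REST API, so any HTTP client can query it. A sketch of building a request body for the /v1/completions endpoint; the completion_request helper is illustrative, not part of vLLM:

```python
import json

def completion_request(model: str, prompt: str, max_tokens: int = 10) -> str:
    """Build a JSON body for vLLM's OpenAI-compatible /v1/completions endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = completion_request("/path/to/fine-tuned-model", "What is the capital of France?")
```

POST the body to http://localhost:8000/v1/completions with Content-Type: application/json, or point an OpenAI client library at the server's base URL.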
Common variations
- Use the vLLM CLI for serving: vllm serve /path/to/model
- Serve different fine-tuned models by changing the model path
- Use the async engine for concurrent inference requests
import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def main():
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="/path/to/fine-tuned-model"))
    params = SamplingParams(max_tokens=32)
    # generate() streams partial outputs per request_id; keep the last one as final
    async def run(prompt, request_id):
        async for out in engine.generate(prompt, params, request_id):
            final = out
        return final.outputs[0].text
    texts = await asyncio.gather(run("Hello, world!", "0"), run("What is AI?", "1"))
    for text in texts:
        print(text)

asyncio.run(main())
Output:
Hello, world! Artificial intelligence (AI) is...
Troubleshooting
- If you see model loading errors, verify the model path and format are correct.
- For performance issues, check GPU availability and batch size settings.
- Ensure your vLLM version supports the architecture of your fine-tuned model.
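For the compatibility check above, it helps to confirm which vLLM version is actually installed. A small sketch using the standard library; the installed_version helper is our own:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_version(package: str) -> Optional[str]:
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

For example, installed_version("vllm") returns a version string like "0.4.2" when vLLM is installed, and None otherwise.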
Key Takeaways
- Export fine-tuned models in Hugging Face Transformers format for compatibility with vLLM.
- Use vLLM's Python API or CLI to load and serve your fine-tuned model efficiently.
- Async API enables batch inference for higher throughput.
- Verify model paths and environment setup to avoid loading errors.
- Adjust batch size and hardware settings to optimize serving performance.