How to load a HuggingFace model in vLLM
Quick answer

Use the `vllm.LLM` class to load a HuggingFace model by passing the model name or local path as the `model` parameter. This enables efficient local inference with HuggingFace-compatible models using vLLM.
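As a minimal sketch (`facebook/opt-125m` stands in as a small example model; any Hub name or local path works):

```python
from vllm import LLM

# facebook/opt-125m is just a small example model; substitute your own
llm = LLM(model="facebook/opt-125m")
print(llm.generate(["Hello"])[0].outputs[0].text)
```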
Prerequisites

- Python 3.8+
- `pip install vllm`
- A HuggingFace model checkpoint, either stored locally or accessible by name on the Hub
Setup
Install the vllm package via pip to enable loading HuggingFace models for efficient inference.
```bash
pip install vllm
```
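To confirm the install succeeded, print the package version:

```python
import vllm

# Verify vLLM imports cleanly and check the installed version
print(vllm.__version__)
```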
Step by step

Load a HuggingFace model in vLLM by passing the model name or local path to the `LLM` constructor, then generate text with the `generate` method.
```python
from vllm import LLM, SamplingParams

# Load a HuggingFace model by Hub name or local path
# (meta-llama/Llama-2-7b-hf is gated; request access on the Hub first)
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate text from a prompt
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(temperature=0.7, max_tokens=50),
)

# Print the generated text
print(outputs[0].outputs[0].text)
```

Output:

```text
Hello, how are you? I'm here to help you with your questions and tasks.
```
Common variations

- Use a local model path instead of a HuggingFace Hub name, e.g., `model="/path/to/model"`.
- Adjust `SamplingParams` for temperature, max tokens, and top-p sampling.
- Run vLLM as a server with the `vllm serve` CLI and query it via an OpenAI-compatible API (see the server sketch after the example below).
```python
from vllm import LLM, SamplingParams

# Load a HuggingFace model from a local path
llm = LLM(model="/models/llama-2-7b")

# Generate with different sampling parameters
outputs = llm.generate(
    ["Explain the theory of relativity."],
    SamplingParams(temperature=0.5, max_tokens=100, top_p=0.9),
)

print(outputs[0].outputs[0].text)
```

Output:

```text
The theory of relativity, developed by Albert Einstein, revolutionized physics by describing the relationship between space, time, and gravity...
```
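For the server variation, the sketch below assumes the server was started with `vllm serve meta-llama/Llama-2-7b-hf --port 8000` and uses the `openai` Python client against vLLM's OpenAI-compatible endpoint; the API key value is arbitrary, since the local server does not check it by default.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
# (assumes: vllm serve meta-llama/Llama-2-7b-hf --port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the served model name
    prompt="Hello, how are you?",
    max_tokens=50,
    temperature=0.7,
)
print(completion.choices[0].text)
```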
Troubleshooting

- If the model cannot be found (e.g., a `RepositoryNotFoundError` or `OSError` from the Hugging Face side), verify that the model name or local path is correct and accessible.
- For CUDA errors, ensure your GPU drivers and CUDA toolkit are properly installed and compatible.
- If inference is slow or fails with out-of-memory errors, check that you have sufficient GPU memory, or try a smaller model variant; the sketch below shows memory-related options.
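A minimal sketch of memory-related knobs (parameter names as in recent vLLM releases; check the `LLM` signature of your installed version):

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",     # small open model, useful for smoke tests
    gpu_memory_utilization=0.85,   # fraction of GPU memory vLLM may reserve
    max_model_len=2048,            # cap context length to shrink the KV cache
)
```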
Key Takeaways

- Use `vllm.LLM(model=...)` to load HuggingFace models by name or local path.
- Control generation with `SamplingParams` (temperature, max tokens, top-p).
- Run vLLM as a server for scalable, OpenAI-compatible API access to HuggingFace models.
- Verify model paths and your CUDA setup to avoid common loading and runtime errors.