How to load a HuggingFace model in vLLM
Quick answer

Use the `vllm.LLM` class to load a HuggingFace model by passing the model name or local path as the `model` parameter. This enables efficient local inference with HuggingFace-compatible models using vLLM.
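As a minimal sketch (`facebook/opt-125m` stands in as a small example model; any Hub name or local path works):

```python
from vllm import LLM

# facebook/opt-125m is just a small example model; substitute your own
llm = LLM(model="facebook/opt-125m")
print(llm.generate(["Hello"])[0].outputs[0].text)
```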
Prerequisites

- Python 3.8+
- `pip install vllm`
- A HuggingFace model checkpoint, either stored locally or accessible by name on the Hub
Setup
Install the vllm package via pip to enable loading HuggingFace models for efficient inference.
```bash
pip install vllm
```
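To confirm the install succeeded, print the package version:

```python
import vllm

# Verify vLLM imports cleanly and check the installed version
print(vllm.__version__)
```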
Step by step

Load a HuggingFace model in vLLM by passing the model name or local path to the `LLM` constructor, then generate text with the `generate` method.
```python
from vllm import LLM, SamplingParams

# Load a HuggingFace model by Hub name or local path
# (meta-llama/Llama-2-7b-hf is gated; request access on the Hub first)
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate text from a prompt
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(temperature=0.7, max_tokens=50),
)

# Print the generated text
print(outputs[0].outputs[0].text)
```

Output:

```text
Hello, how are you? I'm here to help you with your questions and tasks.
```
Common variations

- Use a local model path instead of a HuggingFace Hub name, e.g., `model="/path/to/model"`.
- Adjust `SamplingParams` for temperature, max tokens, and top-p sampling.
- Run vLLM as a server with the `vllm serve` CLI and query it via an OpenAI-compatible API (see the server sketch after the example below).
```python
from vllm import LLM, SamplingParams

# Load a HuggingFace model from a local path
llm = LLM(model="/models/llama-2-7b")

# Generate with different sampling parameters
outputs = llm.generate(
    ["Explain the theory of relativity."],
    SamplingParams(temperature=0.5, max_tokens=100, top_p=0.9),
)

print(outputs[0].outputs[0].text)
```

Output:

```text
The theory of relativity, developed by Albert Einstein, revolutionized physics by describing the relationship between space, time, and gravity...
```
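For the server variation, the sketch below assumes the server was started with `vllm serve meta-llama/Llama-2-7b-hf --port 8000` and uses the `openai` Python client against vLLM's OpenAI-compatible endpoint; the API key value is arbitrary, since the local server does not check it by default.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
# (assumes: vllm serve meta-llama/Llama-2-7b-hf --port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the served model name
    prompt="Hello, how are you?",
    max_tokens=50,
    temperature=0.7,
)
print(completion.choices[0].text)
```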
Troubleshooting

- If the model cannot be found (e.g., a `RepositoryNotFoundError` or `OSError` from the Hugging Face side), verify that the model name or local path is correct and accessible.
- For CUDA errors, ensure your GPU drivers and CUDA toolkit are properly installed and compatible.
- If inference is slow or fails with out-of-memory errors, check that you have sufficient GPU memory, or try a smaller model variant; the sketch below shows memory-related options.
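A minimal sketch of memory-related knobs (parameter names as in recent vLLM releases; check the `LLM` signature of your installed version):

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",     # small open model, useful for smoke tests
    gpu_memory_utilization=0.85,   # fraction of GPU memory vLLM may reserve
    max_model_len=2048,            # cap context length to shrink the KV cache
)
```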
Key Takeaways

- Use `vllm.LLM(model=...)` to load HuggingFace models by name or local path.
- Control generation with `SamplingParams` (temperature, max tokens, top-p).
- Run vLLM as a server for scalable, OpenAI-compatible API access to HuggingFace models.
- Verify model paths and your CUDA setup to avoid common loading and runtime errors.