How to run Llama with vLLM
Quick answer
Use the vllm Python package to run Llama models locally: start the vLLM server with a Llama model checkpoint, then query it through the OpenAI-compatible openai SDK pointed at the local server. This setup gives efficient, low-latency inference with Llama models such as meta-llama/Llama-3.1-8B-Instruct.
Prerequisites
- Python 3.8+
- pip install vllm openai
- A Llama model checkpoint (e.g. meta-llama/Llama-3.1-8B-Instruct)
- An OPENAI_API_KEY environment variable (any dummy value works when querying a local vLLM server)
Setup vLLM server
Install the vllm package and download the Llama model checkpoint. Then start the vLLM server locally on port 8000 with the desired Llama model.
pip install vllm openai
# Download model checkpoint from Hugging Face or Meta's release
# Example command to start server:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Run inference with Python
Use the OpenAI SDK with base_url pointing to the local vLLM server to send chat completion requests to the Llama model.
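Model loading can take a while, so it helps to confirm the server is actually listening before sending requests. A minimal standard-library sketch (the server_ready helper name and defaults are ours, not part of vLLM or the OpenAI SDK):

```python
import socket

def server_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the vLLM server port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(server_ready())  # True once `vllm serve` has finished loading the model
```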
import os
from openai import OpenAI
# No real API key is needed for a local server, but the SDK requires one, so set a dummy value
os.environ["OPENAI_API_KEY"] = "test"
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain chain-of-thought prompting."}]
)
print(response.choices[0].message.content)
Output
Chain-of-thought prompting is a technique where the model is guided to reason step-by-step, improving accuracy on complex tasks.
Common variations
- Use different Llama model sizes by changing the model name in the vllm serve command and in the Python client.
- Run the server on a different port by adjusting the --port flag and base_url.
- Use SamplingParams from vllm for advanced generation control when calling the vLLM Python API directly.
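When going through the server rather than the Python API, generation controls like temperature and max_tokens travel as standard fields in the OpenAI-compatible request body. A sketch of assembling that payload by hand, useful for plain-HTTP clients (the build_chat_payload helper is ours):

```python
import json

def build_chat_payload(model: str, prompt: str,
                       temperature: float = 0.7, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("meta-llama/Llama-3.1-8B-Instruct",
                             "Explain chain-of-thought prompting.")
print(json.dumps(payload, indent=2))
```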
Troubleshooting
- If you get connection errors, ensure the vLLM server is running and accessible at the specified port.
- Check that the model checkpoint path is correct and compatible with vLLM.
- For GPU memory errors, try smaller Llama models or use 8-bit/4-bit quantized versions if supported.
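For connection or wrong-model errors, the server's /v1/models endpoint (part of the OpenAI-compatible API vLLM exposes) reports which model IDs are actually loaded. A standard-library sketch, with a helper name of our choosing:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 3.0):
    """Return the model IDs the vLLM server reports, or an error string if unreachable."""
    try:
        with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except URLError as exc:
        return f"unreachable: {exc.reason}"

print(list_served_models())
```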
Key Takeaways
- Start the vLLM server with the desired Llama model checkpoint using the CLI.
- Query the running vLLM server via the OpenAI SDK with base_url set to the local server endpoint.
- Adjust the model and server port easily for different use cases and hardware constraints.