How to serve a Llama model with vLLM
Quick answer
Use the vllm CLI to serve a Llama model locally with vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000. Then query it via the OpenAI-compatible Python SDK by setting base_url="http://localhost:8000/v1" and calling client.chat.completions.create() with your prompt.

Prerequisites
- Python 3.9+ (check the vLLM installation docs for the currently supported versions)
- pip install vllm openai
- The meta-llama/Llama-3.1-8B-Instruct weights downloaded or accessible via the Hugging Face Hub
- OpenAI SDK v1+
Setup
Install the vllm and openai Python packages, and make sure the Llama model weights are available locally or via the Hugging Face Hub. Set environment variables (for example, a Hugging Face access token) if the model is gated.
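The official meta-llama checkpoints are gated on the Hugging Face Hub, so one way to let vLLM download the weights is to export an access token before starting the server (a sketch; the token value below is a placeholder, not a real token):

```shell
# The official meta-llama checkpoints are gated on the Hugging Face Hub.
# Export an access token so vLLM can download the weights.
# The value below is a placeholder; substitute your own token.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```

The huggingface_hub library, which vLLM uses for downloads, reads HF_TOKEN from the environment.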
pip install vllm openai

Step by step
Start the vLLM server with the Llama model on port 8000, then query it using the OpenAI Python SDK with the base_url pointing to the local server.
from openai import OpenAI

# Start the vLLM server in a separate terminal:
# vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Python client code to query the running server.
# vLLM does not verify the API key by default, so any placeholder string works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, vLLM!"}],
)
print(response.choices[0].message.content)

Output
Hello, vLLM! How can I assist you today?
Common variations
- Use a different Llama version by changing the model name in both the CLI command and the client code.
- Run the server on a different port by changing the --port argument (and update base_url to match).
- Use streaming (stream=True) or asynchronous calls (the AsyncOpenAI client) with the OpenAI SDK.
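The streaming variation can be sketched as follows, assuming the server from the steps above is running on localhost:8000. The stream_chat and collect_stream_text helpers are illustrative names defined here, not part of the SDK:

```python
def collect_stream_text(deltas):
    """Join the text deltas of a streamed chat response into one string."""
    return "".join(d for d in deltas if d)


def stream_chat(prompt, base_url="http://localhost:8000/v1"):
    """Stream a chat completion from a running vLLM server, printing
    tokens as they arrive, and return the full response text."""
    from openai import OpenAI  # assumes `pip install openai`

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the server sends incremental chunks instead of one response
    )
    deltas = []
    for chunk in stream:
        # Each chunk carries a partial message delta; content may be None.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            deltas.append(chunk.choices[0].delta.content)
    print()
    return collect_stream_text(deltas)
```

With the server running, stream_chat("Hello, vLLM!") prints the reply token by token rather than waiting for the full completion.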
Troubleshooting
- If the client cannot connect, verify the server is running and the port matches.
- Ensure the model path or name is correct and accessible.
- Check for firewall or network issues blocking localhost:8000.
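A quick way to confirm the server is reachable before debugging further is to query the /v1/models endpoint that vLLM's OpenAI-compatible server exposes. A minimal sketch using only the standard library (models_url and check_server are hypothetical helper names):

```python
import json
import urllib.error
import urllib.request


def models_url(base_url):
    """Build the model-listing endpoint URL from the client base URL."""
    return base_url.rstrip("/") + "/models"


def check_server(base_url="http://localhost:8000/v1"):
    """Return True if a vLLM server answers on its /v1/models endpoint."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=5) as resp:
            served = json.load(resp)
            # The endpoint lists the models the server is currently serving.
            print("Models served:", [m["id"] for m in served.get("data", [])])
            return True
    except (urllib.error.URLError, OSError) as exc:
        print("Server not reachable:", exc)
        return False
```

If check_server() returns False, the server is not up, is on a different port, or is blocked; if it returns True but requests still fail, compare the printed model IDs against the model name in your client code.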
Key Takeaways
- Use the vllm serve CLI command to start a local Llama model server.
- Query the running server with the OpenAI SDK by setting base_url to the server URL.
- Adjust the model name and port as needed for different setups.
- Ensure the server is running before sending requests to avoid connection errors.