
How to run Llama with vLLM

Quick answer
Use the vllm package to serve a Llama checkpoint through vLLM's OpenAI-compatible HTTP server, then query it with the openai SDK by pointing base_url at the local endpoint. This gives you high-throughput, low-latency local inference with models such as meta-llama/Llama-3.1-8B-Instruct.

PREREQUISITES

  • Python 3.9+ and a CUDA-capable GPU (recommended for practical speeds)
  • pip install vllm openai
  • Access to the checkpoint on Hugging Face (meta-llama repos are gated: accept the license, then authenticate with huggingface-cli login)
  • No real OpenAI API key needed; the local server accepts any placeholder value

Set up the vLLM server

Install the vllm package and start the server locally on port 8000 with the desired Llama model. The checkpoint is downloaded from Hugging Face automatically the first time you serve it.

bash
pip install vllm openai

# The checkpoint is fetched from Hugging Face automatically on first run.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
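The server takes a while to load model weights before it accepts requests, so it helps to wait until it is actually listening. A minimal readiness check against the OpenAI-compatible /models endpoint (wait_for_server is a hypothetical helper, not part of vLLM):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout: float = 60.0) -> bool:
    """Poll the /models endpoint until it responds with 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/models"
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server not up yet; retry
    return False

# e.g. call wait_for_server("http://localhost:8000/v1") before creating the client
```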

Run inference with Python

Use the OpenAI SDK with base_url pointing to the local vLLM server to send chat completion requests to the Llama model.

python
import os
from openai import OpenAI

# The local server does not validate the API key, so any placeholder works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain chain-of-thought prompting."}]
)

print(response.choices[0].message.content)
output (abridged; the model's actual wording will vary)
Chain-of-thought prompting is a technique where the model is guided to reason step-by-step, improving accuracy on complex tasks.
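Under the hood the SDK simply POSTs JSON to /v1/chat/completions, so the same request can be reproduced with nothing but the standard library. A sketch of the request body (build_chat_request is a hypothetical helper; the field names follow the OpenAI chat completions schema):

```python
import json

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> bytes:
    """Build the JSON body the OpenAI-compatible endpoint expects."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request(
    "meta-llama/Llama-3.1-8B-Instruct",
    "Explain chain-of-thought prompting.",
)
# POST this body to http://localhost:8000/v1/chat/completions
# with the header Content-Type: application/json
```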

Common variations

  • Use different Llama model sizes by changing the model name in the vllm serve command and in the Python client.
  • Run the server on a different port by adjusting the --port flag and base_url.
  • Use SamplingParams with vLLM's offline Python API (LLM.generate) for fine-grained generation control without running a server.
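Since the first two variations reduce to swapping a model name and a port, it can help to derive the serve command and the client base_url from the same two values so they never drift apart (serve_command and client_base_url are hypothetical helpers):

```python
def serve_command(model: str, port: int = 8000) -> str:
    """CLI line that launches the vLLM server for a given checkpoint."""
    return f"vllm serve {model} --port {port}"

def client_base_url(port: int = 8000) -> str:
    """Matching base_url for the OpenAI client."""
    return f"http://localhost:{port}/v1"

print(serve_command("meta-llama/Llama-3.1-70B-Instruct", port=8001))
# vllm serve meta-llama/Llama-3.1-70B-Instruct --port 8001
```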

Troubleshooting

  • If you get connection errors, ensure the vLLM server is running and accessible at the specified port.
  • Check that the model checkpoint path is correct and compatible with vLLM.
  • For GPU out-of-memory errors, try a smaller Llama model, lower --max-model-len, or a quantized (e.g. AWQ or GPTQ) checkpoint.

Key Takeaways

  • Start the vLLM server with the desired Llama model checkpoint using the CLI.
  • Query the running vLLM server via OpenAI SDK with base_url set to the local server endpoint.
  • Adjust model and server port easily for different use cases and hardware constraints.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct