How to run Llama with vLLM
Quick answer
Use the vllm Python package to run Llama models locally: start the vLLM server with a Llama model checkpoint, then query it through the OpenAI-compatible openai SDK pointed at the local server. This setup gives efficient, low-latency inference with Llama models such as meta-llama/Llama-3.1-8B-Instruct.
Prerequisites
- Python 3.8+
- pip install vllm openai
- A Llama model checkpoint (e.g. meta-llama/Llama-3.1-8B-Instruct)
- An OPENAI_API_KEY environment variable (any dummy value works when querying a local vLLM server)
Setup vLLM server
Install the vllm package and download the Llama model checkpoint. Then start the vLLM server locally on port 8000 with the desired Llama model.
pip install vllm openai
# Download model checkpoint from Hugging Face or Meta's release
# Example command to start server:
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Run inference with Python
Use the OpenAI SDK with base_url pointing to the local vLLM server to send chat completion requests to the Llama model.
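Model loading can take a while, so it helps to confirm the server is actually listening before sending requests. A minimal standard-library sketch (the server_ready helper name and defaults are ours, not part of vLLM or the OpenAI SDK):

```python
import socket

def server_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the vLLM server port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(server_ready())  # True once `vllm serve` has finished loading the model
```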
import os
from openai import OpenAI
# No real API key is needed for a local server, but the SDK requires one, so set a dummy value
os.environ["OPENAI_API_KEY"] = "test"
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain chain-of-thought prompting."}]
)
print(response.choices[0].message.content)
Output
Chain-of-thought prompting is a technique where the model is guided to reason step-by-step, improving accuracy on complex tasks.
Common variations
- Use different Llama model sizes by changing the model name in the vllm serve command and in the Python client.
- Run the server on a different port by adjusting the --port flag and base_url.
- Use SamplingParams from vllm for advanced generation control when calling the vLLM Python API directly.
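When going through the server rather than the Python API, generation controls like temperature and max_tokens travel as standard fields in the OpenAI-compatible request body. A sketch of assembling that payload by hand, useful for plain-HTTP clients (the build_chat_payload helper is ours):

```python
import json

def build_chat_payload(model: str, prompt: str,
                       temperature: float = 0.7, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("meta-llama/Llama-3.1-8B-Instruct",
                             "Explain chain-of-thought prompting.")
print(json.dumps(payload, indent=2))
```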
Troubleshooting
- If you get connection errors, ensure the vLLM server is running and accessible at the specified port.
- Check that the model checkpoint path is correct and compatible with vLLM.
- For GPU memory errors, try smaller Llama models or use 8-bit/4-bit quantized versions if supported.
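For connection or wrong-model errors, the server's /v1/models endpoint (part of the OpenAI-compatible API vLLM exposes) reports which model IDs are actually loaded. A standard-library sketch, with a helper name of our choosing:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 3.0):
    """Return the model IDs the vLLM server reports, or an error string if unreachable."""
    try:
        with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except URLError as exc:
        return f"unreachable: {exc.reason}"

print(list_served_models())
```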
Key Takeaways
- Start the vLLM server with the desired Llama model checkpoint using the CLI.
- Query the running vLLM server via the OpenAI SDK with base_url set to the local server endpoint.
- Adjust the model and server port easily for different use cases and hardware constraints.