How to serve a Llama model with vLLM
Quick answer
Use the vllm CLI to serve a Llama model locally with vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000. Then query it via the OpenAI-compatible Python SDK by setting base_url="http://localhost:8000/v1" and calling client.chat.completions.create() with your prompt.

Prerequisites
- Python 3.9+ (check the vLLM installation docs for the currently supported versions)
- pip install vllm openai
- The meta-llama/Llama-3.1-8B-Instruct weights downloaded or accessible via the Hugging Face Hub
- OpenAI SDK v1+
Setup
Install the vllm and openai Python packages, and make sure the Llama model weights are available locally or via the Hugging Face Hub. Set environment variables (for example, a Hugging Face access token) if the model is gated.
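The official meta-llama checkpoints are gated on the Hugging Face Hub, so one way to let vLLM download the weights is to export an access token before starting the server (a sketch; the token value below is a placeholder, not a real token):

```shell
# The official meta-llama checkpoints are gated on the Hugging Face Hub.
# Export an access token so vLLM can download the weights.
# The value below is a placeholder; substitute your own token.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```

The huggingface_hub library, which vLLM uses for downloads, reads HF_TOKEN from the environment.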
pip install vllm openai

Step by step
Start the vLLM server with the Llama model on port 8000, then query it using the OpenAI Python SDK with the base_url pointing to the local server.
from openai import OpenAI

# Start the vLLM server in a separate terminal:
# vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Python client code to query the running server.
# vLLM does not verify the API key by default, so any placeholder string works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, vLLM!"}],
)
print(response.choices[0].message.content)

Output
Hello, vLLM! How can I assist you today?
Common variations
- Use a different Llama version by changing the model name in both the CLI command and the client code.
- Run the server on a different port by changing the --port argument (and update base_url to match).
- Use streaming (stream=True) or asynchronous calls (the AsyncOpenAI client) with the OpenAI SDK.
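The streaming variation can be sketched as follows, assuming the server from the steps above is running on localhost:8000. The stream_chat and collect_stream_text helpers are illustrative names defined here, not part of the SDK:

```python
def collect_stream_text(deltas):
    """Join the text deltas of a streamed chat response into one string."""
    return "".join(d for d in deltas if d)


def stream_chat(prompt, base_url="http://localhost:8000/v1"):
    """Stream a chat completion from a running vLLM server, printing
    tokens as they arrive, and return the full response text."""
    from openai import OpenAI  # assumes `pip install openai`

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the server sends incremental chunks instead of one response
    )
    deltas = []
    for chunk in stream:
        # Each chunk carries a partial message delta; content may be None.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            deltas.append(chunk.choices[0].delta.content)
    print()
    return collect_stream_text(deltas)
```

With the server running, stream_chat("Hello, vLLM!") prints the reply token by token rather than waiting for the full completion.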
Troubleshooting
- If the client cannot connect, verify the server is running and the port matches.
- Ensure the model path or name is correct and accessible.
- Check for firewall or network issues blocking localhost:8000.
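A quick way to confirm the server is reachable before debugging further is to query the /v1/models endpoint that vLLM's OpenAI-compatible server exposes. A minimal sketch using only the standard library (models_url and check_server are hypothetical helper names):

```python
import json
import urllib.error
import urllib.request


def models_url(base_url):
    """Build the model-listing endpoint URL from the client base URL."""
    return base_url.rstrip("/") + "/models"


def check_server(base_url="http://localhost:8000/v1"):
    """Return True if a vLLM server answers on its /v1/models endpoint."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=5) as resp:
            served = json.load(resp)
            # The endpoint lists the models the server is currently serving.
            print("Models served:", [m["id"] for m in served.get("data", [])])
            return True
    except (urllib.error.URLError, OSError) as exc:
        print("Server not reachable:", exc)
        return False
```

If check_server() returns False, the server is not up, is on a different port, or is blocked; if it returns True but requests still fail, compare the printed model IDs against the model name in your client code.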
Key Takeaways
- Use the vllm serve CLI command to start a local Llama model server.
- Query the running server with the OpenAI SDK by setting base_url to the server URL.
- Adjust the model name and port as needed for different setups.
- Ensure the server is running before sending requests to avoid connection errors.