How-to · Beginner · 3 min read

How to serve DeepSeek model with vLLM

Quick answer
Use the vLLM CLI to serve a DeepSeek model locally by running vllm serve deepseek-ai/deepseek-llm-7b-chat --port 8000 (vLLM expects a Hugging Face model ID, not a DeepSeek API model name like deepseek-chat). Then query it via the openai Python SDK by setting base_url="http://localhost:8000/v1" and calling client.chat.completions.create with the same model ID. This gives you efficient local inference with DeepSeek models through vLLM's OpenAI-compatible server.

Prerequisites

  • Python 3.9+
  • DeepSeek API key (only if using the remote DeepSeek API; not needed for local serving)
  • pip install "openai>=1.0"
  • pip install vllm

Setup

Install the vllm package to serve DeepSeek models locally and the openai SDK to query the server. Ensure you have Python 3.9 or higher.

bash
pip install vllm openai
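Before moving on, you can confirm both packages installed correctly. This small sketch uses only the standard library to report each package's version:

```python
import importlib.metadata as md

def installed_version(pkg: str) -> str:
    """Return the installed version of pkg, or 'not installed' if missing."""
    try:
        return md.version(pkg)
    except md.PackageNotFoundError:
        return "not installed"

# Report the versions of the two packages this guide relies on
for pkg in ("vllm", "openai"):
    print(pkg, installed_version(pkg))
```

If either line prints "not installed", rerun the pip command above in the same environment.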

Step by step

Start the vLLM server hosting the DeepSeek model, then query it with Python using the OpenAI-compatible client.

python
from openai import OpenAI

# Start the vLLM server in a separate terminal:
# vllm serve deepseek-ai/deepseek-llm-7b-chat --port 8000

# Python client to query the local vLLM server. vLLM ignores the API key
# unless started with --api-key, but the SDK requires a non-empty string.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-llm-7b-chat",
    messages=[{"role": "user", "content": "Hello from vLLM DeepSeek server!"}]
)

print(response.choices[0].message.content)
output
Hello from vLLM DeepSeek server! How can I assist you today?

Common variations

  • Use other DeepSeek models, such as a reasoning variant like deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, by changing the Hugging Face model ID in both the server command and the client call.
  • Run the vLLM server with custom ports or additional flags for logging and concurrency.
  • Use async Python calls with asyncio and the OpenAI SDK for non-blocking requests.
python
import asyncio
from openai import AsyncOpenAI

async def main():
    # openai>=1.0 uses AsyncOpenAI with create(); the old acreate() is gone
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    response = await client.chat.completions.create(
        model="deepseek-ai/deepseek-llm-7b-chat",
        messages=[{"role": "user", "content": "Async request to DeepSeek model."}]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
output
Async request to DeepSeek model. How can I help you?
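The custom-flags variation might look like the following sketch. These flag names come from vLLM's CLI, but versions differ, so check vllm serve --help for your installation:

```shell
# Serve on a different port with a capped context length and client authentication
vllm serve deepseek-ai/deepseek-llm-7b-chat \
  --port 8001 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --api-key my-secret-key
```

With --api-key set, clients must pass that same key instead of a placeholder, and the client's base_url must use the matching port.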

Troubleshooting

  • If you see connection errors, verify the vLLM server is running on the specified port.
  • Ensure base_url matches the server address including the /v1 path.
  • The OpenAI SDK requires a non-empty api_key even for local serving; pass any placeholder (for example "EMPTY") unless the server was started with --api-key, in which case the key must match.
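For the first two points, a quick connectivity check can save guesswork. This sketch uses only the standard library and assumes the default port 8000; adjust base_url to match your server:

```python
import json
import urllib.error
import urllib.request

def check_server(base_url: str = "http://localhost:8000/v1") -> bool:
    """Return True if a vLLM server answers on /models, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=3) as resp:
            data = json.load(resp)
            print("Server up, models:", [m["id"] for m in data.get("data", [])])
            return True
    except (urllib.error.URLError, OSError):
        print("Server not reachable; is vllm serve running on that port?")
        return False

check_server()
```

A successful check also tells you the exact model ID the server registered, which is what the client's model argument must match.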

Key Takeaways

  • Use the vLLM CLI to serve DeepSeek models locally with a Hugging Face model ID, e.g. vllm serve deepseek-ai/deepseek-llm-7b-chat --port 8000.
  • Query the local server using the OpenAI Python SDK with base_url set to the vLLM server endpoint.
  • You can run async queries and switch DeepSeek models by adjusting the model name in both server and client calls.
Verified 2026-04 · deepseek-chat, deepseek-reasoner