How to serve a DeepSeek model with vLLM
Quick answer
Use the vllm CLI to serve a DeepSeek model locally, for example: vllm serve deepseek-ai/deepseek-llm-7b-chat --served-model-name deepseek-chat --port 8000. Then query it via the openai Python SDK by setting base_url="http://localhost:8000/v1" and calling client.chat.completions.create with model="deepseek-chat". vLLM exposes an OpenAI-compatible server, so the standard SDK works unchanged for efficient local inference.

Prerequisites
- Python 3.8+
- pip install vllm
- pip install "openai>=1.0"
- A DeepSeek API key only if you call the remote DeepSeek API instead of a local server
Setup
Install the vllm package to serve DeepSeek models locally and the openai SDK to query the server. Ensure you have Python 3.8 or higher.
pip install vllm openai

Step by step
Start the vLLM server hosting the DeepSeek model, then query it with Python using the OpenAI-compatible client.
from openai import OpenAI

# Start the vLLM server in a separate terminal:
# vllm serve deepseek-ai/deepseek-llm-7b-chat --served-model-name deepseek-chat --port 8000

# Query the local vLLM server with the OpenAI-compatible client.
# vLLM does not validate the API key, but the SDK requires a non-empty value.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello from vLLM DeepSeek server!"}],
)
print(response.choices[0].message.content)

Output:
Hello from vLLM DeepSeek server! How can I assist you today?
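Because vLLM speaks the standard OpenAI wire protocol, the same endpoint can also be queried with plain HTTP and no SDK at all. Below is a stdlib-only sketch; the payload shape is the standard chat-completions format, and the port matches the server command above (the function name is illustrative).

```python
import json
import urllib.request

# The JSON body is the standard OpenAI chat-completions payload.
payload = {
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "Hello over raw HTTP!"}],
}

def raw_chat(base_url="http://localhost:8000/v1"):
    """POST the payload to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# raw_chat()  # requires the server from the previous step to be running
```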
Common variations
- Serve other DeepSeek models (for example, a DeepSeek-R1 distill checkpoint such as deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) by changing the model name in both the server command and the client call.
- Run the vLLM server with custom ports or additional flags for logging and concurrency.
- Use async Python calls with asyncio and the AsyncOpenAI client for non-blocking requests.
import asyncio
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI exposes awaitable versions of the same methods.
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    response = await client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": "Async request to DeepSeek model."}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())

Output:
Async request to DeepSeek model. How can I help you?
Troubleshooting
- If you see connection errors, verify the vLLM server is running on the specified port.
- Ensure base_url matches the server address, including the /v1 path.
- The SDK requires a non-empty api_key even for local serving; any placeholder string (for example "EMPTY") works, since a local vLLM server does not validate it by default.
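The first two checks above can be done programmatically with a small stdlib-only probe of the server's /v1/models endpoint, which is part of vLLM's OpenAI-compatible API (the port and function name below are assumptions matching the commands in this guide):

```python
import json
import urllib.error
import urllib.request

def server_ready(base_url: str = "http://localhost:8000/v1", timeout: float = 2.0) -> bool:
    """Return True if a vLLM server answers the /models endpoint at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            data = json.load(resp)
        # A healthy server lists at least one served model.
        return bool(data.get("data"))
    except (urllib.error.URLError, OSError, ValueError):
        return False

# Returns False until the vLLM server is up on the given port.
```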
Key takeaways
- Use the vLLM CLI to serve DeepSeek models locally, e.g. vllm serve deepseek-ai/deepseek-llm-7b-chat --served-model-name deepseek-chat --port 8000.
- Query the local server using the OpenAI Python SDK with base_url set to the vLLM server endpoint.
- You can run async queries and switch DeepSeek models by adjusting the model name in both server and client calls.