How to serve LLMs with Ollama in production
Quick answer
Use Ollama to serve LLMs in production by installing the ollama CLI, running the Ollama server on a local or remote host, and integrating it through its REST API. Ollama packages each model together with its runtime configuration, can run inside a Docker container, and by default listens on port 11434 with both a native API and an OpenAI-compatible endpoint. Latency and throughput depend on the model size and the hardware it runs on.
Prerequisites
- Python 3.8+
- Ollama CLI installed (https://ollama.com/docs/cli)
- Docker installed (optional, for containerized deployment)
- Basic knowledge of REST APIs
- API key or access credentials if using remote Ollama hosting
Setup Ollama environment
Install the ollama CLI tool to manage and serve LLMs locally or remotely. Optionally, install Docker for containerized deployment. Set environment variables for authentication if using Ollama cloud services.
# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
output
ollama version is <installed version>
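To make the environment-variable step concrete, here is a minimal sketch of how a Python client might read its connection settings. The variable names OLLAMA_BASE_URL and OLLAMA_API_KEY and the helper client_settings are assumptions chosen for this illustration, not names confirmed by the article; substitute whatever your deployment actually uses.

```python
import os
from typing import Dict, Mapping, Tuple

DEFAULT_BASE_URL = "http://localhost:11434"  # default address of a local Ollama server


def client_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, Dict[str, str]]:
    """Return (base_url, headers) built from environment variables.

    OLLAMA_BASE_URL and OLLAMA_API_KEY are hypothetical names used only
    in this sketch.
    """
    base_url = env.get("OLLAMA_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    headers = {"Content-Type": "application/json"}
    api_key = env.get("OLLAMA_API_KEY")
    if api_key:
        # Send credentials only when a key is configured, for example for a
        # hosted endpoint or a reverse proxy that enforces authentication.
        headers["Authorization"] = f"Bearer {api_key}"
    return base_url, headers
```

With no variables set, the helper falls back to the local server and sends no Authorization header, so the same client code can run unchanged against a local or a remote host.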
Step by step serving example
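Start the Ollama server locally and query it over its REST API. In this example, ollama serve starts the server (skip it if Ollama already runs as a background service), ollama pull llama2 downloads the model, and the Python requests library sends a chat request to the OpenAI-compatible endpoint.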
import requests

# Start the Ollama server and download the model first (run in a terminal):
#   ollama serve
#   ollama pull llama2

# Python client to query the local Ollama server
url = 'http://localhost:11434/v1/chat/completions'
headers = {'Content-Type': 'application/json'}
data = {
    'model': 'llama2',
    'messages': [{'role': 'user', 'content': 'Hello Ollama!'}]
}
# The first request may be slow while the model loads, so allow a generous timeout
response = requests.post(url, json=data, headers=headers, timeout=120)
response.raise_for_status()  # fail loudly on HTTP errors such as an unknown model
print(response.json())
output (abridged; the reply text varies by model and run)
{"choices": [{"message": {"role": "assistant", "content": "Hello! How can I assist you today?"}}]}
Common variations
You can query the server asynchronously, swap in other models from the Ollama library (for example llama2 or mistral), and run Ollama in a Docker container, using the official ollama/ollama image, for isolated and repeatable production deployments.
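Switching models only changes the model field of the request body, so it can help to keep payload construction and reply parsing in two small helpers. The sketch below assumes the OpenAI-style request and response shape shown in this article; build_chat_payload and extract_reply are hypothetical helper names, not part of any Ollama library.

```python
from typing import Any, Dict


def build_chat_payload(model: str, prompt: str) -> Dict[str, Any]:
    """Build a single-turn chat request body for the given model name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response, not a stream
    }


def extract_reply(response_json: Dict[str, Any]) -> str:
    """Return the assistant text from a chat completion response.

    Raises ValueError if the body does not have the expected shape,
    for example when the server returned an error object instead.
    """
    try:
        return response_json["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response shape: {response_json!r}") from exc
```

A caller would pass build_chat_payload('mistral', 'Hello') as the json argument of its HTTP POST and feed the decoded response body to extract_reply.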
# Async example using aiohttp
import aiohttp
import asyncio

async def query_ollama():
    url = 'http://localhost:11434/v1/chat/completions'
    headers = {'Content-Type': 'application/json'}
    data = {
        'model': 'llama2',
        'messages': [{'role': 'user', 'content': 'Async request'}]
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data, headers=headers) as resp:
            result = await resp.json()
            print(result)

asyncio.run(query_ollama())
output
{"choices": [{"message": {"role": "assistant", "content": "This is an async response from Ollama."}}]}
Troubleshooting common issues
- If the Ollama server is unreachable, verify that it is running (start it with ollama serve) and listening on the expected port, which is 11434 by default.
- For authentication errors, check your API keys or environment variables.
- Docker deployment issues often relate to port conflicts or missing volumes; ensure ports are exposed and volumes mounted correctly.
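The reachability check in the first item can be scripted. The sketch below uses only the Python standard library and treats any connection failure, timeout, or non-200 status as "not up"; the helper name ollama_is_up is an assumption for this illustration, and the default URL assumes Ollama's default port.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to the server's root URL answers with 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts and HTTP errors.
        return False
```

Calling ollama_is_up() before sending real traffic, or from a container health check, separates "server not running" from problems in the request itself.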
Key Takeaways
- Install and run Ollama CLI to serve LLMs locally or remotely with minimal setup.
- Use Ollama's REST API, including its OpenAI-compatible endpoint, to integrate LLM inference into production applications.
- Leverage Docker containers for scalable and isolated Ollama deployments.
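- Async API calls let a client keep several requests in flight without blocking, which can improve throughput and responsiveness when the server has capacity to process them in parallel.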
- Check server status and environment variables to troubleshoot common connection issues.
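As a sketch of the async takeaway, the helper below fans several prompts out concurrently while capping how many are in flight at once. It is deliberately transport-agnostic: send is a hypothetical coroutine that you would implement around an aiohttp POST to the chat endpoint, and max_in_flight is an illustrative parameter, not an Ollama setting.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fan_out(
    prompts: List[str],
    send: Callable[[str], Awaitable[str]],
    max_in_flight: int = 4,
) -> List[str]:
    """Run send(prompt) for every prompt, at most max_in_flight at a time."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt: str) -> str:
        async with gate:  # wait here while max_in_flight calls are running
            return await send(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

Capping concurrency on the client keeps a burst of requests from piling up on a single server; how many requests the server actually processes in parallel depends on its configuration and hardware.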