How to serve LLMs with Ollama in production

Quick answer

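Serve LLMs with Ollama in production by installing Ollama, running its server (ollama serve) on a local or remote host, and integrating it through its HTTP REST API, which listens on port 11434 by default and includes an OpenAI-compatible /v1 endpoint. Ollama bundles model weights and configuration into a single package and publishes an official Docker image, which simplifies deployment. Actual latency and throughput depend on your hardware, model size, and request concurrency.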

PREREQUISITES

Python 3.8+
Ollama CLI installed (https://ollama.com/docs/cli)
Docker installed (optional for containerized deployment)
Basic knowledge of REST APIs
API key or access credentials only if the remote Ollama host sits behind a hosted service or an authenticating proxy (a local server requires none by default)

Setup Ollama environment

Install Ollama, which provides the ollama CLI used to pull, run, and serve models locally or on a remote host. Optionally, install Docker for containerized deployment. A local server needs no credentials by default; set authentication environment variables only if you connect to a hosted Ollama service.

bash

# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

output

ollama version is x.y.z
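
Client code also needs to know where the server listens. The following is a minimal sketch, assuming the client reuses the OLLAMA_HOST environment variable (which the Ollama CLI and server read for the server address) and falls back to the default 127.0.0.1:11434. The helper name ollama_base_url is hypothetical and not part of Ollama.

```python
import os

DEFAULT_HOST = "127.0.0.1:11434"  # Ollama's default address and port


def ollama_base_url(env=None):
    """Return a base URL such as 'http://127.0.0.1:11434' (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_HOST
    # Accept both 'host:port' and full URLs such as 'https://ollama.internal'.
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")
```

Passing a plain dict instead of os.environ keeps the helper easy to test. Note that a server-side value such as 0.0.0.0 is a bind address, so a client on another machine should be given the server's real hostname instead.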

Step by step serving example

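Run a model locally with Ollama and query it over its REST API. Start the server with ollama serve (desktop and Linux installs often run it already as a background service), download the model with ollama pull, then send requests from Python with the requests library. Note that ollama run opens an interactive chat session rather than starting a dedicated server.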

python

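import requests

# Start the Ollama server and download the model first (run in a terminal):
#   ollama serve        # skip if Ollama already runs as a background service
#   ollama pull llama2

# Python client to query the local Ollama server (OpenAI-compatible endpoint)
url = 'http://localhost:11434/v1/chat/completions'
headers = {'Content-Type': 'application/json'}
data = {
    'model': 'llama2',
    'messages': [{'role': 'user', 'content': 'Hello Ollama!'}]
}

# Use a generous timeout: the first request also loads the model into memory
response = requests.post(url, json=data, headers=headers, timeout=120)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json())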

output

{"choices": [{"message": {"role": "assistant", "content": "Hello! How can I assist you today?"}}]}
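
In application code it is usually more convenient to get the reply text back directly than to print the whole JSON body. The sketch below assumes the response follows the chat-completions shape shown in the output above, so the text sits at choices[0].message.content. The function names build_payload, extract_reply, and chat are hypothetical, and the sketch uses only the standard library so it runs without extra dependencies.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_payload(prompt, model="llama2"):
    """Build the JSON body for a single-turn chat request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(body):
    """Pull the assistant text out of a chat-completions response dict."""
    choices = body.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {body!r}")
    return choices[0]["message"]["content"]


def chat(prompt, model="llama2", url=CHAT_URL, timeout=60):
    """Send one prompt and return the reply text; raises on HTTP or network errors."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With a server running and the model pulled, chat("Hello Ollama!") would return the assistant's text as a string. Keeping the payload and parsing steps in separate functions lets you unit-test them without a live server.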

Common variations

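You can query the server asynchronously, swap in other models from the Ollama library such as llama2 or mistral, and deploy Ollama in Docker containers for isolated, repeatable production setups. Ollama exposes an HTTP API only (it has no gRPC interface); to reduce perceived latency, request a streamed response so tokens arrive as they are generated.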

python

# Async example using aiohttp
import aiohttp
import asyncio

async def query_ollama():
    url = 'http://localhost:11434/v1/chat/completions'
    headers = {'Content-Type': 'application/json'}
    data = {
        'model': 'llama2',
        'messages': [{'role': 'user', 'content': 'Async request'}]
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data, headers=headers) as resp:
            result = await resp.json()
            print(result)

asyncio.run(query_ollama())

output

{"choices": [{"message": {"role": "assistant", "content": "This is an async response from Ollama."}}]}
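
Async calls pay off when many requests are in flight, but an unbounded burst can overwhelm a single server. The sketch below caps concurrency with asyncio.Semaphore. It assumes send is any coroutine function that takes a prompt and returns a result (for example, a variant of query_ollama that accepts the prompt as an argument); run_bounded is a hypothetical name and the default limit of 4 is an arbitrary starting point, not an Ollama recommendation.

```python
import asyncio


async def run_bounded(prompts, send, max_concurrency=4):
    """Run send(prompt) for every prompt, at most max_concurrency at a time.

    Results are returned in the same order as the prompts.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await send(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))
```

Tune the limit against your hardware: a value close to the number of requests the server can process in parallel keeps queues short without leaving capacity idle.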

Troubleshooting common issues

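If the Ollama server is unreachable, verify that ollama serve (or the Ollama background service) is running and listening on the expected port, 11434 by default. A GET request to http://localhost:11434 should return "Ollama is running". Set OLLAMA_HOST to change the bind address.
A local Ollama server does not require authentication by default. If you see authentication errors, check the API keys or environment variables expected by the hosted service or reverse proxy in front of it.
Docker deployment issues often relate to port conflicts or missing volumes; ensure the container port is published to the host and a volume is mounted so downloaded models persist across restarts.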
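
A quick reachability probe can rule out connection problems before you debug anything else. This is a minimal sketch, assuming the server root answers a plain GET with HTTP 200; server_reachable is a hypothetical helper name, and it uses only the standard library.

```python
import urllib.error
import urllib.request


def server_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if a GET to the server root succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or malformed URL.
        return False
```

The same function can back a container health check or a startup gate that waits until the server is ready before sending traffic.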

Key Takeaways

Install and run Ollama CLI to serve LLMs locally or remotely with minimal setup.
Use Ollama's REST API, including its OpenAI-compatible /v1 endpoints, to integrate LLM inference into production applications.
Leverage Docker containers for scalable and isolated Ollama deployments.
Async API calls improve throughput and responsiveness in production environments.
Check server status and environment variables to troubleshoot common connection issues.

Verified 2026-04 · llama2, gpt4o
