Structured Course

Ollama

From first install to production patterns. Every lesson is standalone — jump to what you need, or work through from beginner to advanced.

145 lessons · 3 levels · Beginner → Advanced

Beginner

47 lessons · 7 chapters

What Ollama Is and Why It Exists 5

Installation and Setup 7

Pulling and Running Models 7

+4 more chapters

Intermediate

49 lessons · 7 chapters

GPU Acceleration 7

Custom Modelfiles 7

Tool Calling with Ollama 7

+4 more chapters

Advanced

49 lessons · 7 chapters

Remote Ollama Server 7

Docker Deployment 7

Kubernetes Deployment 7

+4 more chapters

Full Course Contents

Beginner

47 lessons

1 What Ollama Is and Why It Exists 5

The local LLM problem before Ollama Running LLMs locally used to require manual Docker setup, GPU driver wrangling, and HTTP request plumbing. Ollama removes most of that friction.

What Ollama provides: one command to run any model Ollama lets you run open-source LLMs locally with a single command, with no manual GPU setup or Docker knowledge required.

Ollama vs running transformers directly Ollama is a runtime that manages model loading and inference, while the transformers library gives you direct control. Pick based on whether you need convenience or customization.

Privacy guarantee: no data leaves your machine Ollama runs language models entirely on your machine, so prompts and responses are never sent to an external API or server.

When local LLMs beat cloud APIs Local LLMs with Ollama remove network round-trips, per-token fees, and most privacy concerns when your use case doesn't require frontier models.

2 Installation and Setup 7

Installing Ollama: macOS, Linux, Windows Install Ollama on your machine and verify it runs a local LLM via Python.

Verifying installation: ollama --version Check that ollama is installed correctly by running the version command from Python.

ollama serve: starting the server Start the Ollama server to enable your Python code to talk to local language models.

Server auto-start on system boot Configure ollama to start automatically when your system boots, so the server is always ready without manual intervention.

Port configuration: default 11434 Ollama runs on port 11434 by default, but you can change it by setting the OLLAMA_HOST environment variable before starting the server.

OLLAMA_HOST for remote access OLLAMA_HOST is an environment variable that tells the Ollama server which network address and port to listen on, enabling access from other machines instead of localhost only.

Firewall considerations Ollama runs a local API server on port 11434 by default, so you need to understand firewall rules and network exposure to avoid accidentally making your models publicly accessible.

3 Pulling and Running Models 7

ollama pull: downloading a model Use ollama pull to download a model from the Ollama registry so you can run it locally.

Model naming: llama3.2, mistral, qwen2.5 Model names in Ollama identify the base model family and version, which determines which weights are pulled and how the model responds to your prompts.

ollama run: interactive chat Use the ollama Python library to send messages to a running model and receive responses in an interactive conversation.

ollama list: what is installed Use ollama.list() to see which models are downloaded and ready to use on your machine.

Model storage location Ollama stores downloaded models in a platform-specific directory you can inspect and customize via the OLLAMA_MODELS environment variable.

Disk space requirements by model Different Ollama models consume vastly different amounts of disk space, from 2GB to 100GB+, so check the size before pulling.

ollama rm: removing models Remove downloaded models from your local Ollama installation to free disk space and clean up unused models.

4 The Ollama API 7

REST API: OpenAI-compatible endpoints Ollama exposes an OpenAI-compatible REST API so you can swap it into existing OpenAI code without rewriting.

/api/generate: text generation Use the /api/generate endpoint to stream text generation from a local Ollama model one token at a time.

/api/chat: chat completion Use the chat endpoint to have multi-turn conversations with a local Ollama model by sending message history.

Request format: model and prompt Every Ollama request requires two things, a model name and a prompt or message. Learn the minimal structure that makes an API call work.

Response streaming: JSON lines Stream an LLM response token-by-token using ollama's JSON lines format instead of waiting for the full response.

/api/tags: listing models Query which models are installed and available on your local Ollama server.

/api/show: model information Use the /api/show endpoint to inspect detailed metadata about a model, including its parameters, quantization, and size.

5 Python Integration 7

Using OpenAI SDK with Ollama Ollama provides an OpenAI-compatible API endpoint so you can use the official OpenAI Python SDK against local models without changing your code.

Configuring base_url for Ollama API Calls The base_url parameter tells the OpenAI client where your Ollama server is running so it can send requests to the correct address.

api_key="ollama": dummy key required The OpenAI SDK requires an API key string to initialize the client. Ollama ignores the value and does not authenticate anything, so a placeholder such as "ollama" works.

chat.completions.create() with local model Use ollama's OpenAI-compatible API to send chat messages to a local model and get streaming or complete responses.

ollama Python library: native SDK Use the official ollama Python library to talk to local LLMs without writing HTTP requests yourself.

ollama.chat() vs OpenAI SDK Both talk to the same local Ollama server. The native library is simpler and exposes Ollama-specific features, while the OpenAI SDK keeps your code portable between Ollama and hosted OpenAI-compatible APIs.

Streaming with Ollama Python SDK Use the stream=True parameter to receive model responses token-by-token instead of waiting for the complete answer.

6 Model Selection 7

ollama.com/library: model catalog The ollama.com/library is the public registry where you discover, download, and manage pre-built language models.

Model sizes: 1B to 70B+ available Ollama runs models of vastly different sizes locally, from lightweight 1B parameter models to massive 70B+ models, and your choice depends on hardware and latency requirements.

Quantization tags: :q4_0, :q8_0 Quantization tags like :q4_0 and :q8_0 control how much a model is compressed, trading accuracy for memory and speed.

Hardware matching: RAM to model size Match your available RAM to the model size you want to run, because loading a model larger than your RAM will cause severe slowdown or failure.

Speed vs quality tradeoffs Smaller, faster models trade quality for response speed, so choose based on your latency and accuracy needs.

Pulling specific quantization variants Learn how to pull different quantized versions of the same model to balance speed, memory, and accuracy.

7 Common Issues and Fixes 7

ollama not found: PATH issue When ollama is installed but Python can't find the command, it's usually a PATH problem; here's how to fix it.

Server not running: port connection refused Diagnose and fix the most common connection error when ollama client code can't reach the ollama server.

Out of memory: model too large When a model needs more RAM than is available, Ollama fails to load it. Understand why, and how to detect the problem before you run out of memory.

Slow generation: no GPU detected When Ollama runs on CPU instead of GPU, generation can be 10-100x slower; here's how to detect and fix it.

Model pull fails: network timeout When ollama pull times out, configure the HTTP client timeout and verify your connection to the model registry.

API returning 404: wrong endpoint path Ollama endpoints follow a specific path structure, and using HTTP requests instead of the Python library is the most common way to hit 404 errors.

Model context length exceeded When your prompt plus response exceeds a model's maximum token limit, the request fails or the input is silently truncated, so you need to know how to detect and handle it.

Intermediate

49 lessons

1 GPU Acceleration 7

NVIDIA GPU detection: nvidia-smi check Verify NVIDIA GPU availability and detect compute capability before running Ollama models.

CUDA setup for Ollama on Linux Configure NVIDIA CUDA acceleration for Ollama on Linux to run large language models at practical inference speeds.

Metal acceleration on macOS Enable GPU acceleration on macOS using Metal to run large language models 10–100x faster than CPU-only inference.

AMD ROCm support Enable GPU acceleration on AMD graphics cards by configuring ollama with ROCm support and verifying GPU detection.

GPU layer configuration: num_gpu Control how many layers of your model run on GPU versus CPU by setting num_gpu in Ollama to balance speed and memory usage.

Mixed CPU/GPU execution Control which layers of a model run on GPU vs CPU to optimize memory usage and inference speed on resource-constrained hardware.

Monitoring GPU usage during inference Track GPU memory and compute utilization in real time while running ollama inference to identify bottlenecks and validate hardware acceleration.

2 Custom Modelfiles 7

What a Modelfile is A Modelfile is a configuration blueprint that defines how to build and run a custom Ollama model, similar to a Dockerfile for containers.

FROM: base model selection Choosing the right base model in a Modelfile determines your local LLM's capabilities, speed, and resource consumption.

SYSTEM: setting system prompt Control how a model behaves by defining its system role and instructions before user messages.

PARAMETER: model defaults Ollama uses sensible defaults for model parameters like temperature and top_k, but you can override them per request or set them persistently in a Modelfile.

TEMPLATE: custom chat format Define your own chat message structure and prompt template to control exactly how messages are formatted when sent to the model.

ollama create: building from Modelfile Create custom Ollama models by writing a Modelfile that specifies base model, system prompt, parameters, and other configurations.

Iterating on a custom model Build a custom Ollama model from a base model, test it, modify the Modelfile, and rebuild to see changes without losing your work.

3 Tool Calling with Ollama 7

Models that support tool calling: Llama 3.1+, Qwen2.5, Mistral Tool calling lets language models request specific functions you define, enabling structured interactions instead of just text.

Tool definition format Define structured inputs for model function calls using JSON schema so models know what parameters they can request.

OpenAI SDK tool calling with Ollama Use the OpenAI Python SDK to call tools with Ollama models by routing requests through a local API-compatible server.

Parsing tool call responses Extract and parse structured tool calls from Ollama's response messages when the model invokes external functions.

Executing tools and returning results Use Ollama's tool calling to let language models invoke functions and process their results in a loop.

Reliability of local tool calling Local tool calling in Ollama can fail silently or produce malformed function calls, so you need explicit validation and retry logic to make it reliable in production.

Fallback when tool calling fails Design resilient tool-calling workflows by catching failures and falling back to direct LLM responses or alternative actions.

4 Embeddings with Ollama 7

Embedding-capable models: nomic-embed-text, mxbai-embed-large Use specialized embedding models to convert text into high-dimensional vectors for semantic search and similarity tasks, separate from chat models.

/api/embeddings endpoint Generate vector embeddings from text using Ollama's embeddings endpoint to power semantic search and similarity comparisons.

ollama.embeddings() Python call Convert text into numerical vectors using Ollama's embeddings API to enable semantic similarity search and clustering.

Embedding dimensions by model Different Ollama embedding models produce vectors of different dimensions, which affects downstream task compatibility and memory usage.

Batch embedding multiple texts Generate embeddings for multiple texts at once instead of calling the API for each one individually.

Local semantic search with Ollama embeddings Use Ollama's embedding models to convert text into vectors, then find semantically similar documents without external APIs.

5 Multimodal Models 7

Vision models in Ollama: llava, minicpm-v Use vision models like LLaVA and MiniCPM-V to analyze images directly in Ollama without external APIs.

Image input format: base64 encoding Ollama's REST API expects images for vision-capable models as base64-encoded strings rather than file paths.

ollama.chat() with images Pass image data directly to ollama.chat() so vision models can analyze photos, diagrams, and screenshots.

Multiple images per message Send multiple images in a single message to multimodal models like llava for batch image analysis.

Vision model hardware requirements Vision models in Ollama require significantly more VRAM and compute than text-only models, so validate your hardware before deployment.

What local vision can and cannot do Local vision models can describe and analyze images but struggle with fine detail, reasoning about text, and spatial relationships. Understanding these limits prevents integration failures.

Vision quality comparison to GPT-4o Understand how local vision models in Ollama compare to GPT-4o by running identical image analysis tasks and comparing outputs side-by-side.

6 LangChain and LlamaIndex Integration 7

OllamaLLM: LangChain wrapper Use LangChain's OllamaLLM to integrate local Ollama models into LangChain chains and agents with a unified interface.

ChatOllama for chat models ChatOllama wraps local Ollama models as LangChain chat interfaces, letting you drop local LLMs into chat applications without API calls.

OllamaEmbeddings for RAG Use Ollama's embedding models to convert documents into vectors for retrieval-augmented generation (RAG) without external APIs.

LlamaIndex OllamaLLM Use LlamaIndex's Ollama LLM wrapper to integrate local Ollama models into retrieval-augmented generation (RAG) pipelines.

Using Ollama in a RAG chain Build a retrieval-augmented generation pipeline that fetches relevant documents and feeds them to a local Ollama model for context-aware responses.

Streaming with LangChain + Ollama Stream token-by-token responses from Ollama through LangChain's LCEL to show results in real-time instead of waiting for the full response.

Local-only RAG pipeline Build a retrieval-augmented generation system that runs entirely locally (model, embeddings, and document storage) without touching external APIs.

7 Concurrent Requests and Performance 7

OLLAMA_NUM_PARALLEL: parallel requests Control how many inference requests Ollama processes simultaneously by setting the OLLAMA_NUM_PARALLEL environment variable.

OLLAMA_MAX_LOADED_MODELS: multiple models Control how many models Ollama keeps loaded in memory simultaneously to balance performance and resource usage.

Request queuing behavior Ollama queues requests in memory when concurrent calls exceed available GPU capacity, blocking until space becomes available.

Context length per parallel request Each parallel request to Ollama consumes its own context window independently, and exceeding a model's context limit will silently truncate or fail on a per-request basis.

Memory scaling for concurrency Manage GPU and system memory to handle multiple concurrent Ollama requests without crashes or thrashing.

Benchmarking concurrent throughput Measure how many simultaneous requests your local Ollama instance can handle by spawning concurrent clients and tracking response times.

When parallelism hurts performance Spawning too many concurrent Ollama requests oversubscribes GPU memory and compute, resulting in lower total throughput than sequential processing.

Advanced

49 lessons

1 Remote Ollama Server 7

OLLAMA_HOST=0.0.0.0: network binding OLLAMA_HOST=0.0.0.0 exposes your Ollama server to all network interfaces instead of just localhost, enabling remote access from other machines.

Nginx reverse proxy for Ollama Route multiple Ollama instances and manage request load through Nginx while preserving streaming responses.

HTTPS with Let's Encrypt Secure your Ollama API endpoints with HTTPS using Let's Encrypt certificates and a reverse proxy.

API key auth via reverse proxy Secure Ollama behind a reverse proxy that validates API keys before forwarding requests to the model server.

Rate limiting at the proxy level Implement token-bucket rate limiting in front of Ollama to prevent model overload and enforce fair resource allocation across clients.

Remote client configuration Configure the Ollama Python client to connect to a remote Ollama server instead of localhost, with proper host validation and error handling.

Security hardening for remote serving Secure an Ollama server for remote access by implementing authentication, network isolation, and request validation.

2 Docker Deployment 7

Official Ollama Docker image Run Ollama as a containerized service with GPU passthrough and persistent model storage using the official Docker image.

CPU-only Docker run Run Ollama in Docker without GPU access to reduce image size, cost, and complexity while accepting slower inference.

NVIDIA GPU Docker runtime Configure Ollama in Docker to use NVIDIA GPUs so inference runs on hardware accelerators instead of CPU.

Docker Compose for Ollama + application Run Ollama server and a Python client application together in isolated containers with persistent model storage and network communication via Docker Compose.

Volume mount for model persistence Use Docker volume mounts to persist Ollama models across container restarts without re-downloading.

Health check configuration Configure and monitor Ollama server readiness using health check endpoints to ensure model availability before sending requests.

Container resource limits Manage CPU, memory, and GPU allocation when running Ollama in containers to prevent resource contention and OOM crashes.

3 Kubernetes Deployment 7

Ollama Helm chart Deploy Ollama to Kubernetes using Helm, managing model persistence, GPU resources, and multi-replica scaling.

GPU node selection: nodeSelector Use Kubernetes nodeSelector to pin Ollama inference workloads to GPU nodes and avoid CPU-only scheduling.

PersistentVolume for models Mount a Kubernetes PersistentVolume to an ollama pod so model files survive restarts and scale across nodes without re-downloading.

Service and Ingress configuration Expose an Ollama server running in Kubernetes to external traffic using Service and Ingress resources with proper TLS and routing.

Horizontal pod autoscaling: when it applies Kubernetes HPA can scale Ollama inference pods, but only if your model loading and request routing strategy is designed for stateless isolation.

Init container for model pulling Use a Kubernetes init container to pre-pull Ollama models before your application pod starts, avoiding cold-start delays in production.

Rolling updates without downtime Deploy new Ollama model versions to a running service by maintaining multiple instances and gracefully draining connections before shutdown.

4 OpenWebUI and Frontend Integration 7

OpenWebUI: Ollama web interface OpenWebUI is a web-based chat interface that connects to your local Ollama server, letting you interact with models through a browser instead of the CLI or API.

OpenWebUI Docker setup Deploy OpenWebUI in Docker to create a web interface for local Ollama models with proper networking and persistence.

Multi-user OpenWebUI configuration Configure OpenWebUI with authentication, per-user model isolation, and request routing to support multiple concurrent users safely on a shared Ollama server.

Model access control in OpenWebUI Restrict which users can access which Ollama models through OpenWebUI's admin API and role-based permissions.

Custom system prompts per model Override default system prompts at runtime to adjust model behavior without retraining or creating new model files.

RAG integration in OpenWebUI Connect OpenWebUI to a local vector database and Ollama to retrieve context from your documents before generating responses.

OpenWebUI vs custom frontend Decide whether to use OpenWebUI as a ready-made interface or build a custom frontend that directly consumes Ollama's API.

5 Model Management at Scale 7

Pre-pulling models: init containers Use Kubernetes init containers to pull Ollama models before your main application starts, eliminating cold-start latency in production.

Model registry: custom artifact storage Override Ollama's default model storage location by configuring custom artifact directories and implementing a registry resolver.

Version pinning: specific model tags Pin exact model versions in Ollama to ensure reproducible behavior across deployments and prevent silent model updates.

Multi-node model distribution Load and serve large language models across multiple machines by running several Ollama instances behind load-balancing patterns.

Automated model updates Build a daemon that periodically checks for and pulls newer versions of your Ollama models without manual intervention.

Storage cost optimization Reduce disk footprint and inference latency by strategically managing model quantization, layer pruning, and selective caching in Ollama deployments.

Model catalog management Programmatically inspect, filter, and manage available models in your Ollama instance to build intelligent model selection and orchestration systems.

6 Monitoring and Observability 7

Ollama metrics endpoint Monitor Ollama server performance in real-time by querying the Prometheus-compatible metrics endpoint.

Prometheus integration Scrape Ollama's built-in Prometheus metrics endpoint to monitor model inference performance and resource usage in production.

Grafana dashboard for Ollama Export Ollama metrics to Prometheus and visualize model performance in Grafana to catch inference bottlenecks before users do.

Request latency tracking Measure and log the end-to-end latency of ollama requests to identify bottlenecks in model inference time, token generation speed, and network overhead.

GPU utilization monitoring Monitor and log GPU memory and compute utilization in real-time while running Ollama inference to detect bottlenecks and optimize model serving.

Error rate alerting Monitor and alert on inference failures in production ollama deployments by tracking error rates across requests and triggering notifications when thresholds are breached.

Queue depth monitoring Monitor how many requests are queued in the Ollama server to detect bottlenecks and prevent request pile-up in production.

7 Cost Analysis vs Cloud APIs 7

Hardware cost amortization Calculate the true cost per inference by amortizing your GPU hardware spend across expected lifetime inferences.

Electricity cost calculation Calculate the real-world electricity cost of running local LLM inference on Ollama by measuring token throughput and hardware power consumption.

Total cost of ownership comparison Calculate and compare the real operational costs of running Ollama locally versus cloud-hosted LLM APIs across latency, hardware, and throughput.

Break-even point analysis Calculate the minimum batch size and context window where local Ollama deployment becomes cheaper than API calls by modeling token costs, inference latency, and hardware amortization.

Hybrid strategy: local + cloud overflow Route requests to your local Ollama instance first, then automatically overflow to a cloud LLM provider when the model isn't available or the local instance is saturated.

ROI calculation for private deployment Calculate the true cost and revenue impact of running Ollama models privately versus cloud API services over time.

Scaling cost comparison at different traffic levels Model which deployment strategy (single instance, load-balanced, or auto-scaling) minimizes cost per inference token across realistic traffic patterns.