Structured Course
Ollama
From first install to production patterns. Every lesson is standalone — jump to what you need, or work through from beginner to advanced.
145 lessons 3 levels Beginner → Advanced
Beginner
What Ollama Is and Why It Exists 5
Installation and Setup 7
Pulling and Running Models 7
+4 more chapters
Intermediate
GPU Acceleration 7
Custom Modelfiles 7
Tool Calling with Ollama 7
+4 more chapters
Advanced
Remote Ollama Server 7
Docker Deployment 7
Kubernetes Deployment 7
+4 more chapters
Full Course Contents
Beginner
47 lessons 1 What Ollama Is and Why It Exists 5
1
The local LLM problem before Ollama Running LLMs locally required manual Docker setup, GPU driver wrangling, and HTTP request plumbing: Ollama eliminated all that friction.
2 What Ollama provides: one command to run any model Ollama lets you run open-source LLMs locally with a single Python function call, no GPU setup or Docker knowledge required.
3 Ollama vs running transformers directly Ollama is a runtime that manages model loading and inference, while transformers library gives you direct control: pick based on whether you need convenience or customization.
4 Privacy guarantee: no data leaves your machine Ollama runs language models entirely on your machine: no API calls, no server uploads, no data leakage.
5 When local LLMs beat cloud APIs Local LLMs with Ollama eliminate latency, cost, and privacy concerns when your use case doesn't require frontier models.
2 Installation and Setup 7
1
Installing Ollama: macOS, Linux, Windows Install Ollama on your machine and verify it runs a local LLM via Python.
2 Verifying installation: ollama --version Check that ollama is installed correctly by running the version command from Python.
3 ollama serve: starting the server Start the Ollama server to enable your Python code to talk to local language models.
4 Server auto-start on system boot Configure ollama to start automatically when your system boots, so the server is always ready without manual intervention.
5 Port configuration: default 11434 Ollama runs on port 11434 by default, but you can change it by setting the OLLAMA_HOST environment variable before starting the server.
6 OLLAMA_HOST for remote access OLLAMA_HOST is an environment variable that tells the Ollama server which network address and port to listen on, enabling access from other machines instead of localhost only.
7 Firewall considerations Ollama runs a local API server on port 11434 by default: you need to understand firewall rules and network exposure to avoid accidentally making your models publicly accessible.
3 Pulling and Running Models 7
1
ollama pull: downloading a model Use <code>ollama pull</code> to download a model from the Ollama registry so you can run it locally.
2 Model naming: llama3.2, mistral, qwen2.5 Model names in Ollama tell you the base model family and version: they control which weights are pulled and how your prompts behave.
3 ollama run: interactive chat Use the ollama Python library to send messages to a running model and receive responses in an interactive conversation.
4 ollama list: what is installed Use <code>ollama.list()</code> to see which models are downloaded and ready to use on your machine.
5 Model storage location Ollama stores downloaded models in a platform-specific directory you can inspect and customize via the OLLAMA_MODELS environment variable.
6 Disk space requirements by model Different Ollama models consume vastly different amounts of disk space: from 2GB to 100GB+: and you need to check before pulling.
7 ollama rm: removing models Remove downloaded models from your local Ollama installation to free disk space and clean up unused models.
4 The Ollama API 7
1
REST API: OpenAI-compatible endpoints Ollama exposes an OpenAI-compatible REST API so you can swap it into existing OpenAI code without rewriting.
2 /api/generate: text generation Use the /api/generate endpoint to stream text generation from a local Ollama model one token at a time.
3 /api/chat: chat completion Use the chat endpoint to have multi-turn conversations with a local Ollama model by sending message history.
4 Request format: model and prompt Every Ollama request requires two things: a model name and a message: learn the minimal structure that makes an API call work.
5 Response streaming: JSON lines Stream an LLM response token-by-token using ollama's JSON lines format instead of waiting for the full response.
6 /api/tags: listing models Query which models are installed and available on your local Ollama server.
7 /api/show: model information Use the <code>/api/show</code> endpoint to inspect detailed metadata about a loaded model, including its parameters, quantization, and size.
5 Python Integration 7
1
Using OpenAI SDK with Ollama Ollama provides an OpenAI-compatible API endpoint so you can use the official OpenAI Python SDK against local models without changing your code.
2 Configuring base_url for Ollama API Calls The <code>base_url</code> parameter tells the ollama Python client where your Ollama server is running so it can send requests to the correct address.
3 api_key="ollama": dummy key required Ollama requires a dummy API key string to initialize the client, even though it's not actually authenticating anything.
4 chat.completions.create() with local model Use ollama's OpenAI-compatible API to send chat messages to a local model and get streaming or complete responses.
5 ollama Python library: native SDK Use the official ollama Python library to talk to local LLMs without writing HTTP requests yourself.
6 ollama.chat() vs OpenAI SDK Ollama's native API is simpler and runs locally; OpenAI SDK requires a network call to remote servers but offers more features.
7 Streaming with Ollama Python SDK Use the <code>stream=True</code> parameter to receive model responses token-by-token instead of waiting for the complete answer.
6 Model Selection 7
1
ollama.com/library: model catalog The ollama.com/library is the public registry where you discover, download, and manage pre-built language models.
2 Model sizes: 1B to 70B+ available Ollama runs models of vastly different sizes locally, from lightweight 1B parameter models to massive 70B+ models, and your choice depends on hardware and latency requirements.
3 Quantization tags: :q4_0, :q8_0 Quantization tags like :q4_0 and :q8_0 control how much a model is compressed, trading accuracy for memory and speed.
4 Hardware matching: RAM to model size Match your available RAM to the model size you want to run, because loading a model larger than your RAM will cause severe slowdown or failure.
5 Speed vs quality tradeoffs Smaller, faster models trade quality for response speed: choose based on your latency and accuracy needs.
6 Recommended models by use case Choose the right Ollama model for your task by understanding speed vs. quality tradeoffs.
7 Pulling specific quantization variants Learn how to pull different quantized versions of the same model to balance speed, memory, and accuracy.
7 Common Issues and Fixes 7
1
ollama not found: PATH issue When you install ollama but Python can't find the command, it's a PATH problem: here's how to fix it.
2 Server not running: port connection refused Diagnose and fix the most common connection error when ollama client code can't reach the ollama server.
3 Out of memory: model too large When a model consumes more RAM than available, Ollama fails to load it: understand why and how to detect it before running out of space.
4 Slow generation: no GPU detected When Ollama runs on CPU instead of GPU, generation becomes 10-100x slower: here's how to detect and fix it.
5 Model pull fails: network timeout When `ollama pull` times out, configure the HTTP client timeout and verify your connection to the model registry.
6 API returning 404: wrong endpoint path Ollama endpoints follow a specific path structure, and using HTTP requests instead of the Python library is the most common way to hit 404 errors.
7 Model context length exceeded When your prompt + response exceeds a model's maximum token limit, the request fails: and you need to know how to detect and handle it.
Intermediate
49 lessons 1 GPU Acceleration 7
1
NVIDIA GPU detection: nvidia-smi check Verify NVIDIA GPU availability and detect compute capability before running Ollama models.
2 CUDA setup for Ollama on Linux Configure NVIDIA CUDA acceleration for Ollama on Linux to run large language models at practical inference speeds.
3 Metal acceleration on macOS Enable GPU acceleration on macOS using Metal to run large language models 10–100x faster than CPU-only inference.
4 AMD ROCm support Enable GPU acceleration on AMD graphics cards by configuring ollama with ROCm support and verifying GPU detection.
5 GPU layer configuration: num_gpu Control how many layers of your model run on GPU versus CPU by setting num_gpu in Ollama to balance speed and memory usage.
6 Mixed CPU/GPU execution Control which layers of a model run on GPU vs CPU to optimize memory usage and inference speed on resource-constrained hardware.
7 Monitoring GPU usage during inference Track GPU memory and compute utilization in real time while running ollama inference to identify bottlenecks and validate hardware acceleration.
2 Custom Modelfiles 7
1
What a Modelfile is A Modelfile is a configuration blueprint that defines how to build and run a custom Ollama model, similar to a Dockerfile for containers.
2 FROM: base model selection Choosing the right base model in a Modelfile determines your local LLM's capabilities, speed, and resource consumption.
3 SYSTEM: setting system prompt Control how a model behaves by defining its system role and instructions before user messages.
4 PARAMETER: model defaults Ollama uses sensible defaults for model parameters like temperature and top_k, but you can override them per request or set them persistently in a Modelfile.
5 TEMPLATE: custom chat format Define your own chat message structure and prompt template to control exactly how messages are formatted when sent to the model.
6 ollama create: building from Modelfile Create custom Ollama models by writing a Modelfile that specifies base model, system prompt, parameters, and other configurations.
7 Iterating on a custom model Build a custom Ollama model from a base model, test it, modify the Modelfile, and rebuild to see changes without losing your work.
3 Tool Calling with Ollama 7
1
Models that support tool calling: Llama 3.1+, Qwen2.5, Mistral Tool calling lets language models request specific functions you define, enabling structured interactions instead of just text.
2 Tool definition format Define structured inputs for model function calls using JSON schema so models know what parameters they can request.
3 OpenAI SDK tool calling with Ollama Use the OpenAI Python SDK to call tools with Ollama models by routing requests through a local API-compatible server.
4 Parsing tool call responses Extract and parse structured tool calls from Ollama's response messages when the model invokes external functions.
5 Executing tools and returning results Use Ollama's tool calling to let language models invoke functions and process their results in a loop.
6 Reliability of local tool calling Local tool calling in Ollama can fail silently or produce malformed function calls: you need explicit validation and retry logic to make it reliable in production.
7 Fallback when tool calling fails Design resilient tool-calling workflows by catching failures and falling back to direct LLM responses or alternative actions.
4 Embeddings with Ollama 7
1
Embedding-capable models: nomic-embed-text, mxbai-embed-large Use specialized embedding models to convert text into high-dimensional vectors for semantic search and similarity tasks, separate from chat models.
2 /api/embeddings endpoint Generate vector embeddings from text using Ollama's embeddings endpoint to power semantic search and similarity comparisons.
3 ollama.embeddings() Python call Convert text into numerical vectors using Ollama's embeddings API to enable semantic similarity search and clustering.
4 Embedding dimensions by model Different Ollama embedding models produce vectors of different dimensions, which affects downstream task compatibility and memory usage.
5 Batch embedding multiple texts Generate embeddings for multiple texts at once instead of calling the API for each one individually.
6 Cosine similarity calculation Measure how similar two text embeddings are by calculating the angle between them in vector space.
7 Local semantic search with Ollama embeddings Use Ollama's embedding models to convert text into vectors, then find semantically similar documents without external APIs.
5 Multimodal Models 7
1
Vision models in Ollama: llava, minicpm-v Use vision models like LLaVA and MiniCPM-V to analyze images directly in Ollama without external APIs.
2 Image input format: base64 encoding Ollama's vision-capable models require images as base64-encoded strings, not file paths.
3 ollama.chat() with images Pass image data directly to ollama.chat() so vision models can analyze photos, diagrams, and screenshots.
4 Multiple images per message Send multiple images in a single message to multimodal models like llava for batch image analysis.
5 Vision model hardware requirements Vision models in Ollama require significantly more VRAM and compute than text-only models: you must validate hardware before deployment.
6 What local vision can and cannot do Local vision models can describe and analyze images but struggle with fine detail, reasoning about text, and spatial relationships: understanding these limits prevents integration failures.
7 Vision quality comparison to GPT-4o Understand how local vision models in Ollama compare to GPT-4o by running identical image analysis tasks and comparing outputs side-by-side.
6 LangChain and LlamaIndex Integration 7
1
OllamaLLM: LangChain wrapper Use LangChain's OllamaLLM to integrate local Ollama models into LangChain chains and agents with a unified interface.
2 ChatOllama for chat models ChatOllama wraps local Ollama models as LangChain chat interfaces, letting you drop local LLMs into chat applications without API calls.
3 OllamaEmbeddings for RAG Use Ollama's embedding models to convert documents into vectors for retrieval-augmented generation (RAG) without external APIs.
4 LlamaIndex OllamaLLM Use LlamaIndex's OllamaLLM wrapper to integrate local Ollama models into retrieval-augmented generation (RAG) pipelines.
5 Using Ollama in a RAG chain Build a retrieval-augmented generation pipeline that fetches relevant documents and feeds them to a local Ollama model for context-aware responses.
6 Streaming with LangChain + Ollama Stream token-by-token responses from Ollama through LangChain's LCEL to show results in real-time instead of waiting for the full response.
7 Local-only RAG pipeline Build a retrieval-augmented generation system that runs entirely locally: model, embeddings, and document storage: without touching external APIs.
7 Concurrent Requests and Performance 7
1
OLLAMA_NUM_PARALLEL: parallel requests Control how many inference requests Ollama processes simultaneously by setting the OLLAMA_NUM_PARALLEL environment variable.
2 OLLAMA_MAX_LOADED_MODELS: multiple models Control how many models Ollama keeps loaded in memory simultaneously to balance performance and resource usage.
3 Request queuing behavior Ollama queues requests in memory when concurrent calls exceed available GPU capacity, blocking until space becomes available.
4 Context length per parallel request Each parallel request to Ollama consumes its own context window independently, and exceeding a model's context limit will silently truncate or fail on a per-request basis.
5 Memory scaling for concurrency Manage GPU and system memory to handle multiple concurrent Ollama requests without crashes or thrashing.
6 Benchmarking concurrent throughput Measure how many simultaneous requests your local Ollama instance can handle by spawning concurrent clients and tracking response times.
7 When parallelism hurts performance Spawning too many concurrent Ollama requests starves the GPU and context, causing slower total throughput than sequential processing.
Advanced
49 lessons 1 Remote Ollama Server 7
1
OLLAMA_HOST=0.0.0.0: network binding OLLAMA_HOST=0.0.0.0 exposes your Ollama server to all network interfaces instead of just localhost, enabling remote access from other machines.
2 Nginx reverse proxy for Ollama Route multiple Ollama instances and manage request load through Nginx while preserving streaming responses.
3 HTTPS with Let's Encrypt Secure your Ollama API endpoints with HTTPS using Let's Encrypt certificates and a reverse proxy.
4 API key auth via reverse proxy Secure Ollama behind a reverse proxy that validates API keys before forwarding requests to the model server.
5 Rate limiting at the proxy level Implement token-bucket rate limiting in front of Ollama to prevent model overload and enforce fair resource allocation across clients.
6 Remote client configuration Configure the Ollama Python client to connect to a remote Ollama server instead of localhost, with proper host validation and error handling.
7 Security hardening for remote serving Secure an Ollama server for remote access by implementing authentication, network isolation, and request validation.
2 Docker Deployment 7
1
Official Ollama Docker image Run Ollama as a containerized service with GPU passthrough and persistent model storage using the official Docker image.
2 CPU-only Docker run Run Ollama in Docker without GPU access to reduce image size, cost, and complexity while accepting slower inference.
3 NVIDIA GPU Docker runtime Configure Ollama in Docker to use NVIDIA GPUs so inference runs on hardware accelerators instead of CPU.
4 Docker Compose for Ollama + application Run Ollama server and a Python client application together in isolated containers with persistent model storage and network communication via Docker Compose.
5 Volume mount for model persistence Use Docker volume mounts to persist Ollama models across container restarts without re-downloading.
6 Health check configuration Configure and monitor Ollama server readiness using health check endpoints to ensure model availability before sending requests.
7 Container resource limits Manage CPU, memory, and GPU allocation when running Ollama in containers to prevent resource contention and OOM crashes.
3 Kubernetes Deployment 7
1
Ollama Helm chart Deploy Ollama to Kubernetes using Helm, managing model persistence, GPU resources, and multi-replica scaling.
2 GPU node selection: nodeSelector Use Kubernetes nodeSelector to pin Ollama inference workloads to GPU nodes and avoid CPU-only scheduling.
3 PersistentVolume for models Mount a Kubernetes PersistentVolume to an ollama pod so model files survive restarts and scale across nodes without re-downloading.
4 Service and Ingress configuration Expose an Ollama server running in Kubernetes to external traffic using Service and Ingress resources with proper TLS and routing.
5 Horizontal pod autoscaling: when it applies Kubernetes HPA can scale Ollama inference pods, but only if your model loading and request routing strategy is designed for stateless isolation.
6 Init container for model pulling Use a Kubernetes init container to pre-pull Ollama models before your application pod starts, avoiding cold-start delays in production.
7 Rolling updates without downtime Deploy new Ollama model versions to a running service by maintaining multiple instances and gracefully draining connections before shutdown.
4 OpenWebUI and Frontend Integration 7
1
OpenWebUI: Ollama web interface OpenWebUI is a web-based chat interface that connects to your local Ollama server, letting you interact with models through a browser instead of the CLI or API.
2 OpenWebUI Docker setup Deploy OpenWebUI in Docker to create a web interface for local Ollama models with proper networking and persistence.
3 Multi-user OpenWebUI configuration Configure OpenWebUI with authentication, per-user model isolation, and request routing to support multiple concurrent users safely on a shared Ollama server.
4 Model access control in OpenWebUI Restrict which users can access which Ollama models through OpenWebUI's admin API and role-based permissions.
5 Custom system prompts per model Override default system prompts at runtime to fine-tune model behavior without retraining or creating new model files.
6 RAG integration in OpenWebUI Connect OpenWebUI to a local vector database and Ollama to retrieve context from your documents before generating responses.
7 OpenWebUI vs custom frontend Decide whether to use OpenWebUI as a ready-made interface or build a custom frontend that directly consumes Ollama's API.
5 Model Management at Scale 7
1
Pre-pulling models: init containers Use Kubernetes init containers to pull Ollama models before your main application starts, eliminating cold-start latency in production.
2 Model registry: custom artifact storage Override Ollama's default model storage location by configuring custom artifact directories and implementing a registry resolver.
3 Version pinning: specific model tags Pin exact model versions in Ollama to ensure reproducible behavior across deployments and prevent silent model updates.
4 Multi-node model distribution Load and serve large language models across multiple machines using Ollama's distributed architecture and load-balancing patterns.
5 Automated model updates Build a daemon that periodically checks for and pulls newer versions of your Ollama models without manual intervention.
6 Storage cost optimization Reduce disk footprint and inference latency by strategically managing model quantization, layer pruning, and selective caching in Ollama deployments.
7 Model catalog management Programmatically inspect, filter, and manage available models in your Ollama instance to build intelligent model selection and orchestration systems.
6 Monitoring and Observability 7
1
Ollama metrics endpoint Monitor Ollama server performance in real-time by querying the Prometheus-compatible metrics endpoint.
2 Prometheus integration Scrape Ollama's built-in Prometheus metrics endpoint to monitor model inference performance and resource usage in production.
3 Grafana dashboard for Ollama Export Ollama metrics to Prometheus and visualize model performance in Grafana to catch inference bottlenecks before users do.
4 Request latency tracking Measure and log the end-to-end latency of ollama requests to identify bottlenecks in model inference time, token generation speed, and network overhead.
5 GPU utilization monitoring Monitor and log GPU memory and compute utilization in real-time while running Ollama inference to detect bottlenecks and optimize model serving.
6 Error rate alerting Monitor and alert on inference failures in production ollama deployments by tracking error rates across requests and triggering notifications when thresholds are breached.
7 Queue depth monitoring Monitor how many requests are queued in the Ollama server to detect bottlenecks and prevent request pile-up in production.
7 Cost Analysis vs Cloud APIs 7
1
Hardware cost amortization Calculate the true cost per inference by amortizing your GPU hardware spend across expected lifetime inferences.
2 Electricity cost calculation Calculate the real-world electricity cost of running local LLM inference on Ollama by measuring token throughput and hardware power consumption.
3 Total cost of ownership comparison Calculate and compare the real operational costs of running Ollama locally versus cloud-hosted LLM APIs across latency, hardware, and throughput.
4 Break-even point analysis Calculate the minimum batch size and context window where local Ollama deployment becomes cheaper than API calls by modeling token costs, inference latency, and hardware amortization.
5 Hybrid strategy: local + cloud overflow Route requests to your local Ollama instance first, then automatically overflow to a cloud LLM provider when the model isn't available or the local instance is saturated.
6 ROI calculation for private deployment Calculate the true cost and revenue impact of running Ollama models privately versus cloud API services over time.
7 Scaling cost comparison at different traffic levels Model which deployment strategy (single instance, load-balanced, or auto-scaling) minimizes cost per inference token across realistic traffic patterns.