Code Advanced hard · 8 min

Local models with Ollama integration

What you will learn

Run private LLMs locally via Ollama and integrate them into LlamaIndex RAG pipelines without cloud dependencies.

Why this matters

Enterprises often need to keep sensitive data on-premises, reduce API costs at scale, and keep latency predictable. Local models served by Ollama address all three, while LlamaIndex handles the orchestration.

Skip if: you need state-of-the-art reasoning (GPT-4 level), require sub-100ms latency on CPU hardware, or are prototyping and need fast iteration with proven models. In those cases cloud APIs are faster to validate.

Explanation

What it is: Ollama is a lightweight runtime that downloads, runs, and manages open-source LLMs (Llama 2, Mistral, etc.) locally on your machine or server. LlamaIndex's Ollama class lets you swap any cloud LLM for a local one with a single line change: the pipeline stays identical.

How it works: Once you configure Ollama(model='mistral'), LlamaIndex sends each prompt to a local HTTP endpoint (default: localhost:11434). Ollama manages the model lifecycle: loading weights into memory (VRAM when a GPU is available), queuing incoming requests, and falling back to CPU and system RAM when the model does not fit on the GPU. The response flows back through your RAG chain exactly like an OpenAI call would, but with no network round-trip to a third-party API and no API key to leak.
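
To make the HTTP hop concrete, here is a minimal sketch of the kind of request that reaches the local server. It targets Ollama's /api/generate endpoint using only the standard library. The helper names build_generate_request and generate are hypothetical, and the LlamaIndex integration may call a different endpoint (such as /api/chat) internally, so treat this as an illustration of the protocol rather than what the library does line for line.

```python
import json
import urllib.request


def build_generate_request(prompt, model="mistral", base_url="http://localhost:11434"):
    """Build (without sending) a POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="mistral", base_url="http://localhost:11434", timeout=120.0):
    """Send the request and return the generated text. Needs a running Ollama server."""
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]


# Inspecting the request needs no server; only generate() touches the network.
req = build_generate_request("Why is the sky blue?")
print(req.get_method(), req.full_url)  # POST http://localhost:11434/api/generate
```

Calling generate('Why is the sky blue?') with the server running and mistral pulled would return the model's answer as a string. Nothing in the request carries a credential, which is the practical meaning of "no credential risk" here.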

When to use it: You have a local development machine or private inference server, model latency is acceptable (100ms–2s typical for Mistral 7B on consumer GPU), and either data sensitivity or cost per inference makes cloud prohibitive. Production deployments pair Ollama with LlamaIndex indexing for sub-second retrieval + local reasoning.

Analogy

Ollama is like running your own coffee machine instead of using a delivery service: slower than someone bringing it to you instantly, but you control the recipe, no one sees your ingredients, and you pay once instead of per cup.

Code

Illustrative only - not runnable without a running Ollama server and the mistral and nomic-embed-text models pulled

python

from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Prerequisite: the Ollama server must already be running (ollama serve),
# with both models pulled: ollama pull mistral && ollama pull nomic-embed-text
# In production, the Ollama daemon runs as a service

OLLAMA_URL = 'http://localhost:11434'

# Configure LlamaIndex to use local Ollama for both LLM and embeddings
# request_timeout is raised because the first call also loads the model weights
Settings.llm = Ollama(model='mistral', base_url=OLLAMA_URL, request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name='nomic-embed-text', base_url=OLLAMA_URL)

# Create sample documents for indexing
sample_docs = [
    {
        'text': 'Machine learning is a subset of artificial intelligence that focuses on learning from data.',
        'metadata': {'source': 'ml_intro.txt'}
    },
    {
        'text': 'Embeddings convert text into high-dimensional vectors that capture semantic meaning.',
        'metadata': {'source': 'embeddings.txt'}
    },
    {
        'text': 'Vector databases store embeddings for efficient similarity search and retrieval.',
        'metadata': {'source': 'vectordb.txt'}
    }
]

# Wrap each sample in a LlamaIndex Document
documents = [Document(text=doc['text'], metadata=doc['metadata']) for doc in sample_docs]

# Build index with local embeddings
index = VectorStoreIndex.from_documents(documents)

# Query using local Ollama LLM
question = 'What is the relationship between embeddings and vector databases?'
query_engine = index.as_query_engine()
response = query_engine.query(question)

print('Query:')
print(question)
print()
print('Response:')
print(response.response)

Output

Query:
What is the relationship between embeddings and vector databases?

Response:
Embeddings are vector representations of text that capture semantic meaning in high-dimensional space. Vector databases are specialized systems designed to store and index these embeddings, enabling efficient similarity search and retrieval. Together, they form the backbone of semantic search systems: embeddings convert unstructured text into queryable vectors, and vector databases make searching across millions of these vectors fast and scalable. This combination allows systems to find semantically similar documents without exact keyword matches.

What just happened?

The code configured Ollama's mistral model and the nomic-embed-text embedding model, both pointing to localhost:11434 (where an Ollama server must be running). It created three sample documents, built a VectorStoreIndex using local embeddings, then queried that index using the local LLM. Both embedding generation and response generation happened entirely on your machine: the only HTTP calls went to localhost, with no external API calls and no cloud round-trips.

Common gotcha

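Developers often assume that Ollama starts automatically, or that from llama_index.llms.ollama import Ollama starts it. It doesn't. The Ollama server must be running before any query executes: run ollama serve in a separate terminal or container, unless your installation already runs it as a background service. If it is not running, you get a Connection refused error on localhost:11434. Also, requests made through the API do not download a missing model for you: run ollama pull mistral and ollama pull nomic-embed-text beforehand. The first request after startup is still slower than later ones because the weights have to be loaded into memory.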
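
One way to catch both problems before indexing starts is a small preflight check. The sketch below uses only the standard library and Ollama's /api/tags endpoint, which lists the models already on disk. The function names (installed_models, missing_models, preflight) are hypothetical helpers, not part of LlamaIndex or Ollama, and the sketch assumes a model name without a tag means the ':latest' tag.

```python
import json
import urllib.error
import urllib.request

REQUIRED_MODELS = ["mistral", "nomic-embed-text"]


def installed_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/api/tags", timeout=timeout) as reply:
            data = json.loads(reply.read())
    except (urllib.error.URLError, OSError):
        return None
    return [entry["name"] for entry in data.get("models", [])]


def _with_tag(name):
    # Assumption: a bare name such as "mistral" refers to "mistral:latest".
    return name if ":" in name else f"{name}:latest"


def missing_models(required, installed):
    """Return the required models that are absent from the installed list."""
    have = {_with_tag(name) for name in installed}
    return [name for name in required if _with_tag(name) not in have]


def preflight(required=REQUIRED_MODELS, base_url="http://localhost:11434"):
    """Return 'ok' or a short message describing what to fix."""
    installed = installed_models(base_url)
    if installed is None:
        return "Ollama is not reachable; start it with: ollama serve"
    missing = missing_models(required, installed)
    if missing:
        return "Missing models; run: " + " && ".join(f"ollama pull {m}" for m in missing)
    return "ok"


# The comparison logic can be exercised without a server:
print(missing_models(REQUIRED_MODELS, ["mistral:latest"]))  # ['nomic-embed-text']
```

Calling preflight() at the top of an indexing script turns a connection error deep inside the pipeline into a one-line instruction.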

Error recovery

ConnectionRefusedError: [Errno 111] Connection refused

Ollama server is not running. Fix: Open a terminal, run 'ollama serve', then keep it running. Verify with 'curl http://localhost:11434', which should return the text 'Ollama is running'.

ValueError: Model 'mistral' not found

The model hasn't been pulled to disk yet. Fix: Run 'ollama pull mistral' in a terminal. This downloads ~4GB. After pulling, retry your query.

OllamaEmbedding does not exist

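The Ollama integration packages are not installed, or llama-index is older than 0.10.0. Since 0.10.0 each integration ships as its own package. Fix: Run 'pip install --upgrade llama-index-llms-ollama llama-index-embeddings-ollama' and import from 'llama_index.embeddings.ollama'.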

CUDA out of memory

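Model weights (plus the context cache) exceed your GPU VRAM. Fix: Either reduce the context window (for example, context_window=2048 on the Ollama class; a max_tokens-style setting only caps output length and does not shrink the context), use a smaller model such as 'phi3' (3.8B params), or rely on CPU offloading (Ollama does this automatically if needed, but it's slow).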

Experienced dev note

A senior dev knows: the Ollama daemon is stateful. Multiple processes can hit the same daemon safely, but switching models (swapping mistral for llama2) can stall queries while weights are reloaded, especially when both models do not fit in memory together. In production, run one Ollama instance per model, or use a load balancer. Also, quantized models (default in Ollama) trade some accuracy for roughly 3–4x memory savings compared with 16-bit weights: usually acceptable for retrieval tasks, riskier for reasoning. Test quantization impact on your specific data before deploying.
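
The memory side of the quantization trade-off is simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only. The helper names and the 1.5 GB overhead default are assumptions for illustration, and real model files run somewhat larger because quantization formats store extra scaling data and the context cache also needs memory.

```python
def approx_weight_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough check; overhead_gb is an assumed allowance for cache and runtime buffers."""
    return approx_weight_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {approx_weight_gb(7, bits):.1f} GB")
# 7B model at 16-bit: 14.0 GB
# 7B model at 8-bit: 7.0 GB
# 7B model at 4-bit: 3.5 GB

print(fits_in_vram(7, 16, vram_gb=8))  # False
print(fits_in_vram(7, 4, vram_gb=8))   # True
```

Going from 16-bit to 4-bit weights is where the roughly 4x saving comes from, and it is why a 7B model that cannot fit on an 8 GB card at full precision fits comfortably once quantized.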

Check your understanding

Your RAG system uses local Ollama for embeddings but cloud OpenAI for the final answer LLM. If Ollama crashes mid-indexing, what data is lost and what remains queryable? Why?

Show answer hint

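A correct answer separates where state lives. Embeddings that were already written to a persistent vector store (or to an index saved on disk) survive the crash; embeddings for in-flight documents are lost, and those documents must be re-embedded. With the default in-memory index nothing survives unless it was persisted, because the indexing call fails before an index is returned. Queries also depend on Ollama: the query text must be embedded with the same local model, so retrieval stays unavailable until Ollama is back, even though the OpenAI answer LLM is unaffected.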

VERSION llama-index-core >= 0.10.0 (April 2026): OllamaEmbedding moved from llama_index.embeddings to llama_index.embeddings.ollama. Earlier versions used legacy ServiceContext patterns: those are incompatible with this example.

Learn how to optimize Ollama inference speed by configuring context windows and batch sizes, then see how to persist indexed embeddings to avoid re-embedding on restart.
