Local models with Ollama integration
Why this matters
Enterprises need to keep sensitive data on-premises, reduce API costs at scale, and maintain latency guarantees: local models via Ollama solve all three while LlamaIndex handles the orchestration seamlessly.
Explanation
What it is: Ollama is a lightweight runtime that downloads, runs, and manages open-source LLMs (Llama 2, Mistral, etc.) locally on your machine or server. LlamaIndex's Ollama class lets you swap any cloud LLM for a local one with a single line change: the pipeline stays identical.
How it works: When you initialize Ollama(model='mistral'), LlamaIndex sends prompts to a local HTTP endpoint (default: localhost:11434). Ollama manages the model lifecycle: loading weights into GPU memory (spilling over to system RAM when the GPU is too small), queuing requests, and unloading idle models. The response flows back through your RAG chain exactly like an OpenAI call would, but with no network round-trip to a cloud provider and no API key to leak.
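To make the HTTP hop concrete, here is a minimal sketch of the kind of request that arrives at the local endpoint. It assumes Ollama's REST route /api/generate with a JSON body containing model, prompt, and stream. The helper names build_generate_request and generate are made up for illustration, and LlamaIndex's own client may use a different route (such as the chat endpoint) internally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned in the text


def build_generate_request(prompt, model="mistral", base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="mistral", base_url=OLLAMA_URL, timeout=120):
    """Send the request and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]


if __name__ == "__main__":
    # Only builds and prints the request, so this runs without a server.
    req = build_generate_request("Why is the sky blue?")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Calling generate("Why is the sky blue?") against a live server should return the model's text. Everything stays on the loopback interface, which is why no credentials appear anywhere in the request.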
When to use it: You have a local development machine or private inference server, model latency is acceptable (100ms–2s is typical for Mistral 7B on a consumer GPU), and either data sensitivity or cost per inference makes a cloud API prohibitive. Production deployments pair Ollama with LlamaIndex indexing for sub-second retrieval plus local reasoning.
Analogy
Ollama is like running your own coffee machine instead of using a delivery service: slower than someone bringing it to you instantly, but you control the recipe, no one sees your ingredients, and you pay once instead of per cup.
Code
# Requires: pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.core import VectorStoreIndex, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
# Prerequisite: an Ollama server must already be running (ollama serve),
# with both models pulled (ollama pull mistral, ollama pull nomic-embed-text).
# In production, the Ollama daemon typically runs as a service.
# Configure LlamaIndex to use local Ollama for both LLM and embeddings
llm = Ollama(model='mistral', base_url='http://localhost:11434')
Settings.llm = llm
Settings.embed_model = OllamaEmbedding(model_name='nomic-embed-text', base_url='http://localhost:11434')
# Create sample documents for indexing
sample_docs = [
    {
        'text': 'Machine learning is a subset of artificial intelligence that focuses on learning from data.',
        'metadata': {'source': 'ml_intro.txt'}
    },
    {
        'text': 'Embeddings convert text into high-dimensional vectors that capture semantic meaning.',
        'metadata': {'source': 'embeddings.txt'}
    },
    {
        'text': 'Vector databases store embeddings for efficient similarity search and retrieval.',
        'metadata': {'source': 'vectordb.txt'}
    }
]
from llama_index.core.schema import Document
documents = [Document(text=doc['text'], metadata=doc['metadata']) for doc in sample_docs]
# Build index with local embeddings
index = VectorStoreIndex.from_documents(documents)
# Query using local Ollama LLM
query_engine = index.as_query_engine()
question = 'What is the relationship between embeddings and vector databases?'
response = query_engine.query(question)
print('Query:')
print(question)
print()
print('Response:')
print(response.response)
Output (exact wording varies between runs and models)
Query:
What is the relationship between embeddings and vector databases?

Response:
Embeddings are vector representations of text that capture semantic meaning in high-dimensional space. Vector databases are specialized systems designed to store and index these embeddings, enabling efficient similarity search and retrieval. Together, they form the backbone of semantic search systems: embeddings convert unstructured text into queryable vectors, and vector databases make searching across millions of these vectors fast and scalable. This combination allows systems to find semantically similar documents without exact keyword matches.
What just happened?
The code configured clients for Ollama's mistral model and the nomic-embed-text embedding model, both pointing to localhost:11434 (where an Ollama server must be running). It created three sample documents, built a VectorStoreIndex using local embeddings, then queried that index using the local LLM. Both embedding generation and response generation happened entirely on your machine: no cloud API calls, no external round-trips.
Common gotcha
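Developers assume Ollama runs automatically, or that from llama_index.llms.ollama import Ollama starts it. It doesn't. Unless Ollama is already running as a background service, you must run ollama serve in a separate terminal or container before any query executes. If you skip this, you get a cryptic Connection refused error on localhost:11434. Also, do not count on models being fetched on first use: a request for a model that has not been pulled typically fails with a "not found" error, and the download itself is slow. Run ollama pull mistral (and ollama pull nomic-embed-text) beforehand if you need predictable startup times.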
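A pre-flight check along these lines can turn both failure modes into readable messages before any query runs. This is a sketch, assuming Ollama's /api/tags route returns a JSON object with a models list whose entries carry a name field; installed_models and missing_models are hypothetical helper names, not part of LlamaIndex or Ollama.

```python
import json
import urllib.error
import urllib.request


def installed_models(base_url="http://localhost:11434", timeout=3):
    """Return the model names the server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as reply:
            data = json.loads(reply.read())
    except (urllib.error.URLError, OSError):
        return None
    return [entry["name"] for entry in data.get("models", [])]


def missing_models(required, installed):
    """List required models that are not installed.

    A bare name such as 'mistral' is treated as 'mistral:latest'.
    """
    def normalize(name):
        return name if ":" in name else f"{name}:latest"

    have = {normalize(name) for name in installed}
    return [name for name in required if normalize(name) not in have]


if __name__ == "__main__":
    required = ["mistral", "nomic-embed-text"]
    installed = installed_models()
    if installed is None:
        print("Ollama is not reachable; start it with: ollama serve")
    else:
        for name in missing_models(required, installed):
            print(f"Model missing; fetch it with: ollama pull {name}")
```

Running this at application startup, before building the index, keeps the failure close to its cause instead of surfacing deep inside a query call.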
Error recovery
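ConnectionRefusedError: [Errno 111] Connection refused
The Ollama server is not running or is listening on a different address. Start it with ollama serve and check that base_url matches.
ValueError: Model 'mistral' not found
The model has not been downloaded. Run ollama pull mistral (and ollama pull nomic-embed-text for the embedding model).
OllamaEmbedding does not exist
The embedding integration is most likely not installed. Install it with pip install llama-index-embeddings-ollama.
CUDA out of memory
The model does not fit in GPU memory. Try a smaller or more heavily quantized model, or free VRAM held by other processes.
Experienced dev note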
A senior dev knows that Ollama is stateful. Multiple processes can hit the same daemon safely, but switching models (swapping mistral for llama2) can stall queries while the new weights load. In production, run one Ollama instance per model, or put a load balancer in front of several instances. Also, quantized models (the default in Ollama) trade roughly 5–15% accuracy for 3–4x memory savings, which is usually acceptable for retrieval tasks but risky for reasoning. Test the impact of quantization on your own data before deploying.
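One way to act on that last point is a small regression check that runs the same questions through two model variants and compares scores. The sketch below is illustrative only: keyword_score and compare_models are hypothetical helpers, and keyword matching is a crude stand-in for a real evaluation metric. Each model is passed in as a plain callable, so stubs are used here in place of live models.

```python
def keyword_score(answer, keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for word in keywords if word.lower() in text)
    return hits / len(keywords)


def compare_models(cases, models):
    """Average keyword score per model.

    cases:  list of (question, expected_keywords) pairs
    models: dict mapping a label to a callable question -> answer text
    """
    results = {}
    for label, ask in models.items():
        scores = [keyword_score(ask(question), keywords) for question, keywords in cases]
        results[label] = sum(scores) / len(scores) if scores else 0.0
    return results


if __name__ == "__main__":
    # Stub callables stand in for real models so the sketch runs anywhere.
    cases = [("What do embeddings capture?", ["semantic", "vector"])]
    models = {
        "full": lambda q: "Embeddings are vectors that capture semantic meaning.",
        "quantized": lambda q: "Embeddings capture meaning.",
    }
    print(compare_models(cases, models))
```

With real models, each callable could wrap a LlamaIndex client, for example lambda q: llm.complete(q).text for an Ollama instance named llm, with the question set drawn from your own documents.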
Check your understanding
Your RAG system uses local Ollama for embeddings but cloud OpenAI for the final answer LLM. If Ollama crashes mid-indexing, what data is lost and what remains queryable? Why?
Answer hint
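A correct answer identifies where the state lives: in the vector index held by your application or vector store, not in the Ollama process. Embeddings that were already generated and written to the index survive the crash, provided the index is persisted or the indexing process keeps running; only the in-flight document embeddings are lost. Documents indexed so far remain in the index, but each new query still needs Ollama to embed the query text, so retrieval resumes once Ollama is back. The answer-generation step is unaffected, since it runs on OpenAI. The key insight is that embeddings are stored elsewhere after generation, while the Ollama process itself is ephemeral for indexing.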
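The "where state lives" idea can be sketched without any LLM library. The example below writes each embedding to disk as soon as it is generated, so a crash in the embedding backend loses only the document in flight. The names load_checkpoint and index_with_checkpoint are hypothetical, and embed_fn is an assumed stand-in for whatever calls the local embedding model; in LlamaIndex itself, persisting the index's storage context plays the analogous role.

```python
import json
import os


def load_checkpoint(path):
    """Return the {doc_id: embedding} mapping saved so far (empty if none)."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def index_with_checkpoint(docs, embed_fn, path):
    """Embed each document and persist after every success.

    docs:     dict mapping doc_id -> text
    embed_fn: callable text -> list of floats (assumed wrapper around a local model)
    If embed_fn raises midway, everything already written to `path` survives,
    and a later call resumes with the documents that are still missing.
    """
    done = load_checkpoint(path)
    for doc_id, text in docs.items():
        if doc_id in done:
            continue  # already embedded in an earlier run
        done[doc_id] = embed_fn(text)
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(done, handle)
        os.replace(tmp_path, path)  # swap in the new file only once it is fully written
    return done
```

Rewriting the whole file per document is wasteful for large corpora; a real system would append to a durable vector store instead. The design point is the same either way: the embedding service holds no indexing state, so restarting it and rerunning the job picks up where it stopped.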