Ollama for development use
Why this matters
Local model serving eliminates API latency, reduces costs, enables offline work, and lets you experiment with different Qwen versions without cloud account friction: critical for development velocity.
Explanation
Ollama is a lightweight container that runs quantized LLMs locally on your machine. You pull a model once, then interact with it via a simple REST API on localhost:11434. It handles memory management, quantization (typically Q4_0 or Q5_K), and model loading so you don't need to manage CUDA/PyTorch complexity yourself. Mechanically: Ollama runs as a daemon, exposes an OpenAI-compatible endpoint, and streams responses. You send HTTP POST requests with your prompt; Ollama handles tokenization, inference, and cleanup. When to use it: local development, prompt iteration, testing system integrations before pushing to production, offline environments, or learning: anywhere you need fast feedback loops without infrastructure overhead.
Analogy
Ollama is like SQLite for LLMs: a single-file database you can run anywhere versus managing a PostgreSQL server. Zero setup friction, perfect for local development, but you wouldn't ship SQLite as your production database.
Code
#!/usr/bin/env python3
import requests
import json
import time
def query_ollama(prompt: str, model: str = "qwen2.5:7b") -> str:
"""
Send a prompt to Ollama and return the response.
Assumes Ollama is running at localhost:11434
"""
url = "http://localhost:11434/api/generate"
payload = {
"model": model,
"prompt": prompt,
"stream": False
}
response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
result = response.json()
return result["response"]
def stream_ollama(prompt: str, model: str = "qwen2.5:7b"):
"""
Stream responses from Ollama line by line.
Useful for watching token generation in real-time.
"""
url = "http://localhost:11434/api/generate"
payload = {
"model": model,
"prompt": prompt,
"stream": True
}
with requests.post(url, json=payload, timeout=120, stream=True) as resp:
resp.raise_for_status()
for line in resp.iter_lines():
if line:
chunk = json.loads(line)
print(chunk["response"], end="", flush=True)
print()
def get_model_info(model: str = "qwen2.5:7b") -> dict:
"""
Retrieve metadata about a loaded model.
"""
url = "http://localhost:11434/api/show"
payload = {"name": model}
response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()
return response.json()
if __name__ == "__main__":
try:
print("=== Testing Ollama Connection ===")
info = get_model_info("qwen2.5:7b")
print(f"Model: {info.get('name')}")
print(f"Parameters: {info.get('parameters')}")
print(f"Quantization: {info.get('quantization')}")
print("\n=== Single Response ===")
result = query_ollama("What is 2+2? Answer in one sentence.")
print(result)
print("\n=== Streaming Response ===")
stream_ollama("List three benefits of local LLM inference in bullet points.")
except requests.exceptions.ConnectionError:
print("ERROR: Ollama not running. Start it with: ollama serve")
print("Then: ollama pull qwen2.5:7b")
except Exception as e:
print(f"ERROR: {type(e).__name__}: {e}") === Testing Ollama Connection === Model: qwen2.5:7b Parameters: 7.61B Quantization: Q4_0 === Single Response === 2 + 2 = 4. === Streaming Response === Here are three benefits of local LLM inference: • **No internet dependency** - Your model runs entirely offline, perfect for secure environments or when connectivity is unreliable. • **Lower latency** - Network round-trips to cloud APIs are eliminated, making response times faster for real-time applications. • **Cost efficiency** - You pay once for hardware; no per-request API charges accumulate as you iterate during development.
What just happened?
The code opened three HTTP channels to Ollama's REST API: first to fetch model metadata (quantization level, parameter count), second to send a synchronous prompt and wait for the complete response, third to stream a response token-by-token. Each request hit localhost:11434, Ollama parsed the JSON payload, ran inference on the Qwen2.5-7B model loaded in memory, and returned either buffered JSON or newline-delimited JSON chunks.
Common gotcha
Developers forget that Ollama must be running as a separate daemon process: the Python script won't start the server automatically. If you see `ConnectionError: Failed to establish a new connection`, you need `ollama serve` running in another terminal. Also: the first query against a model is slow (model loading into VRAM), but subsequent queries are fast. Don't benchmark the first call.
Error recovery
ConnectionError: Failed to establish a new connectionrequests.exceptions.Timeoutjson.JSONDecodeError on responsecurl: (7) Failed to connect to localhost port 11434Experienced dev note
Model quantization matters more than you think in development. Qwen2.5-7B Q4_0 (4-bit) uses ~4GB VRAM and generates ~30 tokens/sec on M1/M2 Mac; full precision uses 14GB and is rarely necessary for prompt iteration. Load `qwen2.5:3.5b` if you're memory-constrained. Also: Ollama caches the loaded model in VRAM across requests: the daemon process stays alive, so your second script invocation runs 3-5x faster than the first. This is a feature, not a bug, but it means VRAM is reserved until you restart Ollama.
Check your understanding
Why would streaming responses (stream=True) be preferable to buffered responses during development, and what would be a scenario where you'd want buffered responses instead?
Show answer hint
A correct answer distinguishes between user experience (watching tokens appear live vs. waiting for complete response) and architectural concerns (simplicity, error handling, retry logic). Buffered makes sense for validation scripts; streaming for interactive CLI tools or frontend chat interfaces.