Code Intermediate medium · 6 min

Ollama for development use

What you will learn

Run Qwen models locally via Ollama without cloud dependencies for faster iteration and offline development.

Why this matters

Local model serving eliminates API latency, reduces costs, enables offline work, and lets you experiment with different Qwen versions without cloud account friction: critical for development velocity.

Skip if: Don't use Ollama for production inference at scale: use dedicated inference servers (vLLM, TGI) or cloud APIs. Don't use it if you need GPU load balancing across machines or strict SLA compliance.

Explanation

Ollama is a lightweight container that runs quantized LLMs locally on your machine. You pull a model once, then interact with it via a simple REST API on localhost:11434. It handles memory management, quantization (typically Q4_0 or Q5_K), and model loading so you don't need to manage CUDA/PyTorch complexity yourself. Mechanically: Ollama runs as a daemon, exposes an OpenAI-compatible endpoint, and streams responses. You send HTTP POST requests with your prompt; Ollama handles tokenization, inference, and cleanup. When to use it: local development, prompt iteration, testing system integrations before pushing to production, offline environments, or learning: anywhere you need fast feedback loops without infrastructure overhead.

Analogy

Ollama is like SQLite for LLMs: a single-file database you can run anywhere versus managing a PostgreSQL server. Zero setup friction, perfect for local development, but you wouldn't ship SQLite as your production database.

Code

python

#!/usr/bin/env python3
import requests
import json
import time

def query_ollama(prompt: str, model: str = "qwen2.5:7b") -> str:
    """
    Send a prompt to Ollama and return the response.
    Assumes Ollama is running at localhost:11434
    """
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    
    result = response.json()
    return result["response"]

def stream_ollama(prompt: str, model: str = "qwen2.5:7b"):
    """
    Stream responses from Ollama line by line.
    Useful for watching token generation in real-time.
    """
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": True
    }
    
    with requests.post(url, json=payload, timeout=120, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                print(chunk["response"], end="", flush=True)
    print()

def get_model_info(model: str = "qwen2.5:7b") -> dict:
    """
    Retrieve metadata about a loaded model.
    """
    url = "http://localhost:11434/api/show"
    payload = {"name": model}
    
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()
    
    return response.json()

if __name__ == "__main__":
    try:
        print("=== Testing Ollama Connection ===")
        info = get_model_info("qwen2.5:7b")
        print(f"Model: {info.get('name')}")
        print(f"Parameters: {info.get('parameters')}")
        print(f"Quantization: {info.get('quantization')}")
        
        print("\n=== Single Response ===")
        result = query_ollama("What is 2+2? Answer in one sentence.")
        print(result)
        
        print("\n=== Streaming Response ===")
        stream_ollama("List three benefits of local LLM inference in bullet points.")
        
    except requests.exceptions.ConnectionError:
        print("ERROR: Ollama not running. Start it with: ollama serve")
        print("Then: ollama pull qwen2.5:7b")
    except Exception as e:
        print(f"ERROR: {type(e).__name__}: {e}")

Output

=== Testing Ollama Connection ===
Model: qwen2.5:7b
Parameters: 7.61B
Quantization: Q4_0

=== Single Response ===
2 + 2 = 4.

=== Streaming Response ===
Here are three benefits of local LLM inference:

• **No internet dependency** - Your model runs entirely offline, perfect for secure environments or when connectivity is unreliable.

• **Lower latency** - Network round-trips to cloud APIs are eliminated, making response times faster for real-time applications.

• **Cost efficiency** - You pay once for hardware; no per-request API charges accumulate as you iterate during development.

What just happened?

The code opened three HTTP channels to Ollama's REST API: first to fetch model metadata (quantization level, parameter count), second to send a synchronous prompt and wait for the complete response, third to stream a response token-by-token. Each request hit localhost:11434, Ollama parsed the JSON payload, ran inference on the Qwen2.5-7B model loaded in memory, and returned either buffered JSON or newline-delimited JSON chunks.

Common gotcha

Developers forget that Ollama must be running as a separate daemon process: the Python script won't start the server automatically. If you see `ConnectionError: Failed to establish a new connection`, you need `ollama serve` running in another terminal. Also: the first query against a model is slow (model loading into VRAM), but subsequent queries are fast. Don't benchmark the first call.

Error recovery

ConnectionError: Failed to establish a new connection

Ollama daemon is not running. Execute `ollama serve` in another terminal before running your Python script. Verify with `curl http://localhost:11434/api/tags`.

requests.exceptions.Timeout

Model inference took longer than timeout (default 60s for generate, 120s for stream). Either increase timeout parameter, use a smaller model (qwen2.5:3.5b), or reduce prompt complexity.

json.JSONDecodeError on response

Ollama returned non-JSON output, usually because the model name doesn't exist locally. Run `ollama list` to check installed models, then `ollama pull qwen2.5:7b` to fetch it.

curl: (7) Failed to connect to localhost port 11434

Same root cause as ConnectionError: Ollama server not running. Confirm it's listening with `netstat -tuln | grep 11434` (Linux/Mac) or `netstat -ano | findstr 11434` (Windows).

Experienced dev note

Model quantization matters more than you think in development. Qwen2.5-7B Q4_0 (4-bit) uses ~4GB VRAM and generates ~30 tokens/sec on M1/M2 Mac; full precision uses 14GB and is rarely necessary for prompt iteration. Load `qwen2.5:3.5b` if you're memory-constrained. Also: Ollama caches the loaded model in VRAM across requests: the daemon process stays alive, so your second script invocation runs 3-5x faster than the first. This is a feature, not a bug, but it means VRAM is reserved until you restart Ollama.

Check your understanding

Why would streaming responses (stream=True) be preferable to buffered responses during development, and what would be a scenario where you'd want buffered responses instead?

Show answer hint

A correct answer distinguishes between user experience (watching tokens appear live vs. waiting for complete response) and architectural concerns (simplicity, error handling, retry logic). Buffered makes sense for validation scripts; streaming for interactive CLI tools or frontend chat interfaces.

VERSION Ollama 0.1.x had unstable API; 0.3.x (current stable) introduced the OpenAI-compatible /v1/chat/completions endpoint. The code above uses /api/generate which is stable across all versions. If using Qwen via OpenAI-compatible endpoint, use /v1/chat/completions (model format stays 'qwen2.5:7b').

Structuring prompts for Qwen reasoning: how to craft system messages and multi-turn conversations that leverage Qwen's strengths when working locally.

Community Notes

No notes yetBe the first to share a version-specific fix or tip.