Code intermediate · 3 min read

How to stream an Ollama response in Python

Direct answer
Use the Ollama Python SDK's chat method with stream=True and iterate over the generator it returns: each chunk from ollama.chat(model="llama2", messages=[...], stream=True) carries a partial piece of the reply in chunk["message"]["content"].

Setup

Install
bash
pip install ollama
Imports
python
import ollama

Examples

In: Tell me a joke about cats.
Out: Why don't cats play poker in the jungle? Too many cheetahs!
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be both 0 and 1 at the same time, enabling faster problem solving.
In: Summarize the plot of 'The Great Gatsby'.
Out: A mysterious millionaire, Gatsby, pursues his lost love Daisy in 1920s New York, leading to tragedy.

Integration steps

  1. Install the Ollama Python SDK.
  2. Import the ollama module.
  3. Call the streaming chat method with your prompt and model name.
  4. Iterate over the streaming response generator to receive partial message chunks.
  5. Concatenate or process the streamed chunks as they arrive for real-time display.
  6. Handle any exceptions or stream termination gracefully.
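The steps above can be sketched as two small helpers; collect_stream and safe_chat are hypothetical names, and ollama.ResponseError is the SDK's exception for server-side failures (e.g. a model that has not been pulled):

```python
def collect_stream(chunks):
    """Steps 4-5: concatenate partial message chunks as they arrive."""
    text = ""
    for chunk in chunks:
        partial = chunk["message"]["content"]
        print(partial, end="", flush=True)
        text += partial
    print()
    return text

def safe_chat(model, prompt):
    """Steps 3 and 6: open the stream and handle failures gracefully."""
    import ollama  # deferred so collect_stream stays testable without a server
    try:
        stream = ollama.chat(model=model,
                             messages=[{"role": "user", "content": prompt}],
                             stream=True)
        return collect_stream(stream)
    except ollama.ResponseError as err:
        # Raised when the Ollama server reports a problem
        print(f"Ollama error: {err.error}")
        return ""
```

Keeping the concatenation logic separate from the network call makes it easy to unit-test with fake chunks.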

Full code

python
import ollama

# Define the prompt and model
prompt = "Tell me a joke about cats."
model = "llama2"

print("Streaming response:")
response_text = ""

# Stream the response from Ollama
for chunk in ollama.chat(model=model, messages=[{"role": "user", "content": prompt}], stream=True):
    # Each chunk nests the partial text under the "message" key
    partial = chunk["message"]["content"]
    print(partial, end="", flush=True)
    response_text += partial

print("\nFull response received.")
output
Streaming response:
Why don't cats play poker in the jungle? Too many cheetahs!
Full response received.

API trace

Request
json
{"model": "llama2", "messages": [{"role": "user", "content": "Tell me a joke about cats."}], "stream": true}
Response
json
{"model": "llama2", "message": {"role": "assistant", "content": "partial text chunk"}, "done": false}
Extract: Iterate over the stream and concatenate chunk["message"]["content"]
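The trace maps directly onto Ollama's REST endpoint (POST http://localhost:11434/api/chat), which streams newline-delimited JSON. A minimal sketch without the SDK, assuming the default local port; parse_chunk_line is a hypothetical helper:

```python
import json

def parse_chunk_line(line):
    """Decode one newline-delimited JSON chunk into its partial text."""
    chunk = json.loads(line)
    if chunk.get("done"):
        return ""  # the final bookkeeping chunk carries no new text
    return chunk["message"]["content"]

def raw_stream(model, prompt):
    """Stream /api/chat over plain HTTP, yielding partial text pieces."""
    import requests  # third-party; requires a running Ollama server
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}],
               "stream": True}
    with requests.post("http://localhost:11434/api/chat",
                       json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield parse_chunk_line(line)
```

This is what the SDK does under the hood; prefer the SDK unless you need to avoid the dependency.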

Variants

Non-streaming synchronous call

Use this when you want the full response at once; it is simpler for short prompts.

python
import ollama

response = ollama.chat(model="llama2", messages=[{"role": "user", "content": "Tell me a joke about cats."}])
print(response["message"]["content"])
Async streaming with asyncio

Use for concurrent applications or when integrating with async frameworks.

python
import asyncio
from ollama import AsyncClient

async def stream_response():
    # AsyncClient.chat(..., stream=True) resolves to an async generator
    messages = [{"role": "user", "content": "Tell me a joke about cats."}]
    async for chunk in await AsyncClient().chat(model="llama2", messages=messages, stream=True):
        partial = chunk["message"]["content"]
        print(partial, end="", flush=True)

asyncio.run(stream_response())

Performance

Latency: ~500ms to start streaming for typical Ollama models
Cost: Free; Ollama runs locally with no cloud pricing
Rate limits: None; runs locally on your hardware
  • Keep prompts concise to reduce token usage.
  • Use streaming to start processing output early and reduce perceived latency.
  • Cache frequent queries to avoid repeated token consumption.
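The caching tip can be sketched with a plain dict keyed on (model, prompt); cached_chat is a hypothetical helper, and the chat_fn parameter exists only so the cache logic can be exercised without a live server:

```python
_cache = {}

def cached_chat(model, prompt, chat_fn=None):
    """Memoize full (non-streaming) answers per (model, prompt) pair."""
    if chat_fn is None:
        import ollama  # deferred so the cache logic is testable offline
        chat_fn = ollama.chat
    key = (model, prompt)
    if key not in _cache:
        response = chat_fn(model=model,
                           messages=[{"role": "user", "content": prompt}])
        _cache[key] = response["message"]["content"]
    return _cache[key]
```

Note that model output varies between calls unless sampling is pinned down (e.g. temperature 0), so cache only where serving a repeated answer is acceptable.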
Approach        | Latency                    | Cost/call | Best for
Streaming       | ~500ms start + incremental | Free      | Real-time UI updates
Non-streaming   | ~1-2s full response        | Free      | Simple scripts or batch processing
Async streaming | ~500ms start + incremental | Free      | Concurrent or async apps

Quick tip

Always iterate over the streaming generator to handle partial responses for a responsive user experience.

Common mistake

Beginners often read chunk["content"] directly, but the streamed text is nested at chunk["message"]["content"]; the flat lookup raises KeyError on the first chunk.
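A defensive accessor avoids the KeyError entirely; safe_partial is a hypothetical helper name:

```python
def safe_partial(chunk):
    """Return the chunk's partial text, or "" if the keys are absent."""
    return chunk.get("message", {}).get("content", "")
```

This also tolerates the final bookkeeping chunk, which may carry an empty message.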

Verified 2026-04 · llama2