How to stream responses with LiteLLM
Quick answer
Call litellm's completion() function with stream=True to receive tokens as they are generated, then iterate over the returned generator to process partial outputs in real time.
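A minimal sketch (the model string ollama/llama3 is an example; substitute any model your backend serves):

from litellm import completion

# stream=True turns the return value into a generator of partial chunks
for chunk in completion(
    model="ollama/llama3",  # example model; any LiteLLM-routable model works
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)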
Prerequisites
- Python 3.8+
- pip install litellm
- A model backend LiteLLM can reach (the examples below assume a local Ollama server)
- Basic knowledge of sync or async Python programming
Setup
Install the litellm Python package and make sure you have a model backend it can reach. The examples below call a local Ollama server, which requires no API key; hosted providers such as OpenAI or Anthropic need their usual API keys set as environment variables (for example OPENAI_API_KEY).
pip install litellm
Step by step
This example demonstrates synchronous streaming of chat completions with litellm's completion() function. Passing stream=True makes it return a generator that yields partial responses (chunks) as they arrive, instead of a single complete response. The model string ollama/llama3 below is an example; substitute any model your backend serves.
from litellm import completion

# Assumes a local Ollama server on its default address (http://localhost:11434);
# "ollama/llama3" is an example model name
messages = [
    {"role": "user", "content": "Write a short poem about AI."}
]

print("Streaming response:")
response = completion(model="ollama/llama3", messages=messages, stream=True)
for chunk in response:
    # delta.content can be None on some chunks (e.g. the final one), so guard with `or ""`
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

Output
Streaming response: AI is bright, Learning day and night, Helping humans grow, With knowledge in tow.
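If you also need the assembled text once streaming finishes, accumulate the deltas as they arrive. A short sketch, reusing the chunk format from above:

# Assumes `response` is a freshly created streaming generator (see above)
parts = []
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    parts.append(delta)
    print(delta, end="", flush=True)
full_text = "".join(parts)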
Common variations
- Async streaming: use acompletion with async for inside an asyncio coroutine, as shown below.
- Different models: change the model parameter to any model string your setup supports (for example ollama/mistral, or a hosted model such as gpt-4o).
- Non-streaming: omit stream=True to get the full response at once (see the sketch at the end of this section).

The async variation uses acompletion, litellm's async counterpart to completion, under the same local-Ollama assumption:
import asyncio
from litellm import acompletion

async def async_stream():
    # Same assumptions as above: an example Ollama model on localhost:11434
    messages = [{"role": "user", "content": "Explain quantum computing briefly."}]
    response = await acompletion(model="ollama/llama3", messages=messages, stream=True)
    async for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()

asyncio.run(async_stream())

Output
Quantum computing uses quantum bits, which can be in multiple states simultaneously, enabling powerful parallel computations.
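For comparison, the non-streaming variation mentioned above returns one complete response object (same assumed model):

from litellm import completion

# Without stream=True, completion() returns the full response at once
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Explain quantum computing briefly."}],
)
print(response.choices[0].message.content)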
Troubleshooting
- If streaming yields no output, verify the backend your model string points at is running and reachable. For ollama/... models that is the Ollama server, which listens on port 11434 by default.
- For connection errors, check firewall or network settings blocking localhost or your server address; a try/except sketch follows this list.
- If partial tokens are missing, make sure you iterate over the streaming generator itself, print with flush=True, and guard against chunks whose delta.content is None.
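To surface connection problems explicitly, you can wrap the call in a try/except. A minimal sketch, assuming litellm's OpenAI-style exception mapping (litellm.APIConnectionError) and the same example model as above:

import litellm
from litellm import completion

messages = [{"role": "user", "content": "ping"}]
try:
    response = completion(model="ollama/llama3", messages=messages, stream=True)
    for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
except litellm.APIConnectionError as err:
    # Raised when the backend (here, the assumed local Ollama server) is unreachable
    print(f"Could not reach the model backend: {err}")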
Key Takeaways
- Pass stream=True to completion() to enable streaming with LiteLLM.
- Iterate over the returned generator to process tokens as they arrive in real time.
- No API key is needed when LiteLLM targets a local backend such as Ollama (default port 11434); hosted providers require their usual keys.
- Async streaming is supported via acompletion with async for and asyncio.
- Check server connectivity if streaming does not produce output.