How to stream responses with LiteLLM
Quick answer
Call litellm's completion() function with stream=True to receive tokens as they are generated, then iterate over the returned generator to process partial outputs in real time.
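A minimal sketch (the model string ollama/llama3 is an example; substitute any model your backend serves):

from litellm import completion

# stream=True turns the return value into a generator of partial chunks
for chunk in completion(
    model="ollama/llama3",  # example model; any LiteLLM-routable model works
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)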
Prerequisites
- Python 3.8+
- pip install litellm
- A model backend LiteLLM can reach (the examples below assume a local Ollama server)
- Basic knowledge of sync or async Python programming
Setup
Install the litellm Python package and make sure you have a model backend it can reach. The examples below call a local Ollama server, which requires no API key; hosted providers such as OpenAI or Anthropic need their usual API keys set as environment variables (for example OPENAI_API_KEY).
pip install litellm
Step by step
This example demonstrates synchronous streaming of chat completions with litellm's completion() function. Passing stream=True makes it return a generator that yields partial responses (chunks) as they arrive, instead of a single complete response. The model string ollama/llama3 below is an example; substitute any model your backend serves.
from litellm import completion

# Assumes a local Ollama server on its default address (http://localhost:11434);
# "ollama/llama3" is an example model name
messages = [
    {"role": "user", "content": "Write a short poem about AI."}
]

print("Streaming response:")
response = completion(model="ollama/llama3", messages=messages, stream=True)
for chunk in response:
    # delta.content can be None on some chunks (e.g. the final one), so guard with `or ""`
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

Output
Streaming response: AI is bright, Learning day and night, Helping humans grow, With knowledge in tow.
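If you also need the assembled text once streaming finishes, accumulate the deltas as they arrive. A short sketch, reusing the chunk format from above:

# Assumes `response` is a freshly created streaming generator (see above)
parts = []
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    parts.append(delta)
    print(delta, end="", flush=True)
full_text = "".join(parts)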
Common variations
- Async streaming: use acompletion with async for inside an asyncio coroutine, as shown below.
- Different models: change the model parameter to any model string your setup supports (for example ollama/mistral, or a hosted model such as gpt-4o).
- Non-streaming: omit stream=True to get the full response at once (see the sketch at the end of this section).

The async variation uses acompletion, litellm's async counterpart to completion, under the same local-Ollama assumption:
import asyncio
from litellm import acompletion

async def async_stream():
    # Same assumptions as above: an example Ollama model on localhost:11434
    messages = [{"role": "user", "content": "Explain quantum computing briefly."}]
    response = await acompletion(model="ollama/llama3", messages=messages, stream=True)
    async for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()

asyncio.run(async_stream())

Output
Quantum computing uses quantum bits, which can be in multiple states simultaneously, enabling powerful parallel computations.
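For comparison, the non-streaming variation mentioned above returns one complete response object (same assumed model):

from litellm import completion

# Without stream=True, completion() returns the full response at once
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Explain quantum computing briefly."}],
)
print(response.choices[0].message.content)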
Troubleshooting
- If streaming yields no output, verify the backend your model string points at is running and reachable. For ollama/... models that is the Ollama server, which listens on port 11434 by default.
- For connection errors, check firewall or network settings blocking localhost or your server address; a try/except sketch follows this list.
- If partial tokens are missing, make sure you iterate over the streaming generator itself, print with flush=True, and guard against chunks whose delta.content is None.
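To surface connection problems explicitly, you can wrap the call in a try/except. A minimal sketch, assuming litellm's OpenAI-style exception mapping (litellm.APIConnectionError) and the same example model as above:

import litellm
from litellm import completion

messages = [{"role": "user", "content": "ping"}]
try:
    response = completion(model="ollama/llama3", messages=messages, stream=True)
    for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
except litellm.APIConnectionError as err:
    # Raised when the backend (here, the assumed local Ollama server) is unreachable
    print(f"Could not reach the model backend: {err}")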
Key Takeaways
- Pass stream=True to completion() to enable streaming with LiteLLM.
- Iterate over the returned generator to process tokens as they arrive in real time.
- No API key is needed when LiteLLM targets a local backend such as Ollama (default port 11434); hosted providers require their usual keys.
- Async streaming is supported via acompletion with async for and asyncio.
- Check server connectivity if streaming does not produce output.