How-to · Beginner · 3 min read

How to stream Cerebras responses

Quick answer
Use the OpenAI Python SDK with your Cerebras API key and set stream=True in chat.completions.create. Iterate over the streamed chunks to receive partial response content in real time from llama3.3-70b or another Cerebras model.

Prerequisites

  • Python 3.8+
  • Cerebras API key
  • pip install "openai>=1.0"

Setup

Install the official openai Python package (v1 or later) and set your Cerebras API key as an environment variable. This example uses the OpenAI client with the Cerebras base URL.

bash
pip install "openai>=1.0"
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
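
The examples in this guide read the key from the CEREBRAS_API_KEY environment variable. One way to set it in bash or zsh (replace the placeholder with your actual key):

```shell
# Set the API key for the current shell session.
export CEREBRAS_API_KEY="your-api-key-here"

# Confirm it is visible to child processes such as Python.
echo "${CEREBRAS_API_KEY:+key is set}"
```

Add the export line to your shell profile (e.g. ~/.bashrc) to persist it across sessions.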

Step by step

This example demonstrates streaming a chat completion from Cerebras using the llama3.3-70b model. The stream=True parameter enables real-time partial output. Each chunk's delta.content contains the incremental text.

python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1"
)

messages = [
    {"role": "user", "content": "Explain the benefits of streaming AI responses."}
]

stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=messages,
    stream=True
)

print("Streaming response:")
for chunk in stream:
    if not chunk.choices:
        continue  # skip any keep-alive chunks that carry no choices
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
output
Streaming response:
Streaming AI responses reduce latency and improve user experience by delivering partial outputs as soon as they are generated.
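
If you also want the full reply once streaming finishes, collect each fragment as it is printed and join them afterwards. A minimal sketch of the pattern; the fragments list here stands in for the delta.content values the stream yields:

```python
# Accumulate streamed fragments into the complete response text.
# `fragments` simulates the delta.content values from the stream loop above.
fragments = ["Streaming ", "reduces ", "latency."]

parts = []
for text in fragments:
    parts.append(text)                 # keep the fragment for later
    print(text, end="", flush=True)    # still display it in real time
print()

full_text = "".join(parts)             # the complete response
```
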

Common variations

  • Use other Cerebras models like llama3.1-8b by changing the model parameter.
  • For non-streaming, omit stream=True to get the full response at once.
  • Use async streaming with the AsyncOpenAI client and an async for loop if your environment supports it.
python
import asyncio
import os
from openai import AsyncOpenAI

async def async_stream():
    client = AsyncOpenAI(
        api_key=os.environ["CEREBRAS_API_KEY"],
        base_url="https://api.cerebras.ai/v1"
    )

    messages = [
        {"role": "user", "content": "Explain the benefits of streaming AI responses asynchronously."}
    ]

    stream = await client.chat.completions.create(
        model="llama3.3-70b",
        messages=messages,
        stream=True
    )

    print("Async streaming response:")
    async for chunk in stream:
        if not chunk.choices:
            continue  # skip any keep-alive chunks that carry no choices
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    print()

asyncio.run(async_stream())
output
Async streaming response:
Streaming AI responses asynchronously allows efficient handling of real-time data with non-blocking I/O.

Troubleshooting

  • If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly.
  • For connection issues, check your network and the Cerebras API status.
  • If streaming yields no output, confirm the stream=True parameter is set and your model supports streaming.
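
The most common failure is a missing or empty key. A quick sketch that checks whether the environment variable is visible to Python (api_key_present is a hypothetical helper name):

```python
import os

def api_key_present() -> bool:
    """Return True if CEREBRAS_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("CEREBRAS_API_KEY"))

if api_key_present():
    print("CEREBRAS_API_KEY is set.")
else:
    print("CEREBRAS_API_KEY is missing - export it before running the examples.")
```
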

Key takeaways

  • Use stream=True in chat.completions.create to enable streaming with Cerebras models.
  • Iterate over the response stream to receive partial delta.content chunks in real time.
  • Set base_url="https://api.cerebras.ai/v1" and use your CEREBRAS_API_KEY environment variable for authentication.
  • Async streaming is supported with an async client and async for loops for non-blocking applications.
  • Check environment variables and network connectivity if streaming does not work as expected.
Verified 2026-04 · llama3.3-70b, llama3.1-8b