How-to · Beginner · 3 min read

Cerebras for real-time AI applications

Quick answer
Use the OpenAI SDK with base_url="https://api.cerebras.ai/v1" and your CEREBRAS_API_KEY to access Cerebras models for real-time AI applications. The API supports synchronous and streaming chat completions with models like llama3.3-70b for low-latency inference.

PREREQUISITES

  • Python 3.8+
  • Cerebras API key (set as CEREBRAS_API_KEY environment variable)
  • pip install "openai>=1.0" (quoted so the shell does not interpret >=)

Setup

Install the openai Python package and set your Cerebras API key as an environment variable for authentication.
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
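
Before making any requests, it helps to fail fast when the key is missing rather than getting an opaque 401 later. A minimal sketch; the `require_env` helper below is illustrative, not part of the OpenAI SDK:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; run `export {name}=<your key>` first."
        )
    return value

# Usage: api_key = require_env("CEREBRAS_API_KEY")
```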

Step by step

This example shows how to call the Cerebras API synchronously for a chat completion using the llama3.3-70b model, suitable for real-time AI applications.
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Explain real-time AI applications."}]
)

print("Response:", response.choices[0].message.content)
output
Response: Real-time AI applications process data instantly to provide immediate insights or actions, such as live chatbots, autonomous vehicles, and fraud detection.

Common variations

You can use streaming to receive tokens as they are generated, which lowers perceived latency. You can also switch to a smaller model such as llama3.1-8b for faster responses, or make asynchronous calls with asyncio and the AsyncOpenAI client.
python
import os
import asyncio
from openai import AsyncOpenAI

async def stream_chat():
    # AsyncOpenAI is the async client in openai>=1.0; passing stream=True
    # to chat.completions.create returns an async iterator of chunks.
    client = AsyncOpenAI(api_key=os.environ["CEREBRAS_API_KEY"],
                         base_url="https://api.cerebras.ai/v1")
    stream = await client.chat.completions.create(
        model="llama3.3-70b",
        messages=[{"role": "user", "content": "Stream tokens for real-time AI."}],
        stream=True
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(stream_chat())
output
Real-time AI streams each token as soon as it is generated, so interactive applications can display partial output immediately instead of waiting for the full response.
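
If you do not need asyncio, the synchronous client also streams: `stream=True` makes `create` return an ordinary iterator of chunks. A sketch that wraps this in a generator so the streaming logic can be reused and tested independently of the network; the `stream_completion` name is ours, not part of any SDK:

```python
import os

def stream_completion(client, model, prompt):
    """Yield text deltas from a streaming chat completion, skipping empty chunks."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

# Usage against the live endpoint (requires CEREBRAS_API_KEY):
#   from openai import OpenAI
#   client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"],
#                   base_url="https://api.cerebras.ai/v1")
#   for token in stream_completion(client, "llama3.3-70b", "Hello"):
#       print(token, end="", flush=True)
```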

Troubleshooting

If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly. For timeout issues, reduce max_tokens or switch to smaller models like llama3.1-8b. Check network connectivity to https://api.cerebras.ai/v1 if requests fail.
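
For transient failures such as timeouts or dropped connections, a small retry with exponential backoff is often enough. A sketch; `with_retries` is an illustrative helper, not an SDK feature, though the exception classes named in the usage comment do exist in openai>=1.0:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(); on a listed exception, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Usage with the openai client, retrying only transient errors:
#   import openai
#   reply = with_retries(
#       lambda: client.chat.completions.create(
#           model="llama3.3-70b", messages=messages),
#       retry_on=(openai.APIConnectionError, openai.APITimeoutError),
#   )
```

Keep `AuthenticationError` out of `retry_on`: a bad key will not fix itself, so retrying it only delays the failure.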

Key Takeaways

  • Use the OpenAI SDK with base_url set to Cerebras API endpoint for real-time AI.
  • Streaming completions reduce latency for interactive applications.
  • Choose model size based on latency and resource requirements.
  • Always set your Cerebras API key in the environment variable CEREBRAS_API_KEY.
  • Handle authentication and network errors proactively for stable real-time use.
Verified 2026-04 · llama3.3-70b, llama3.1-8b