How-to · Beginner · 3 min read

Cerebras for real-time AI applications

Quick answer
Use the OpenAI SDK with base_url="https://api.cerebras.ai/v1" and your CEREBRAS_API_KEY to access Cerebras models for real-time AI applications. The API supports synchronous and streaming chat completions with models like llama3.3-70b for low-latency inference.

PREREQUISITES

  • Python 3.8+
  • Cerebras API key (set as CEREBRAS_API_KEY environment variable)
  • pip install "openai>=1.0" (quoted so the shell does not interpret >=)

Setup

Install the openai Python package and set your Cerebras API key as an environment variable for authentication.
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
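
Before making any requests, it helps to fail fast when the key is missing rather than getting an opaque 401 later. A minimal sketch; the `require_env` helper below is illustrative, not part of the OpenAI SDK:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; run `export {name}=<your key>` first."
        )
    return value

# Usage: api_key = require_env("CEREBRAS_API_KEY")
```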

Step by step

This example shows how to call the Cerebras API synchronously for a chat completion using the llama3.3-70b model, suitable for real-time AI applications.
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Explain real-time AI applications."}]
)

print("Response:", response.choices[0].message.content)
output
Response: Real-time AI applications process data instantly to provide immediate insights or actions, such as live chatbots, autonomous vehicles, and fraud detection.

Common variations

You can use streaming to receive tokens as they are generated, which lowers perceived latency. You can also switch to a smaller model such as llama3.1-8b for faster responses, or make asynchronous calls with asyncio and the AsyncOpenAI client.
python
import os
import asyncio
from openai import AsyncOpenAI

async def stream_chat():
    # AsyncOpenAI is the async client in openai>=1.0; passing stream=True
    # to chat.completions.create returns an async iterator of chunks.
    client = AsyncOpenAI(api_key=os.environ["CEREBRAS_API_KEY"],
                         base_url="https://api.cerebras.ai/v1")
    stream = await client.chat.completions.create(
        model="llama3.3-70b",
        messages=[{"role": "user", "content": "Stream tokens for real-time AI."}],
        stream=True
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(stream_chat())
output
Real-time AI streams each token as soon as it is generated, so interactive applications can display partial output immediately instead of waiting for the full response.
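
If you do not need asyncio, the synchronous client also streams: `stream=True` makes `create` return an ordinary iterator of chunks. A sketch that wraps this in a generator so the streaming logic can be reused and tested independently of the network; the `stream_completion` name is ours, not part of any SDK:

```python
import os

def stream_completion(client, model, prompt):
    """Yield text deltas from a streaming chat completion, skipping empty chunks."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

# Usage against the live endpoint (requires CEREBRAS_API_KEY):
#   from openai import OpenAI
#   client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"],
#                   base_url="https://api.cerebras.ai/v1")
#   for token in stream_completion(client, "llama3.3-70b", "Hello"):
#       print(token, end="", flush=True)
```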

Troubleshooting

If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly. For timeout issues, reduce max_tokens or switch to smaller models like llama3.1-8b. Check network connectivity to https://api.cerebras.ai/v1 if requests fail.
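
For transient failures such as timeouts or dropped connections, a small retry with exponential backoff is often enough. A sketch; `with_retries` is an illustrative helper, not an SDK feature, though the exception classes named in the usage comment do exist in openai>=1.0:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(); on a listed exception, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Usage with the openai client, retrying only transient errors:
#   import openai
#   reply = with_retries(
#       lambda: client.chat.completions.create(
#           model="llama3.3-70b", messages=messages),
#       retry_on=(openai.APIConnectionError, openai.APITimeoutError),
#   )
```

Keep `AuthenticationError` out of `retry_on`: a bad key will not fix itself, so retrying it only delays the failure.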

Key Takeaways

  • Use the OpenAI SDK with base_url set to Cerebras API endpoint for real-time AI.
  • Streaming completions reduce latency for interactive applications.
  • Choose model size based on latency and resource requirements.
  • Always set your Cerebras API key in the environment variable CEREBRAS_API_KEY.
  • Handle authentication and network errors proactively for stable real-time use.
Verified 2026-04 · llama3.3-70b, llama3.1-8b