How-to · beginner · 3 min read

How to use Llama on Cerebras

Quick answer
Use the OpenAI Python SDK with base_url set to Cerebras's OpenAI-compatible endpoint (https://api.cerebras.ai/v1) and a Llama model such as llama3.3-70b. Instantiate the client with your Cerebras API key from os.environ, then call chat.completions.create with your messages to interact with the model.

PREREQUISITES

  • Python 3.8+
  • Cerebras API key set in environment variable CEREBRAS_API_KEY
  • pip install openai>=1.0

Setup

Install the openai Python package and set your Cerebras API key as an environment variable. Cerebras uses an OpenAI-compatible API with a custom base_url.
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
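On macOS or Linux, export the key in your shell before running any scripts; the value below is a placeholder, substitute your real key:

```shell
# Export the key for the current shell session (placeholder value shown)
export CEREBRAS_API_KEY="your-api-key-here"

# Confirm Python can see it before making any API calls
python3 -c 'import os; print("CEREBRAS_API_KEY" in os.environ)'
```

On Windows, use `set CEREBRAS_API_KEY=...` in cmd or `$env:CEREBRAS_API_KEY = "..."` in PowerShell instead.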

Step by step

Use the OpenAI client with base_url set to Cerebras's API endpoint. Specify the Llama model name and send chat messages. The example below sends a prompt and prints the assistant's reply.
python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1"
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Hello, how do I use Llama on Cerebras?"}]
)

print("Assistant reply:", response.choices[0].message.content)
output
Assistant reply: Hello! You can use Llama models on Cerebras by connecting via their OpenAI-compatible API endpoint and sending chat requests as shown.
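Chat completions are stateless: the model only sees what you send, so multi-turn conversations work by resending the full history on every request. A minimal sketch, using a hypothetical add_turn helper (not part of the SDK) to manage the messages list:

```python
def add_turn(history, role, content):
    """Append one chat turn; the whole history is resent on every request."""
    history.append({"role": role, "content": content})
    return history

# Start with a system prompt, then alternate user and assistant turns.
history = [{"role": "system", "content": "You are a concise assistant."}]
add_turn(history, "user", "Which Llama models does Cerebras offer?")
# After each API reply, append it so the next request keeps the context:
add_turn(history, "assistant", "Models such as llama3.3-70b and llama3.1-8b.")

# This history is what you pass as messages= in chat.completions.create.
print([m["role"] for m in history])  # ['system', 'user', 'assistant']
```

Dropping the oldest turns (while keeping the system prompt) is a common way to stay within the model's context window as the history grows.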

Common variations

Use smaller Llama models like llama3.1-8b by changing the model parameter. For streaming responses, add stream=True and iterate over the response. Use async calls with an async OpenAI client if your environment supports it.
python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # The async client mirrors the sync one but returns awaitables
    client = AsyncOpenAI(
        api_key=os.environ["CEREBRAS_API_KEY"],
        base_url="https://api.cerebras.ai/v1"
    )

    stream = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": "Stream a response from Llama on Cerebras."}],
        stream=True
    )

    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)

asyncio.run(main())
output
Streaming assistant reply text appears token by token in the console.
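Each streamed chunk carries a delta.content that may be None (typically on the final chunk), which is why the loop guards with `or ""`. A small illustration of assembling deltas into the full reply, using simulated values rather than a live API call:

```python
# Simulated values of chunk.choices[0].delta.content from a stream;
# the final chunk typically carries None rather than text.
deltas = ["Hel", "lo", " from", " Llama", None]

# Skip None/empty deltas and join the rest into the full reply.
reply = "".join(d for d in deltas if d)
print(reply)  # Hello from Llama
```

The same guard applies when accumulating the reply for logging or for appending it back into the conversation history.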

Troubleshooting

If you get authentication errors, verify your CEREBRAS_API_KEY environment variable is set correctly. For network errors, check your internet connection and that https://api.cerebras.ai/v1 is reachable. If the model name is invalid, confirm you are using a supported Llama model like llama3.3-70b or llama3.1-8b.
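A common failure mode is a script that crashes deep inside the SDK because the key was never exported. One way to fail fast with a clear message, sketched here as a hypothetical require_api_key helper:

```python
import os

def require_api_key(env_var="CEREBRAS_API_KEY"):
    """Return the API key, or raise a clear error before any network call."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running.")
    return key
```

Call it once at startup and pass the result as api_key= when constructing the client.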

Key Takeaways

  • Use the OpenAI SDK with Cerebras's base_url to access Llama models.
  • Set your Cerebras API key in the environment variable CEREBRAS_API_KEY.
  • Specify supported Llama model names like llama3.3-70b for chat completions.
  • Streaming and async calls are supported with the OpenAI SDK.
  • Check environment variables and network connectivity if errors occur.
Verified 2026-04 · llama3.3-70b, llama3.1-8b