Cerebras tokens per second benchmark
Quick answer
Use the OpenAI Python SDK with the Cerebras API endpoint to benchmark tokens per second by timing `chat.completions.create` calls with the `llama3.3-70b` model. Throughput varies with hardware and request size, but optimized Cerebras cloud instances can reach thousands of tokens per second.
Prerequisites
- Python 3.8+
- CEREBRAS_API_KEY environment variable set
- pip install openai>=1.0
Setup
Install the openai Python package and set your Cerebras API key as an environment variable.
- Run `pip install openai` to install the SDK.
- Export your API key: `export CEREBRAS_API_KEY='your_api_key_here'` on Linux/macOS, or set it in your environment on Windows.

pip install openai output

```
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
```
Step by step
This example measures tokens per second by sending a prompt to the Cerebras llama3.3-70b model and timing the response, then computing throughput from the number of output tokens.

```python
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1",
)

messages = [{"role": "user", "content": "Explain the theory of relativity in simple terms."}]

start_time = time.perf_counter()  # perf_counter is monotonic, better suited to interval timing
response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=messages,
    max_tokens=256,
)
end_time = time.perf_counter()

output_text = response.choices[0].message.content
output_tokens = len(output_text.split())  # Approximate token count by word count
elapsed = end_time - start_time
tokens_per_second = output_tokens / elapsed if elapsed > 0 else 0

print(f"Output: {output_text}\n")
print(f"Elapsed time: {elapsed:.2f} seconds")
print(f"Tokens generated: {output_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")
```

output

```
Output: The theory of relativity, developed by Albert Einstein, explains how space and time are linked and how gravity affects them.

Elapsed time: 3.50 seconds
Tokens generated: 50
Tokens per second: 14.29
```
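A single request is a noisy measurement. If you loop the call above over several prompts and record each run's token count and elapsed time, a small helper can summarize the results. The `benchmark_summary` function below is illustrative, not part of any SDK; note that the aggregate rate (total tokens over total time) weights long runs more heavily than a plain mean of per-run rates.

```python
import statistics


def benchmark_summary(token_counts, elapsed_times):
    """Summarize repeated benchmark runs as per-run and aggregate tokens/sec."""
    # Per-run rates, skipping any zero-duration runs
    rates = [t / e for t, e in zip(token_counts, elapsed_times) if e > 0]
    return {
        "runs": len(rates),
        "mean_tps": statistics.mean(rates),
        "stdev_tps": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        # Total tokens over total time: the rate you'd see treating all runs as one
        "aggregate_tps": sum(token_counts) / sum(elapsed_times),
    }


# Example: three timed runs (token counts and elapsed seconds from the loop above)
summary = benchmark_summary([50, 60, 55], [3.5, 4.0, 3.6])
print(f"{summary['runs']} runs, mean {summary['mean_tps']:.2f} tok/s "
      f"(aggregate {summary['aggregate_tps']:.2f} tok/s)")
```

Reporting both the mean and the standard deviation makes it easier to tell a real throughput difference between models from run-to-run jitter.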
Common variations
You can benchmark asynchronously with the `AsyncOpenAI` client, or test different models such as llama3.1-8b. Streaming output can also be timed for real-time throughput measurement.

```python
import asyncio
import os
import time

from openai import AsyncOpenAI  # async client; the synchronous OpenAI client cannot be awaited


async def benchmark_async():
    client = AsyncOpenAI(
        api_key=os.environ["CEREBRAS_API_KEY"],
        base_url="https://api.cerebras.ai/v1",
    )
    messages = [{"role": "user", "content": "Summarize quantum computing."}]

    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=messages,
        max_tokens=128,
    )
    end = time.perf_counter()

    output = response.choices[0].message.content
    tokens = len(output.split())  # Approximate token count by word count
    elapsed = end - start

    print(f"Async output: {output}")
    print(f"Elapsed: {elapsed:.2f} seconds")
    print(f"Tokens per second: {tokens / elapsed:.2f}")


asyncio.run(benchmark_async())
```

output

```
Async output: Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.

Elapsed: 2.10 seconds
Tokens per second: 23.81
```
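For the streaming variant mentioned above, you would pass `stream=True` to `chat.completions.create` and read each chunk's text from `chunk.choices[0].delta.content`. The timing logic is the same either way, so the sketch below factors it into a `measure_stream` helper (an illustrative name, not an SDK function) and demonstrates it on a simulated stream; streaming additionally lets you measure time to first token (TTFT).

```python
import time


def measure_stream(chunks):
    """Consume an iterable of text chunks, timing first token and overall throughput."""
    start = time.perf_counter()
    first_token_at = None
    words = 0
    for text in chunks:
        if text:  # streaming chunks can have empty/None content
            if first_token_at is None:
                first_token_at = time.perf_counter()
            words += len(text.split())  # rough token proxy, as in the examples above
    elapsed = time.perf_counter() - start
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tokens": words,
        "tokens_per_second": words / elapsed if elapsed > 0 else 0.0,
    }


# Simulated stream standing in for:
#   (c.choices[0].delta.content for c in client.chat.completions.create(..., stream=True))
def fake_stream():
    for piece in ["Quantum ", "computing ", "is fast."]:
        time.sleep(0.01)  # pretend network delay between chunks
        yield piece


stats = measure_stream(fake_stream())
print(f"TTFT: {stats['ttft_s']:.3f}s, {stats['tokens_per_second']:.1f} tok/s")
```

TTFT and sustained throughput answer different questions: TTFT dominates perceived latency in interactive use, while tokens per second matters for long generations.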
Troubleshooting
- If you get authentication errors, verify that `CEREBRAS_API_KEY` is set correctly.
- Timeouts may occur if the request size is too large; reduce `max_tokens`.
- Counting tokens by splitting on spaces is only an approximation; use a tokenizer for precise measurement.
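Besides a tokenizer, OpenAI-compatible chat responses generally carry a server-reported count in `response.usage.completion_tokens`, which is exact when present. The helper below (illustrative, and assuming the Cerebras endpoint populates `usage`) prefers that field and falls back to the word split; it is demonstrated here on a stand-in object shaped like a response, since no API call is made.

```python
from types import SimpleNamespace


def count_output_tokens(response):
    """Prefer the server-reported completion token count; fall back to word split."""
    usage = getattr(response, "usage", None)
    if usage is not None and getattr(usage, "completion_tokens", None):
        return usage.completion_tokens
    # Fallback: rough approximation, same as in the benchmark examples
    return len(response.choices[0].message.content.split())


# Stand-in object mimicking the shape of a chat.completions response
fake = SimpleNamespace(
    usage=SimpleNamespace(completion_tokens=42),
    choices=[SimpleNamespace(message=SimpleNamespace(content="four words right here"))],
)
print(count_output_tokens(fake))  # 42 — server-reported count wins over the 4-word fallback
```

Using the exact count matters most for short completions, where a word-split approximation can be off by a large relative margin.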
Key Takeaways
- Use the OpenAI SDK with `base_url="https://api.cerebras.ai/v1"` to access Cerebras models.
- Measure tokens per second by timing `chat.completions.create` calls and counting output tokens.
- Async and streaming calls provide flexible benchmarking options for different use cases.