How-to · Beginner · 3 min read

Cerebras tokens per second benchmark

Quick answer
Use the OpenAI Python SDK pointed at the Cerebras API endpoint and benchmark tokens per second by timing chat.completions.create calls against the llama3.3-70b model. Throughput depends on model size, prompt length, and load, but Cerebras inference hardware can sustain thousands of tokens per second.

PREREQUISITES

  • Python 3.8+
  • CEREBRAS_API_KEY environment variable set
  • pip install openai>=1.0

Setup

Install the openai Python package and set your Cerebras API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Export your API key: export CEREBRAS_API_KEY='your_api_key_here' on Linux/macOS or set it in your environment on Windows.
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example measures tokens per second by sending a prompt to the Cerebras llama3.3-70b model and timing the response. It calculates throughput based on the token count in the output.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")

messages = [{"role": "user", "content": "Explain the theory of relativity in simple terms."}]

start_time = time.perf_counter()  # monotonic clock; preferred over time.time() for benchmarks
response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=messages,
    max_tokens=256
)
end_time = time.perf_counter()

output_text = response.choices[0].message.content
output_tokens = len(output_text.split())  # Approximate token count by word count
elapsed = end_time - start_time

tokens_per_second = output_tokens / elapsed if elapsed > 0 else 0

print(f"Output: {output_text}\n")
print(f"Elapsed time: {elapsed:.2f} seconds")
print(f"Tokens generated: {output_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")
output
Output: The theory of relativity, developed by Albert Einstein, explains how space and time are linked and how gravity affects them.

Elapsed time: 3.50 seconds
Tokens generated: 50
Tokens per second: 14.29

Common variations

You can benchmark concurrently with the async client, or test smaller models such as llama3.1-8b. Streaming responses can also be timed to separate time-to-first-token from sustained throughput.

python
import asyncio
import os
import time
from openai import AsyncOpenAI  # the async client is required for awaitable calls

async def benchmark_async():
    client = AsyncOpenAI(api_key=os.environ["CEREBRAS_API_KEY"], base_url="https://api.cerebras.ai/v1")
    messages = [{"role": "user", "content": "Summarize quantum computing."}]

    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=messages,
        max_tokens=128
    )
    end = time.perf_counter()

    output = response.choices[0].message.content
    tokens = len(output.split())  # approximate token count
    elapsed = end - start
    print(f"Async output: {output}")
    print(f"Elapsed: {elapsed:.2f} seconds")
    print(f"Tokens per second: {tokens / elapsed:.2f}")

asyncio.run(benchmark_async())
output
Async output: Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.
Elapsed: 2.10 seconds
Tokens per second: 23.81

Troubleshooting

  • If you get authentication errors, verify your CEREBRAS_API_KEY is set correctly.
  • Timeouts may occur if the request size is too large; reduce max_tokens.
  • Counting tokens by splitting on whitespace is only an approximation; read response.usage.completion_tokens from the API response (or run the model's tokenizer) for an exact count.

Key Takeaways

  • Use the OpenAI SDK with base_url="https://api.cerebras.ai/v1" to access Cerebras models.
  • Measure tokens per second by timing chat.completions.create calls and counting output tokens.
  • Async and streaming calls provide flexible benchmarking options for different use cases.
Verified 2026-04 · llama3.3-70b, llama3.1-8b