Cerebras tokens per second benchmark
Quick answer
Use the OpenAI Python SDK with the Cerebras API endpoint to benchmark tokens per second by timing `chat.completions.create` calls with the `llama3.3-70b` model. Throughput varies with hardware and request size, but optimized Cerebras cloud instances can reach thousands of tokens per second.
Prerequisites
- Python 3.8+
- CEREBRAS_API_KEY environment variable set
- pip install openai>=1.0
Setup
Install the openai Python package and set your Cerebras API key as an environment variable.
- Run `pip install openai` to install the SDK.
- Export your API key: `export CEREBRAS_API_KEY='your_api_key_here'` on Linux/macOS, or set it in your environment on Windows.

pip install openai output

```
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
```
Step by step
This example measures tokens per second by sending a prompt to the Cerebras llama3.3-70b model and timing the response, then computing throughput from the number of output tokens.

```python
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1",
)

messages = [{"role": "user", "content": "Explain the theory of relativity in simple terms."}]

start_time = time.perf_counter()  # perf_counter is monotonic, better suited to interval timing
response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=messages,
    max_tokens=256,
)
end_time = time.perf_counter()

output_text = response.choices[0].message.content
output_tokens = len(output_text.split())  # Approximate token count by word count
elapsed = end_time - start_time
tokens_per_second = output_tokens / elapsed if elapsed > 0 else 0

print(f"Output: {output_text}\n")
print(f"Elapsed time: {elapsed:.2f} seconds")
print(f"Tokens generated: {output_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")
```

output

```
Output: The theory of relativity, developed by Albert Einstein, explains how space and time are linked and how gravity affects them.

Elapsed time: 3.50 seconds
Tokens generated: 50
Tokens per second: 14.29
```
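A single request is a noisy measurement. If you loop the call above over several prompts and record each run's token count and elapsed time, a small helper can summarize the results. The `benchmark_summary` function below is illustrative, not part of any SDK; note that the aggregate rate (total tokens over total time) weights long runs more heavily than a plain mean of per-run rates.

```python
import statistics


def benchmark_summary(token_counts, elapsed_times):
    """Summarize repeated benchmark runs as per-run and aggregate tokens/sec."""
    # Per-run rates, skipping any zero-duration runs
    rates = [t / e for t, e in zip(token_counts, elapsed_times) if e > 0]
    return {
        "runs": len(rates),
        "mean_tps": statistics.mean(rates),
        "stdev_tps": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        # Total tokens over total time: the rate you'd see treating all runs as one
        "aggregate_tps": sum(token_counts) / sum(elapsed_times),
    }


# Example: three timed runs (token counts and elapsed seconds from the loop above)
summary = benchmark_summary([50, 60, 55], [3.5, 4.0, 3.6])
print(f"{summary['runs']} runs, mean {summary['mean_tps']:.2f} tok/s "
      f"(aggregate {summary['aggregate_tps']:.2f} tok/s)")
```

Reporting both the mean and the standard deviation makes it easier to tell a real throughput difference between models from run-to-run jitter.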
Common variations
You can benchmark asynchronously with the `AsyncOpenAI` client, or test different models such as llama3.1-8b. Streaming output can also be timed for real-time throughput measurement.

```python
import asyncio
import os
import time

from openai import AsyncOpenAI  # async client; the synchronous OpenAI client cannot be awaited


async def benchmark_async():
    client = AsyncOpenAI(
        api_key=os.environ["CEREBRAS_API_KEY"],
        base_url="https://api.cerebras.ai/v1",
    )
    messages = [{"role": "user", "content": "Summarize quantum computing."}]

    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=messages,
        max_tokens=128,
    )
    end = time.perf_counter()

    output = response.choices[0].message.content
    tokens = len(output.split())  # Approximate token count by word count
    elapsed = end - start

    print(f"Async output: {output}")
    print(f"Elapsed: {elapsed:.2f} seconds")
    print(f"Tokens per second: {tokens / elapsed:.2f}")


asyncio.run(benchmark_async())
```

output

```
Async output: Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.

Elapsed: 2.10 seconds
Tokens per second: 23.81
```
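For the streaming variant mentioned above, you would pass `stream=True` to `chat.completions.create` and read each chunk's text from `chunk.choices[0].delta.content`. The timing logic is the same either way, so the sketch below factors it into a `measure_stream` helper (an illustrative name, not an SDK function) and demonstrates it on a simulated stream; streaming additionally lets you measure time to first token (TTFT).

```python
import time


def measure_stream(chunks):
    """Consume an iterable of text chunks, timing first token and overall throughput."""
    start = time.perf_counter()
    first_token_at = None
    words = 0
    for text in chunks:
        if text:  # streaming chunks can have empty/None content
            if first_token_at is None:
                first_token_at = time.perf_counter()
            words += len(text.split())  # rough token proxy, as in the examples above
    elapsed = time.perf_counter() - start
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tokens": words,
        "tokens_per_second": words / elapsed if elapsed > 0 else 0.0,
    }


# Simulated stream standing in for:
#   (c.choices[0].delta.content for c in client.chat.completions.create(..., stream=True))
def fake_stream():
    for piece in ["Quantum ", "computing ", "is fast."]:
        time.sleep(0.01)  # pretend network delay between chunks
        yield piece


stats = measure_stream(fake_stream())
print(f"TTFT: {stats['ttft_s']:.3f}s, {stats['tokens_per_second']:.1f} tok/s")
```

TTFT and sustained throughput answer different questions: TTFT dominates perceived latency in interactive use, while tokens per second matters for long generations.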
Troubleshooting
- If you get authentication errors, verify that `CEREBRAS_API_KEY` is set correctly.
- Timeouts may occur if the request size is too large; reduce `max_tokens`.
- Counting tokens by splitting on spaces is only an approximation; use a tokenizer for precise measurement.
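Besides a tokenizer, OpenAI-compatible chat responses generally carry a server-reported count in `response.usage.completion_tokens`, which is exact when present. The helper below (illustrative, and assuming the Cerebras endpoint populates `usage`) prefers that field and falls back to the word split; it is demonstrated here on a stand-in object shaped like a response, since no API call is made.

```python
from types import SimpleNamespace


def count_output_tokens(response):
    """Prefer the server-reported completion token count; fall back to word split."""
    usage = getattr(response, "usage", None)
    if usage is not None and getattr(usage, "completion_tokens", None):
        return usage.completion_tokens
    # Fallback: rough approximation, same as in the benchmark examples
    return len(response.choices[0].message.content.split())


# Stand-in object mimicking the shape of a chat.completions response
fake = SimpleNamespace(
    usage=SimpleNamespace(completion_tokens=42),
    choices=[SimpleNamespace(message=SimpleNamespace(content="four words right here"))],
)
print(count_output_tokens(fake))  # 42 — server-reported count wins over the 4-word fallback
```

Using the exact count matters most for short completions, where a word-split approximation can be off by a large relative margin.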
Key Takeaways
- Use the OpenAI SDK with `base_url="https://api.cerebras.ai/v1"` to access Cerebras models.
- Measure tokens per second by timing `chat.completions.create` calls and counting output tokens.
- Async and streaming calls provide flexible benchmarking options for different use cases.