Groq tokens per second benchmark
Quick answer
To benchmark Groq tokens per second, use the OpenAI SDK with base_url="https://api.groq.com/openai/v1" and measure the time taken to generate tokens from a prompt. Throughput depends on the model you choose (for example, llama-3.3-70b-versatile) and on request parameters. Benchmarking involves timing the chat.completions.create call and dividing the number of tokens generated by the elapsed time.
Prerequisites
- Python 3.8+
- Groq API key
- pip install openai>=1.0
Setup
Install the openai Python package and set your Groq API key as an environment variable. Use the OpenAI-compatible client with Groq's base URL.
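For example, in a POSIX shell the key can be exported before running the scripts below (the key value shown is a placeholder, not a real credential):

```shell
# Export your Groq API key so os.environ["GROQ_API_KEY"] can read it
export GROQ_API_KEY="gsk_your_key_here"
```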
pip install openai>=1.0
Step by step
This example measures tokens per second by timing a chat.completions.create request to Groq's llama-3.3-70b-versatile model. It prints the total tokens generated and the throughput.
import os
import time
from openai import OpenAI
client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
messages = [{"role": "user", "content": "Explain the benefits of AI in healthcare."}]
start_time = time.time()
response = client.chat.completions.create(
model="llama-3.3-70b-versatile",
messages=messages,
max_tokens=256
)
end_time = time.time()
text = response.choices[0].message.content
usage = response.usage
elapsed = end_time - start_time
tokens_generated = usage.completion_tokens  # count only generated tokens, not prompt tokens
print(f"Generated tokens: {tokens_generated}")
print(f"Elapsed time (seconds): {elapsed:.2f}")
print(f"Tokens per second: {tokens_generated / elapsed:.2f}")
Output
Generated tokens: 256
Elapsed time (seconds): 2.50
Tokens per second: 102.40
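When aggregating several runs, it helps to keep prompt tokens separate from completion tokens, since only the latter reflect generation speed. A minimal helper might look like the sketch below (the function names and the averaging over repeated runs are this example's own conventions, not part of the Groq API):

```python
def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generation throughput: completion tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds

def average_throughput(runs: list[tuple[int, float]]) -> float:
    """Average throughput over (completion_tokens, elapsed_seconds) pairs."""
    return sum(tokens_per_second(t, e) for t, e in runs) / len(runs)

# Example: three benchmark runs, each a (completion_tokens, elapsed) pair
print(f"{average_throughput([(256, 2.0), (256, 2.56), (250, 2.5)]):.2f}")
```

Averaging over several runs smooths out network jitter, which can dominate a single measurement.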
Common variations
- Use stream=True to measure tokens per second during streaming.
- Try smaller models such as llama-3.1-8b-instant for faster throughput.
- Benchmark with different prompt lengths and max_tokens settings.
import asyncio
from openai import AsyncOpenAI

async def async_benchmark():
    # An awaited call requires the async client, not the sync OpenAI class.
    client = AsyncOpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Summarize AI trends in 2026."}]
    start = asyncio.get_running_loop().time()
    stream = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        max_tokens=128,
        stream=True
    )
    token_count = 0
    async for chunk in stream:
        # Each streamed chunk carries roughly one token, so counting
        # non-empty deltas approximates the number of generated tokens.
        if chunk.choices and chunk.choices[0].delta.content:
            token_count += 1
    end = asyncio.get_running_loop().time()
    print(f"Streamed tokens: {token_count}")
    print(f"Elapsed time (seconds): {end - start:.2f}")
    print(f"Tokens per second: {token_count / (end - start):.2f}")
# To run: asyncio.run(async_benchmark())
Output
Streamed tokens: 140
Elapsed time (seconds): 1.80
Tokens per second: 77.78
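Streaming also lets you separate time-to-first-token (latency) from steady-state throughput. Here is a sketch that works with any chunk iterable; the helper name measure_ttft and the one-token-per-chunk approximation are assumptions of this example, not part of the SDK:

```python
import time

def measure_ttft(chunks):
    """Return (seconds_to_first_chunk, total_chunks) for any chunk iterable.

    Intended for the text deltas yielded by a streaming response; each
    chunk is treated as roughly one token.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    return first, count

# Usage with a fake stream that delays before the first chunk:
def fake_stream():
    time.sleep(0.05)
    for piece in ["Hello", ",", " world"]:
        yield piece

ttft, n = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s over {n} chunks")
```

In a real benchmark you would pass the streaming response object in place of fake_stream().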
Troubleshooting
- If you get authentication errors, verify your GROQ_API_KEY environment variable is set correctly.
- Timeouts may occur on large token requests; reduce max_tokens or use a smaller model.
- Network latency inflates elapsed time, so it affects the accuracy of tokens-per-second measurements.
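For transient failures during long benchmark runs, a simple retry wrapper with exponential backoff can help. This is a generic sketch; the retry_call helper is this example's own, and the OpenAI client also accepts its own max_retries option at construction time:

```python
import time

def retry_call(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch: retry_call(lambda: client.chat.completions.create(...))
```

Keep retried runs out of your throughput averages, since backoff delays would skew the numbers.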
Key Takeaways
- Use the OpenAI SDK with Groq's base_url to benchmark tokens per second.
- Measure elapsed time and completion tokens from the response usage to calculate generation throughput.
- Streaming mode provides real-time token generation rates for more granular benchmarking.