How-to · Beginner · 3 min read

Groq tokens per second benchmark

Quick answer
To benchmark Groq tokens per second, use the OpenAI SDK with base_url="https://api.groq.com/openai/v1", time the chat.completions.create call, and divide the number of generated tokens by the elapsed time. Throughput depends on the model (for example llama-3.3-70b-versatile) and on request parameters such as max_tokens.

PREREQUISITES

  • Python 3.8+
  • Groq API key
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your Groq API key as an environment variable. Use the OpenAI-compatible client with Groq's base URL.

bash
pip install "openai>=1.0"
export GROQ_API_KEY="your-api-key"

Step by step

This example measures tokens per second by timing a chat.completions.create request to Groq's llama-3.3-70b-versatile model. It prints the total tokens generated and the throughput.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")

messages = [{"role": "user", "content": "Explain the benefits of AI in healthcare."}]

start_time = time.time()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
    max_tokens=256
)
end_time = time.time()

text = response.choices[0].message.content
usage = response.usage

elapsed = end_time - start_time
# completion_tokens counts only generated tokens; total_tokens would
# also include the prompt and inflate the throughput figure.
tokens_generated = usage.completion_tokens

print(f"Generated tokens: {tokens_generated}")
print(f"Elapsed time (seconds): {elapsed:.2f}")
print(f"Tokens per second: {tokens_generated / elapsed:.2f}")
output
Generated tokens: 256
Elapsed time (seconds): 2.50
Tokens per second: 102.40

Common variations

  • Use stream=True to measure tokens per second during streaming.
  • Try smaller models like llama-3.1-8b-instant for faster throughput.
  • Benchmark with different prompt lengths and max_tokens settings.
python
import asyncio
import os
import time
from openai import AsyncOpenAI

async def async_benchmark():
    # Streaming with async iteration requires the AsyncOpenAI client,
    # not the synchronous OpenAI client.
    client = AsyncOpenAI(api_key=os.environ["GROQ_API_KEY"],
                         base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Summarize AI trends in 2026."}]

    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        max_tokens=128,
        stream=True
    )

    token_count = 0
    async for chunk in stream:
        # Count chunks that carry text; each chunk is roughly one token.
        if chunk.choices and chunk.choices[0].delta.content:
            token_count += 1
    end = time.perf_counter()

    print(f"Streamed chunks (~tokens): {token_count}")
    print(f"Elapsed time (seconds): {end - start:.2f}")
    print(f"Tokens per second: {token_count / (end - start):.2f}")

# To run: asyncio.run(async_benchmark())
output
Streamed chunks (~tokens): 128
Elapsed time (seconds): 1.80
Tokens per second: 71.11

Troubleshooting

  • If you get authentication errors, verify your GROQ_API_KEY environment variable is set correctly.
  • Timeouts may occur on large token requests; reduce max_tokens or use smaller models.
  • Network latency is part of the measured elapsed time, so a single non-streaming request understates the model's raw generation rate; use streaming to separate time to first token from generation time.
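For the timeout issue above, the OpenAI SDK's standard client options can bound request time and retry transient failures; the values shown are illustrative, not recommendations:

```python
import os
from openai import OpenAI

# timeout and max_retries are standard OpenAI SDK client options.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
    timeout=30.0,    # seconds before a request is abandoned
    max_retries=2,   # automatic retries on connection errors
)
```

Note that retried requests should be excluded from throughput numbers, since their elapsed time includes the failed attempts.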

Key Takeaways

  • Use the OpenAI SDK with Groq's base_url to benchmark tokens per second.
  • Measure elapsed time and total tokens from the response usage to calculate throughput.
  • Streaming mode provides real-time token generation rates for more granular benchmarking.
Verified 2026-04 · llama-3.3-70b-versatile, llama-3.1-8b-instant