How-to · Beginner · 3 min read

Groq for real-time AI applications

Quick answer
Use the openai Python SDK with base_url="https://api.groq.com/openai/v1" and your Groq API key to access Groq's low-latency models like llama-3.3-70b-versatile. Enable stream=True in chat.completions.create for real-time token streaming and fast response handling.

PREREQUISITES

  • Python 3.8+
  • Groq API key
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as a redirect)

Setup

Install the openai Python package and set your Groq API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Export your API key: export GROQ_API_KEY="your_api_key_here" on Linux/macOS, or set it through Environment Variables on Windows. A quick sanity check follows the install command below.
bash
pip install openai
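
Before making any API calls, it can help to confirm the key is actually visible to Python; a minimal sanity check:

python
import os

# Fail fast with a clear message if the key is missing from the environment.
api_key = os.environ.get("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("GROQ_API_KEY is not set; export it before running this script.")
print(f"GROQ_API_KEY found ({len(api_key)} characters)")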

Step by step

This example demonstrates a synchronous call to Groq's chat completion endpoint with streaming enabled for real-time token output.

python
import os
from openai import OpenAI

# Point the OpenAI SDK at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

messages = [
    {"role": "user", "content": "Explain the benefits of real-time AI applications."}
]

# stream=True returns an iterator of chunks instead of one full response.
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
    stream=True,
)

print("Response:")
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on the final chunk.
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
output
Response:
Real-time AI applications enable instant processing and response generation, improving user experience and enabling dynamic decision-making in critical systems.
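
If you also need the complete text after streaming finishes (for logging or post-processing, say), a common pattern is to accumulate the deltas while printing. A minimal sketch, reusing the same client setup and prompt as above:

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")

# Collect each streamed delta so the full response is available afterwards.
parts = []
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain the benefits of real-time AI applications."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    parts.append(delta)
    print(delta, end="", flush=True)
print()

full_text = "".join(parts)  # the complete response, e.g. for logging or caching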

Common variations

You can use a different Groq model by changing the model parameter, for example llama-3.1-8b-instant for a smaller, faster model. For asynchronous applications, use the SDK's AsyncOpenAI client with async/await; streaming is supported in both sync and async modes for low-latency token delivery.

python
import asyncio
import os
from openai import AsyncOpenAI

async def async_stream():
    # AsyncOpenAI exposes the same interface as OpenAI, with awaitable methods.
    client = AsyncOpenAI(
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",
    )
    messages = [{"role": "user", "content": "Summarize the latest AI trends."}]
    stream = await client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=messages,
        stream=True,
    )
    print("Async response:")
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

asyncio.run(async_stream())
output
Async response:
AI trends include foundation models, multimodal learning, and real-time inference for interactive applications.
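
Since real-time applications care most about latency to the first visible token, it can also be useful to measure time-to-first-token directly. A minimal sketch, assuming the same client setup as the examples above (the prompt here is just a placeholder):

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    if first_token_at is None and chunk.choices[0].delta.content:
        # Record latency when the first non-empty delta arrives.
        first_token_at = time.perf_counter() - start
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
if first_token_at is not None:
    print(f"Time to first token: {first_token_at:.3f}s")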

Troubleshooting

  • Authentication errors: Ensure the GROQ_API_KEY environment variable is set in the shell that launches your script and that the key is valid; see the error-handling sketch after this list.
  • Timeouts or slow responses: Switch to a smaller model such as llama-3.1-8b-instant for faster inference, or raise the client timeout.
  • Streaming issues: Confirm you passed stream=True and that you iterate the result correctly (for with the sync client, async for with the async client).
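
A minimal sketch of handling these failure modes, assuming the exception classes and client options (timeout, max_retries) exported by openai>=1.0; adapt the model and messages to your use case:

python
import os
from openai import OpenAI, AuthenticationError, APITimeoutError, RateLimitError

# timeout and max_retries are standard client options in openai>=1.0.
client = OpenAI(
    api_key=os.environ.get("GROQ_API_KEY", ""),
    base_url="https://api.groq.com/openai/v1",
    timeout=30.0,
    max_retries=2,
)

try:
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": "Ping"}],
    )
    print(response.choices[0].message.content)
except AuthenticationError:
    print("Check that GROQ_API_KEY is set and valid.")
except APITimeoutError:
    print("Request timed out; try a smaller model or raise the timeout.")
except RateLimitError:
    print("Rate limited; back off before retrying.")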

Key Takeaways

  • Use the OpenAI SDK with Groq's base_url for seamless integration.
  • Enable streaming for real-time token-level responses in AI applications.
  • Choose models based on latency and performance needs for real-time use cases.
Verified 2026-04 · llama-3.3-70b-versatile, llama-3.1-8b-instant