Code beginner · 3 min read

How to use Groq API in Python

Direct answer
Use the openai Python SDK with base_url="https://api.groq.com/openai/v1" and your GROQ_API_KEY to call client.chat.completions.create() with the desired Groq model and messages.

Setup

Install
bash
pip install openai
Env vars
GROQ_API_KEY
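The key can be exported in the shell before running your script; the value below is a placeholder, not a real key:

```shell
# Make the Groq API key available to the current shell session (placeholder value)
export GROQ_API_KEY="gsk_your_key_here"
```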
Imports
python
from openai import OpenAI
import os

Examples

In: Hello, how are you?
Out: Hi! I'm Groq's AI model, ready to assist you.
In: Write a Python function to reverse a string.
Out: Here's a Python function to reverse a string: def reverse_string(s): return s[::-1]
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be both 0 and 1 simultaneously, enabling faster problem solving for certain tasks.

Integration steps

  1. Install the OpenAI Python SDK and set the GROQ_API_KEY environment variable.
  2. Import OpenAI and initialize the client with your API key and Groq base URL.
  3. Prepare the chat messages array with roles and content.
  4. Call client.chat.completions.create() with the Groq model and messages.
  5. Extract the response text from response.choices[0].message.content.
  6. Use or display the generated text as needed.

Full code

python
from openai import OpenAI
import os

def main():
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Hello, how are you?"}]
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print("Response:", response.choices[0].message.content)

if __name__ == "__main__":
    main()
output
Response: Hi! I'm Groq's AI model, ready to assist you.

API trace

Request
json
{"model": "llama-3.3-70b-versatile", "messages": [{"role": "user", "content": "Hello, how are you?"}]}
Response
json
{"choices": [{"message": {"content": "Hi! I'm Groq's AI model, ready to assist you."}}], "usage": {"prompt_tokens": 10, "completion_tokens": 12, "total_tokens": 22}}
Extract: response.choices[0].message.content
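The content and usage fields above can be read with plain attribute access. A minimal offline sketch, where SimpleNamespace stands in for the real ChatCompletion object so the snippet runs without an API call:

```python
from types import SimpleNamespace

# Mock of the response shape shown in the trace above (offline stand-in).
response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(
        content="Hi! I'm Groq's AI model, ready to assist you."))],
    usage=SimpleNamespace(prompt_tokens=10, completion_tokens=12, total_tokens=22),
)

text = response.choices[0].message.content  # the generated reply
tokens = response.usage.total_tokens        # billed tokens for the call
print(text)
print("Total tokens:", tokens)
```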

Variants

Streaming response

Use streaming to display partial results as they arrive for better user experience with long outputs.

python
from openai import OpenAI
import os

def main():
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Tell me a story."}]
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        stream=True
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

if __name__ == "__main__":
    main()
Async version

Use async calls when integrating the Groq API into asynchronous Python applications that need concurrency.

python
import asyncio
from openai import AsyncOpenAI
import os

async def main():
    # AsyncOpenAI is the async client; the sync client has no acreate() method.
    client = AsyncOpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Explain AI."}]
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print("Response:", response.choices[0].message.content)

if __name__ == "__main__":
    asyncio.run(main())
Alternative model

Use smaller or specialized Groq models like mixtral-8x7b-32768 for faster responses or cost savings.

python
from openai import OpenAI
import os

def main():
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Summarize the latest tech news."}]
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=messages
    )
    print("Summary:", response.choices[0].message.content)

if __name__ == "__main__":
    main()

Performance

Latency: ~700ms for llama-3.3-70b-versatile non-streaming calls
Cost: ~$0.003 per 500 tokens exchanged
Rate limits: Tier 1: 600 RPM / 36,000 TPM
  • Use concise prompts to reduce token usage.
  • Limit max_tokens in completions to control output length.
  • Reuse context efficiently by summarizing prior conversation.
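The bullets above map onto request parameters. A sketch of building the request options with an explicit max_tokens cap (build_request is an illustrative helper, not part of the SDK); pass the resulting dict as client.chat.completions.create(**req):

```python
def build_request(prompt: str, max_tokens: int = 60) -> dict:
    """Assemble chat-completion kwargs with a hard cap on output length."""
    return {
        "model": "llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,   # caps generated tokens, keeps cost predictable
        "temperature": 0.2,         # lower randomness tends to give terser answers
    }

req = build_request("Summarize HTTP in one sentence.")
print(req["max_tokens"])
```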
Approach | Latency | Cost/call | Best for
Standard call | ~700ms | ~$0.003 | General-purpose chat completions
Streaming call | ~700ms initial + incremental | ~$0.003 | Long responses with better UX
Async call | ~700ms | ~$0.003 | Concurrent or event-driven apps
Smaller model (mixtral-8x7b-32768) | ~400ms | ~$0.0015 | Faster, cost-effective tasks

Quick tip

Always specify the Groq base_url when initializing the OpenAI client to ensure requests route to Groq's API endpoint.

Common mistake

Forgetting to set the base_url to Groq's endpoint causes requests to default to OpenAI's API and fail authentication.
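One defensive pattern is to fail fast when the key is missing and pin the base URL in one place. A sketch under these assumptions (groq_client_kwargs and GROQ_BASE_URL are illustrative names, not SDK values); the returned dict is passed as OpenAI(**kwargs):

```python
import os

GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def groq_client_kwargs() -> dict:
    """Return kwargs for OpenAI(...), failing fast if the key is missing
    so a misconfigured client never silently hits the wrong endpoint."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return {"api_key": key, "base_url": GROQ_BASE_URL}

os.environ.setdefault("GROQ_API_KEY", "gsk_placeholder")  # demo only
kwargs = groq_client_kwargs()
print(kwargs["base_url"])
```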

Verified 2026-04 · llama-3.3-70b-versatile, mixtral-8x7b-32768