Code beginner · 3 min read

How to use Groq API in Python

Direct answer
Use the openai Python SDK with base_url="https://api.groq.com/openai/v1" and your GROQ_API_KEY to call client.chat.completions.create() with the desired Groq model and messages.

Setup

Install
bash
pip install openai
Env vars
GROQ_API_KEY
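The key can be exported in the shell before running your script; the value below is a placeholder, not a real key:

```shell
# Make the Groq API key available to the current shell session (placeholder value)
export GROQ_API_KEY="gsk_your_key_here"
```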
Imports
python
from openai import OpenAI
import os

Examples

In: Hello, how are you?
Out: Hi! I'm Groq's AI model, ready to assist you.
In: Write a Python function to reverse a string.
Out: Here's a Python function to reverse a string: def reverse_string(s): return s[::-1]
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be both 0 and 1 simultaneously, enabling faster problem solving for certain tasks.

Integration steps

  1. Install the OpenAI Python SDK and set the GROQ_API_KEY environment variable.
  2. Import OpenAI and initialize the client with your API key and Groq base URL.
  3. Prepare the chat messages array with roles and content.
  4. Call client.chat.completions.create() with the Groq model and messages.
  5. Extract the response text from response.choices[0].message.content.
  6. Use or display the generated text as needed.

Full code

python
from openai import OpenAI
import os

def main():
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Hello, how are you?"}]
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print("Response:", response.choices[0].message.content)

if __name__ == "__main__":
    main()
output
Response: Hi! I'm Groq's AI model, ready to assist you.

API trace

Request
json
{"model": "llama-3.3-70b-versatile", "messages": [{"role": "user", "content": "Hello, how are you?"}]}
Response
json
{"choices": [{"message": {"content": "Hi! I'm Groq's AI model, ready to assist you."}}], "usage": {"prompt_tokens": 10, "completion_tokens": 12, "total_tokens": 22}}
Extract: response.choices[0].message.content
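The content and usage fields above can be read with plain attribute access. A minimal offline sketch, where SimpleNamespace stands in for the real ChatCompletion object so the snippet runs without an API call:

```python
from types import SimpleNamespace

# Mock of the response shape shown in the trace above (offline stand-in).
response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(
        content="Hi! I'm Groq's AI model, ready to assist you."))],
    usage=SimpleNamespace(prompt_tokens=10, completion_tokens=12, total_tokens=22),
)

text = response.choices[0].message.content  # the generated reply
tokens = response.usage.total_tokens        # billed tokens for the call
print(text)
print("Total tokens:", tokens)
```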

Variants

Streaming response

Use streaming to display partial results as they arrive for better user experience with long outputs.

python
from openai import OpenAI
import os

def main():
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Tell me a story."}]
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        stream=True
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

if __name__ == "__main__":
    main()
Async version

Use async calls when integrating the Groq API into asynchronous Python applications that need concurrency.

python
import asyncio
from openai import AsyncOpenAI
import os

async def main():
    # AsyncOpenAI is the async client; the sync client has no acreate() method.
    client = AsyncOpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Explain AI."}]
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print("Response:", response.choices[0].message.content)

if __name__ == "__main__":
    asyncio.run(main())
Alternative model

Use smaller or specialized Groq models like mixtral-8x7b-32768 for faster responses or cost savings.

python
from openai import OpenAI
import os

def main():
    client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Summarize the latest tech news."}]
    response = client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=messages
    )
    print("Summary:", response.choices[0].message.content)

if __name__ == "__main__":
    main()

Performance

Latency: ~700ms for llama-3.3-70b-versatile non-streaming calls
Cost: ~$0.003 per 500 tokens exchanged
Rate limits: Tier 1: 600 RPM / 36,000 TPM
  • Use concise prompts to reduce token usage.
  • Limit max_tokens in completions to control output length.
  • Reuse context efficiently by summarizing prior conversation.
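The bullets above map onto request parameters. A sketch of building the request options with an explicit max_tokens cap (build_request is an illustrative helper, not part of the SDK); pass the resulting dict as client.chat.completions.create(**req):

```python
def build_request(prompt: str, max_tokens: int = 60) -> dict:
    """Assemble chat-completion kwargs with a hard cap on output length."""
    return {
        "model": "llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,   # caps generated tokens, keeps cost predictable
        "temperature": 0.2,         # lower randomness tends to give terser answers
    }

req = build_request("Summarize HTTP in one sentence.")
print(req["max_tokens"])
```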
Approach | Latency | Cost/call | Best for
Standard call | ~700ms | ~$0.003 | General-purpose chat completions
Streaming call | ~700ms initial + incremental | ~$0.003 | Long responses with better UX
Async call | ~700ms | ~$0.003 | Concurrent or event-driven apps
Smaller model (mixtral-8x7b-32768) | ~400ms | ~$0.0015 | Faster, cost-effective tasks

Quick tip

Always specify the Groq base_url when initializing the OpenAI client to ensure requests route to Groq's API endpoint.

Common mistake

Forgetting to set the base_url to Groq's endpoint causes requests to default to OpenAI's API and fail authentication.
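One defensive pattern is to fail fast when the key is missing and pin the base URL in one place. A sketch under these assumptions (groq_client_kwargs and GROQ_BASE_URL are illustrative names, not SDK values); the returned dict is passed as OpenAI(**kwargs):

```python
import os

GROQ_BASE_URL = "https://api.groq.com/openai/v1"

def groq_client_kwargs() -> dict:
    """Return kwargs for OpenAI(...), failing fast if the key is missing
    so a misconfigured client never silently hits the wrong endpoint."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return {"api_key": key, "base_url": GROQ_BASE_URL}

os.environ.setdefault("GROQ_API_KEY", "gsk_placeholder")  # demo only
kwargs = groq_client_kwargs()
print(kwargs["base_url"])
```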

Verified 2026-04 · llama-3.3-70b-versatile, mixtral-8x7b-32768