How to call Llama API in Python
Direct answer
Use the OpenAI Python SDK pointed at your Llama provider's OpenAI-compatible endpoint: initialize the client with the provider's base_url and an API key read from os.environ, then call client.chat.completions.create() with a model such as llama-3.3-70b-versatile and your messages.
Setup
Install
pip install openai
Env vars
GROQ_API_KEY
Imports
import os
from openai import OpenAI
Examples
In: Hello, how do I use Llama models?
Out: Llama models are accessed via OpenAI-compatible APIs from providers like Groq or Together AI.
In: Generate a Python function to add two numbers.
Out:
def add(a, b):
    return a + b
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling faster problem solving for certain tasks.
Integration steps
- Install the OpenAI Python SDK with pip.
- Set your Llama provider API key in the environment variable (e.g., GROQ_API_KEY).
- Import OpenAI and initialize the client with the API key and base_url for your Llama provider.
- Build the chat messages array with roles and content.
- Call client.chat.completions.create() with the Llama model and messages.
- Extract the response text from response.choices[0].message.content.
Full code
```python
import os
from openai import OpenAI

# Initialize client with Llama provider API key and base_url
client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")

messages = [
    {"role": "user", "content": "Write a Python function to add two numbers."}
]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages
)

print("Response:", response.choices[0].message.content)
```
Output
Response: def add(a, b):
    return a + b
API trace
Request
{"model": "llama-3.3-70b-versatile", "messages": [{"role": "user", "content": "Write a Python function to add two numbers."}]}
Response
{"choices": [{"message": {"content": "def add(a, b):\n return a + b"}}], "usage": {"total_tokens": 25}}
Extract
response.choices[0].message.content
Variants
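The extract step returns only the reply text, but the same response object also reports token usage. A minimal helper can pull both out (a sketch; `extract_reply` is an illustrative name, not part of the SDK, and `usage` is guarded because it can be absent on some responses):

```python
def extract_reply(response):
    """Return (reply_text, total_tokens) from a chat completion response."""
    text = response.choices[0].message.content
    # usage can be None (e.g. on some streaming responses), so guard the access
    tokens = response.usage.total_tokens if response.usage else None
    return text, tokens
```

This keeps the attribute-path details in one place if you log usage for cost tracking.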
Streaming response ›
Use streaming to display partial results immediately for long responses or better user experience.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
messages = [{"role": "user", "content": "Explain recursion in Python."}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
    stream=True
)

# Each chunk carries an incremental delta; delta is an object, not a dict,
# and its content can be None on some chunks
for chunk in response:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    print(delta or "", end="")
print()
```
Async call with asyncio ›
Use async calls to handle multiple concurrent requests efficiently in asynchronous applications.
```python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI provides awaitable methods; acreate() no longer exists in openai>=1.0
    client = AsyncOpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Summarize the benefits of AI."}]
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```
Alternative provider: Together AI ›
Use this variant if you prefer Together AI's Llama hosting or want a different model variant.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
messages = [{"role": "user", "content": "Generate a haiku about spring."}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=messages
)

print(response.choices[0].message.content)
```
Performance
Latency: ~1.5 to 3 seconds for llama-3.3-70b non-streaming calls
Cost: roughly $0.60 to $0.90 per million tokens (well under a cent per 1,000 tokens) for hosted llama-3.3-70b; check provider pricing
Rate limits: Typically 60 RPM and 60,000 TPM on default tiers; check provider docs
- Limit prompt length to reduce token usage.
- Use concise system and user messages.
- Prefer smaller Llama variants if latency or cost is critical.
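The first two tips amount to capping how much history goes into each request. One way to sketch that is a rough character-based token estimate (~4 characters per token is a common English-text heuristic; `estimate_tokens` and `trim_history` are illustrative names, not SDK functions):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens=4000):
    """Keep the most recent messages that fit within a token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # newest first
        cost = estimate_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

For exact counts you would use the model's real tokenizer, but a heuristic like this is often enough to keep prompts inside a budget.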
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call | ~1.5-3s | <$0.001/1K tokens | General purpose, simple integration |
| Streaming | Starts immediately, total ~1.5-3s | Same as standard | Interactive apps needing fast partial output |
| Async call | ~1.5-3s per call, concurrent | <$0.001/1K tokens | High concurrency or async frameworks |
Quick tip
Always use environment variables for your API keys and specify the provider's base_url when calling Llama models via OpenAI-compatible SDKs.
Common mistake
Forgetting to set base_url means the SDK defaults to api.openai.com, so Llama model names fail with authentication or model-not-found errors.
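One way to avoid both mistakes is to resolve the key and base_url together and fail fast with a clear message. A minimal sketch (the `PROVIDERS` table and `provider_config` helper are illustrative, not part of any SDK):

```python
import os

# Known OpenAI-compatible Llama providers: (env var name, base URL)
PROVIDERS = {
    "groq": ("GROQ_API_KEY", "https://api.groq.com/openai/v1"),
    "together": ("TOGETHER_API_KEY", "https://api.together.xyz/v1"),
}

def provider_config(provider="groq"):
    """Return (api_key, base_url) for a provider, raising if the key is unset."""
    env_var, base_url = PROVIDERS[provider]
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before calling the {provider} API")
    return api_key, base_url
```

Pass the returned pair to `OpenAI(api_key=..., base_url=...)` so the client can never be built with a missing key or the wrong endpoint.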