How to call Llama API in Python
Direct answer
Use the OpenAI Python SDK pointed at your Llama provider's OpenAI-compatible endpoint: initialize the client with the provider's base_url and an API key read from os.environ, then call client.chat.completions.create() with a model such as llama-3.3-70b-versatile and your messages.
Setup
Install
pip install openai
Env vars
GROQ_API_KEY
Imports
import os
from openai import OpenAI
Examples
In: Hello, how do I use Llama models?
Out: Llama models are accessed via OpenAI-compatible APIs from providers like Groq or Together AI.
In: Generate a Python function to add two numbers.
Out:
def add(a, b):
    return a + b
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling faster problem solving for certain tasks.
Integration steps
- Install the OpenAI Python SDK with pip.
- Set your Llama provider API key in the environment variable (e.g., GROQ_API_KEY).
- Import OpenAI and initialize the client with the API key and base_url for your Llama provider.
- Build the chat messages array with roles and content.
- Call client.chat.completions.create() with the Llama model and messages.
- Extract the response text from response.choices[0].message.content.
Full code
```python
import os
from openai import OpenAI

# Initialize client with Llama provider API key and base_url
client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")

messages = [
    {"role": "user", "content": "Write a Python function to add two numbers."}
]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages
)

print("Response:", response.choices[0].message.content)
```
Output
Response: def add(a, b):
    return a + b
API trace
Request
{"model": "llama-3.3-70b-versatile", "messages": [{"role": "user", "content": "Write a Python function to add two numbers."}]}
Response
{"choices": [{"message": {"content": "def add(a, b):\n return a + b"}}], "usage": {"total_tokens": 25}}
Extract
response.choices[0].message.content
Variants
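The extract step returns only the reply text, but the same response object also reports token usage. A minimal helper can pull both out (a sketch; `extract_reply` is an illustrative name, not part of the SDK, and `usage` is guarded because it can be absent on some responses):

```python
def extract_reply(response):
    """Return (reply_text, total_tokens) from a chat completion response."""
    text = response.choices[0].message.content
    # usage can be None (e.g. on some streaming responses), so guard the access
    tokens = response.usage.total_tokens if response.usage else None
    return text, tokens
```

This keeps the attribute-path details in one place if you log usage for cost tracking.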
Streaming response ›
Use streaming to display partial results immediately for long responses or better user experience.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
messages = [{"role": "user", "content": "Explain recursion in Python."}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=messages,
    stream=True
)

# Each chunk carries an incremental delta; delta is an object, not a dict,
# and its content can be None on some chunks
for chunk in response:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    print(delta or "", end="")
print()
```
Async call with asyncio ›
Use async calls to handle multiple concurrent requests efficiently in asynchronous applications.
```python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI provides awaitable methods; acreate() no longer exists in openai>=1.0
    client = AsyncOpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1")
    messages = [{"role": "user", "content": "Summarize the benefits of AI."}]
    response = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```
Alternative provider: Together AI ›
Use this variant if you prefer Together AI's Llama hosting or want a different model variant.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
messages = [{"role": "user", "content": "Generate a haiku about spring."}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=messages
)

print(response.choices[0].message.content)
```
Performance
Latency: ~1.5 to 3 seconds for llama-3.3-70b non-streaming calls
Cost: roughly $0.60 to $0.90 per million tokens (well under a cent per 1,000 tokens) for hosted llama-3.3-70b; check provider pricing
Rate limits: Typically 60 RPM and 60,000 TPM on default tiers; check provider docs
- Limit prompt length to reduce token usage.
- Use concise system and user messages.
- Prefer smaller Llama variants if latency or cost is critical.
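The first two tips amount to capping how much history goes into each request. One way to sketch that is a rough character-based token estimate (~4 characters per token is a common English-text heuristic; `estimate_tokens` and `trim_history` are illustrative names, not SDK functions):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens=4000):
    """Keep the most recent messages that fit within a token budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # newest first
        cost = estimate_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

For exact counts you would use the model's real tokenizer, but a heuristic like this is often enough to keep prompts inside a budget.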
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call | ~1.5-3s | <$0.001/1K tokens | General purpose, simple integration |
| Streaming | Starts immediately, total ~1.5-3s | Same as standard | Interactive apps needing fast partial output |
| Async call | ~1.5-3s per call, concurrent | <$0.001/1K tokens | High concurrency or async frameworks |
Quick tip
Always use environment variables for your API keys and specify the provider's base_url when calling Llama models via OpenAI-compatible SDKs.
Common mistake
Forgetting to set base_url means the SDK defaults to api.openai.com, so Llama model names fail with authentication or model-not-found errors.
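One way to avoid both mistakes is to resolve the key and base_url together and fail fast with a clear message. A minimal sketch (the `PROVIDERS` table and `provider_config` helper are illustrative, not part of any SDK):

```python
import os

# Known OpenAI-compatible Llama providers: (env var name, base URL)
PROVIDERS = {
    "groq": ("GROQ_API_KEY", "https://api.groq.com/openai/v1"),
    "together": ("TOGETHER_API_KEY", "https://api.together.xyz/v1"),
}

def provider_config(provider="groq"):
    """Return (api_key, base_url) for a provider, raising if the key is unset."""
    env_var, base_url = PROVIDERS[provider]
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before calling the {provider} API")
    return api_key, base_url
```

Pass the returned pair to `OpenAI(api_key=..., base_url=...)` so the client can never be built with a missing key or the wrong endpoint.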