Code beginner · 3 min read

How to use Together AI API in Python

Direct answer
Use the openai Python SDK with base_url="https://api.together.xyz/v1" and your TOGETHER_API_KEY to call client.chat.completions.create() with your model and messages.

Setup

Install
bash
pip install openai
Env vars
TOGETHER_API_KEY
Imports
python
from openai import OpenAI
import os
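Before creating the client, it can help to fail fast if the key is missing. A minimal sketch; the helper name `require_api_key` is our own, not part of any SDK:

```python
import os

def require_api_key(name: str = "TOGETHER_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before creating the client.")
    return key
```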

Examples

In: Hello, how are you?
Out: I'm doing great, thanks for asking! How can I assist you today?
In: Write a Python function to reverse a string.
Out: Here's a Python function to reverse a string: `def reverse_string(s): return s[::-1]`
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be both 0 and 1 at the same time, enabling powerful computations beyond classical computers.
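Each example above corresponds to a messages list passed to the API; multi-turn conversations append earlier assistant replies so the model keeps context. A sketch of how the prompts map onto that structure (the system prompt text is illustrative):

```python
# Single-turn: one user message per request
single_turn = [{"role": "user", "content": "Hello, how are you?"}]

# Multi-turn: include earlier assistant replies so the model sees the conversation so far
multi_turn = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."},
    {"role": "assistant", "content": "def reverse_string(s): return s[::-1]"},
    {"role": "user", "content": "Now explain how the slice works."},
]
```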

Integration steps

  1. Install the OpenAI Python SDK with pip and set the TOGETHER_API_KEY environment variable.
  2. Import the OpenAI client and initialize it with your API key and Together AI base URL.
  3. Build the messages list with roles and content for the chat completion.
  4. Call client.chat.completions.create() with the Together AI model and messages.
  5. Extract the response text from response.choices[0].message.content.
  6. Use or display the generated text as needed.

Full code

python
from openai import OpenAI
import os

# Initialize Together AI client with API key and base URL
client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")

# Prepare chat messages
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Create chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=messages
)

# Extract and print the response text
print("Response:", response.choices[0].message.content)
output
Response: I'm doing great, thanks for asking! How can I assist you today?

API trace

Request
json
{"model": "meta-llama/Llama-3.3-70B-Instruct-Turbo", "messages": [{"role": "user", "content": "Hello, how are you?"}]}
Response
json
{"choices": [{"message": {"content": "I'm doing great, thanks for asking! How can I assist you today?"}}], "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30}}
Extract: response.choices[0].message.content
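The raw JSON above can be navigated the same way the SDK objects are. A stdlib-only sketch of pulling the reply and token counts out of a payload shaped like the trace:

```python
import json

# Response payload shaped like the API trace above
raw = """{"choices": [{"message": {"content": "I'm doing great, thanks for asking! How can I assist you today?"}}],
          "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30}}"""

payload = json.loads(raw)
reply = payload["choices"][0]["message"]["content"]  # same path as response.choices[0].message.content
tokens = payload["usage"]["total_tokens"]            # handy for tracking spend
print(reply)
print("total tokens:", tokens)
```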

Variants

Streaming chat completion

Use streaming to display partial results in real-time for better user experience with long responses.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")

messages = [{"role": "user", "content": "Tell me a story."}]

stream = client.chat.completions.create(model="meta-llama/Llama-3.3-70B-Instruct-Turbo", messages=messages, stream=True)

for chunk in stream:
    if not chunk.choices:
        continue  # some stream events carry no choices
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
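If you also need the full text once streaming finishes, collect the deltas as they arrive. A sketch using stand-in chunk objects (`SimpleNamespace` here only simulates the shape of the SDK's stream chunks):

```python
from types import SimpleNamespace

# Stand-ins for SDK stream chunks: each carries a partial "delta" of the reply
chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Once upon "))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="a time."))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),  # final chunk
]

parts = []
for chunk in chunks:
    delta = chunk.choices[0].delta.content or ""  # None on the final chunk becomes ""
    parts.append(delta)
full_text = "".join(parts)
print(full_text)
```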
Async chat completion

Use async calls when integrating into asynchronous applications or frameworks to improve concurrency.

python
import asyncio
from openai import AsyncOpenAI
import os

async def main():
    # Use the async client; the synchronous OpenAI client has no awaitable methods
    client = AsyncOpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
    messages = [{"role": "user", "content": "Explain AI."}]
    response = await client.chat.completions.create(model="meta-llama/Llama-3.3-70B-Instruct-Turbo", messages=messages)
    print("Async response:", response.choices[0].message.content)

asyncio.run(main())
Use smaller model for faster, cheaper calls

Use smaller models for lower latency and cost when high accuracy or detail is not critical.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
messages = [{"role": "user", "content": "Summarize the latest news."}]
response = client.chat.completions.create(model="meta-llama/Llama-3.1-8b-instruct", messages=messages)
print("Summary:", response.choices[0].message.content)

Performance

Latency: ~1-2 seconds per request for large models like Llama-3.3-70B-Instruct-Turbo
Cost: varies by model size; Together AI bills per token, so check the current Together AI pricing page for exact rates
Rate limits: default tier: 60 requests per minute, 100,000 tokens per day (check Together AI docs for updates)
  • Use smaller models for less token consumption and faster responses.
  • Limit prompt length by summarizing or truncating input.
  • Cache frequent queries to avoid repeated calls.
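The caching tip can be sketched with `functools.lru_cache`; `complete_once` below is a hypothetical stand-in for the real API call:

```python
from functools import lru_cache

CALLS = 0  # counts how many times the "API" is actually hit

@lru_cache(maxsize=256)
def complete_once(prompt: str) -> str:
    """Stand-in for client.chat.completions.create; swap the body for a real call."""
    global CALLS
    CALLS += 1
    return f"reply to: {prompt}"

# Repeated identical prompts hit the cache instead of the API
complete_once("Summarize the latest news.")
complete_once("Summarize the latest news.")
print("API calls made:", CALLS)
```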
Approach      | Latency                          | Cost/call        | Best for
Standard call | ~1-2s                            | baseline         | General purpose chat completions
Streaming     | starts immediately, total ~1-2s  | same as standard | Real-time UI with long outputs
Async call    | ~1-2s                            | same as standard | Concurrent or async apps
Smaller model | ~0.5-1s                          | lower            | Faster, cheaper, less detailed responses
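When a request hits the rate limit, retrying with exponential backoff is the usual remedy. A minimal sketch under stated assumptions: `flaky_call` is a hypothetical stand-in for the real request, and in production you would catch the SDK's rate-limit exception instead of `RuntimeError`:

```python
import time

def with_backoff(call, retries=3, base_delay=0.01):
    """Retry `call`, doubling the delay between attempts; re-raise after the last try."""
    for attempt in range(retries):
        try:
            return call()
        except RuntimeError:  # stand-in for the SDK's rate-limit error
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

attempts = {"n": 0}

def flaky_call():
    """Simulates a call that is rate-limited twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky_call))
```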

Quick tip

Always set the base_url to Together AI's endpoint and use your TOGETHER_API_KEY from environment variables to authenticate.

Common mistake

Forgetting to set base_url sends requests to OpenAI's API by default, which rejects your Together AI key with an authentication error.

Verified 2026-04 · meta-llama/Llama-3.3-70B-Instruct-Turbo, meta-llama/Llama-3.1-8b-instruct