How to optimize Together AI costs
Quick answer
To optimize Together AI costs, reserve large models like meta-llama/Llama-3.3-70B-Instruct-Turbo for tasks that truly need them, limit max_tokens, and batch requests to reduce overhead. Also, monitor usage and leverage caching to avoid redundant calls.

Prerequisites
- Python 3.8+
- Together AI API key
- pip install openai>=1.0
Setup
Install the openai Python package and set your TOGETHER_API_KEY environment variable before running the code.
pip install openai>=1.0

Step by step
Use the OpenAI-compatible SDK with base_url set to Together AI's endpoint. Limit max_tokens and choose appropriate models to reduce cost.
import os
from openai import OpenAI

# Point the OpenAI-compatible client at Together AI's endpoint
client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain RAG in 50 words."}],
    max_tokens=100,  # hard cap on billed completion tokens
)
print(response.choices[0].message.content)

Output
RAG (Retrieval-Augmented Generation) combines retrieval of relevant documents with generative models to produce accurate, context-aware responses by grounding generation in external knowledge.
Common variations
To further optimize costs, use smaller models like meta-llama/Llama-3.1-8B-Instruct for less complex tasks, and send prompts concurrently to cut wall-clock time. Note that async calls improve throughput but do not reduce token cost directly.
import asyncio
import os
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
    prompts = ["Summarize AI trends.", "Define semantic search."]
    # Issue the requests concurrently; each create() call returns a coroutine
    tasks = [
        client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=80,
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    for r in responses:
        print(r.choices[0].message.content)

asyncio.run(main())

Output
AI trends include increased use of foundation models, multimodal AI, and efficient fine-tuning techniques. Semantic search improves information retrieval by understanding intent and context rather than keyword matching.
Troubleshooting
If you encounter slow responses or high costs, check your max_tokens settings and reduce prompt length. Also, monitor your usage dashboard on Together AI to identify expensive calls. Use caching to avoid repeated queries.
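One way to implement the caching suggested above is to memoize completions on the (model, prompt, max_tokens) triple, so repeated identical queries never reach the API. This is a minimal sketch: fake_completion is a hypothetical stub standing in for the real client.chat.completions.create call, used here so the example runs offline.

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how many times the "API" is actually hit

def fake_completion(model: str, prompt: str, max_tokens: int) -> str:
    # Stand-in for a billed Together AI API call
    CALLS["count"] += 1
    return f"[{model}] answer to: {prompt}"

@lru_cache(maxsize=256)
def cached_completion(model: str, prompt: str, max_tokens: int) -> str:
    # Identical argument triples are served from the cache,
    # avoiding a second billed API call
    return fake_completion(model, prompt, max_tokens)

cached_completion("meta-llama/Llama-3.1-8B-Instruct", "Define RAG.", 80)
cached_completion("meta-llama/Llama-3.1-8B-Instruct", "Define RAG.", 80)
print("backend calls:", CALLS["count"])  # → backend calls: 1
```

Because lru_cache hashes the positional arguments, the cache only helps when prompts repeat exactly; for fuzzier reuse you would need a normalization step before the lookup.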
Key Takeaways
- Limit max_tokens to control token usage and reduce costs.
- Choose smaller models for simpler tasks to save on API expenses.
- Batch multiple prompts in async calls to improve throughput without increasing cost.
- Monitor usage regularly to identify and optimize expensive requests.
- Cache frequent queries to avoid redundant API calls.
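To act on the monitoring takeaway, you can turn the per-call usage counts the SDK returns (response.usage.prompt_tokens and response.usage.completion_tokens) into a dollar estimate. A minimal sketch, assuming illustrative per-million-token prices; check Together AI's pricing page for the real rates of each model:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate one call's cost from its token counts and per-million-token prices."""
    return (prompt_tokens * input_price_per_m +
            completion_tokens * output_price_per_m) / 1_000_000

# Illustrative prices only, not Together AI's actual rates
cost = estimate_cost_usd(1200, 300, input_price_per_m=0.88, output_price_per_m=0.88)
print(f"${cost:.6f}")  # → $0.001320
```

Logging this estimate per request makes it easy to spot which prompts or models dominate your bill before the dashboard does.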