How to optimize Together AI costs
Quick answer
To optimize Together AI costs, reserve large models like meta-llama/Llama-3.3-70B-Instruct-Turbo for tasks that truly need them, limit max_tokens, and batch requests to reduce overhead. Also, monitor usage and leverage caching to avoid redundant calls.

Prerequisites
- Python 3.8+
- Together AI API key
- pip install openai>=1.0
Setup
Install the openai Python package and set your TOGETHER_API_KEY environment variable before running the code.
pip install openai>=1.0

Step by step
Use the OpenAI-compatible SDK with base_url set to Together AI's endpoint. Limit max_tokens and choose appropriate models to reduce cost.
import os
from openai import OpenAI

# Point the OpenAI-compatible client at Together AI's endpoint
client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Explain RAG in 50 words."}],
    max_tokens=100,  # hard cap on billed completion tokens
)
print(response.choices[0].message.content)

Output
RAG (Retrieval-Augmented Generation) combines retrieval of relevant documents with generative models to produce accurate, context-aware responses by grounding generation in external knowledge.
Common variations
To further optimize costs, use smaller models like meta-llama/Llama-3.1-8B-Instruct for less complex tasks, and send prompts concurrently to cut wall-clock time. Note that async calls improve throughput but do not reduce token cost directly.
import asyncio
import os
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key=os.environ["TOGETHER_API_KEY"], base_url="https://api.together.xyz/v1")
    prompts = ["Summarize AI trends.", "Define semantic search."]
    # Issue the requests concurrently; each create() call returns a coroutine
    tasks = [
        client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=80,
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    for r in responses:
        print(r.choices[0].message.content)

asyncio.run(main())

Output
AI trends include increased use of foundation models, multimodal AI, and efficient fine-tuning techniques. Semantic search improves information retrieval by understanding intent and context rather than keyword matching.
Troubleshooting
If you encounter slow responses or high costs, check your max_tokens settings and reduce prompt length. Also, monitor your usage dashboard on Together AI to identify expensive calls. Use caching to avoid repeated queries.
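One way to implement the caching suggested above is to memoize completions on the (model, prompt, max_tokens) triple, so repeated identical queries never reach the API. This is a minimal sketch: fake_completion is a hypothetical stub standing in for the real client.chat.completions.create call, used here so the example runs offline.

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how many times the "API" is actually hit

def fake_completion(model: str, prompt: str, max_tokens: int) -> str:
    # Stand-in for a billed Together AI API call
    CALLS["count"] += 1
    return f"[{model}] answer to: {prompt}"

@lru_cache(maxsize=256)
def cached_completion(model: str, prompt: str, max_tokens: int) -> str:
    # Identical argument triples are served from the cache,
    # avoiding a second billed API call
    return fake_completion(model, prompt, max_tokens)

cached_completion("meta-llama/Llama-3.1-8B-Instruct", "Define RAG.", 80)
cached_completion("meta-llama/Llama-3.1-8B-Instruct", "Define RAG.", 80)
print("backend calls:", CALLS["count"])  # → backend calls: 1
```

Because lru_cache hashes the positional arguments, the cache only helps when prompts repeat exactly; for fuzzier reuse you would need a normalization step before the lookup.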
Key Takeaways
- Limit max_tokens to control token usage and reduce costs.
- Choose smaller models for simpler tasks to save on API expenses.
- Batch multiple prompts in async calls to improve throughput without increasing cost.
- Monitor usage regularly to identify and optimize expensive requests.
- Cache frequent queries to avoid redundant API calls.
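To act on the monitoring takeaway, you can turn the per-call usage counts the SDK returns (response.usage.prompt_tokens and response.usage.completion_tokens) into a dollar estimate. A minimal sketch, assuming illustrative per-million-token prices; check Together AI's pricing page for the real rates of each model:

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate one call's cost from its token counts and per-million-token prices."""
    return (prompt_tokens * input_price_per_m +
            completion_tokens * output_price_per_m) / 1_000_000

# Illustrative prices only, not Together AI's actual rates
cost = estimate_cost_usd(1200, 300, input_price_per_m=0.88, output_price_per_m=0.88)
print(f"${cost:.6f}")  # → $0.001320
```

Logging this estimate per request makes it easy to spot which prompts or models dominate your bill before the dashboard does.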