How to handle classification at scale
Why this happens
When performing classification at scale, sending one request per input causes excessive API calls, triggering RateLimitError or timeouts. For example, naive code that calls client.chat.completions.create in a loop for thousands of inputs will hit API rate limits and degrade performance.
Typical error output:
openai.RateLimitError: Error code: 429 - You exceeded your current quota, please check your plan and billing details.
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

inputs = ["Text 1", "Text 2", "Text 3", ...]  # thousands of texts
results = []
for text in inputs:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Classify: {text}"}]
    )
    results.append(response.choices[0].message.content)
print(results[:3])
The fix
Batch inputs to reduce the number of API calls by sending multiple classification requests in one prompt. Use asynchronous concurrency to parallelize batches. Add retry logic with exponential backoff to handle transient rate limits.
This example batches inputs in groups of 10, sends them in one request, and retries on rate limit errors.
from openai import AsyncOpenAI, RateLimitError
import os
import asyncio
import backoff

# Use the async client so requests can run concurrently.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

inputs = [f"Text {i}" for i in range(1000)]  # large dataset
batch_size = 10

# Retry only on rate-limit errors; retrying on bare Exception would
# mask real bugs.
@backoff.on_exception(backoff.expo, RateLimitError, max_tries=5)
async def classify_batch(batch):
    prompt = "\n".join(f"Classify: {text}" for text in batch)
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    tasks = []
    for i in range(0, len(inputs), batch_size):
        batch = inputs[i:i+batch_size]
        tasks.append(classify_batch(batch))
    results = await asyncio.gather(*tasks)
    print(results[:3])

if __name__ == "__main__":
    asyncio.run(main())

Example output:

["Positive\nNegative\nNeutral\n...", "Positive\nPositive\nNegative\n...", "Neutral\nNeutral\nPositive\n..."]
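Each batched response is a single string containing several labels, so you still need to map them back to individual inputs. A minimal sketch of that post-processing step, assuming you instruct the model to return exactly one label per line in input order (`flatten_labels` is a hypothetical helper, not part of the OpenAI SDK):

```python
# Flatten batched responses into one label per input.
# Assumes each response contains one label per line, in input order;
# a prompt instruction like "Return one label per line" helps enforce this.
def flatten_labels(batch_responses):
    labels = []
    for response_text in batch_responses:
        labels.extend(line.strip() for line in response_text.splitlines() if line.strip())
    return labels

batched = ["Positive\nNegative\nNeutral", "Positive\nPositive"]
print(flatten_labels(batched))  # ['Positive', 'Negative', 'Neutral', 'Positive', 'Positive']
```

If label counts can drift (models occasionally merge or skip lines), validate that the number of labels matches the number of inputs per batch before trusting the alignment.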
Preventing it in production
Implement robust retry policies with exponential backoff to handle RateLimitError and transient network issues. Use batching to minimize API calls and concurrency to maximize throughput without exceeding rate limits.
Validate input sizes and model token limits to avoid request rejections. Monitor API usage and set alerts for quota exhaustion. Consider fallback models or caching frequent classifications to reduce load.
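The caching idea above can be as simple as an in-memory dictionary keyed by a hash of the input text, so repeated classifications never hit the API twice. A minimal sketch, where `classify_remote` is a hypothetical stand-in for the real API call:

```python
import hashlib

# In-memory cache keyed by a hash of the input text.
cache = {}

def classify_remote(text):
    # Hypothetical stand-in for the real API call.
    return "Positive"

def classify_cached(text):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = classify_remote(text)
    return cache[key]

classify_cached("Great product!")
classify_cached("Great product!")  # second call served from cache, no API hit
print(len(cache))  # 1
```

For production use you would typically swap the dictionary for a shared store such as Redis so the cache survives restarts and is shared across workers.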
Key Takeaways
- Batch multiple classification inputs per API call to reduce request volume.
- Use asynchronous concurrency to parallelize batches and improve throughput.
- Implement exponential backoff retries to handle rate limits and transient errors.