Rate limit errors during index creation
Why this matters
You'll hit rate limits the first time you index a few thousand documents against a hosted API such as OpenAI or Anthropic. Understanding how to structure retries and batch delays saves hours of debugging and keeps production index creation from failing unexpectedly.
Explanation
What it is: Rate limiting is when an API provider temporarily rejects requests because you've exceeded your allowed throughput (requests per minute or tokens per minute). With LlamaIndex, this happens during VectorStoreIndex.from_documents() when embedding many documents in quick succession.
How it works: When you call from_documents(), LlamaIndex splits each document into chunks and sends them to your configured embedding model (Settings.embed_model) for embedding. If you index 1,000 documents in one go, the API sees a burst of requests and tokens within seconds and can return HTTP 429 (Too Many Requests). Depending on the integration and version, the embedding client may retry a few times on its own, but once those retries are exhausted the error propagates and index creation fails. You can catch the error, add delays between batches, or use built-in retry logic with exponential backoff.
When to use it: Any time you're indexing more than 50 documents at once, especially if your OpenAI account is on the free tier or a low usage tier (not enterprise). Always assume rate limits exist and design for them.
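The "add delays between batches" option can be sketched without any API calls. This is a minimal illustration, not LlamaIndex's own batching: batched, embed_in_batches, and fake_embed are hypothetical names, and fake_embed stands in for a real embedding call so the example runs offline.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed `chunks` one batch at a time, pausing between batches."""
    vectors: List[List[float]] = []
    for i, batch in enumerate(batched(chunks, batch_size)):
        if i > 0:
            sleep(pause_seconds)  # spread requests out; no pause before the first batch
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in embedder: one 1-dimensional vector per chunk."""
    return [[float(len(text))] for text in batch]


chunks = [f"chunk {n}" for n in range(250)]
vectors = embed_in_batches(chunks, fake_embed, batch_size=100, pause_seconds=0.0)
print(len(vectors))  # 250
```

In real use, embed_fn would wrap whatever embedding call you make, and pause_seconds would be tuned to your account's requests-per-minute and tokens-per-minute limits.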
Analogy
Rate limiting is like a bouncer at a nightclub. You can get in, but only a few people per minute. If your entire group tries to rush the door at once, the bouncer stops you and says "come back in a few seconds." Exponential backoff is like waiting longer each time (first 1 second, then 2, then 4) until the bouncer lets you in.
Code
import time

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from openai import RateLimitError

# Configure the LLM and the embedding model; indexing uses the embedding model.
Settings.llm = OpenAI(model="gpt-4.1")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("./sample_data").load_data()
print(f"Loaded {len(documents)} documents")

def create_index_with_retry(docs, max_retries=3, initial_wait=2):
    """Build the index, retrying with exponential backoff on rate limit errors."""
    wait_time = initial_wait
    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}: Creating index...")
            # Each attempt starts over and re-embeds every chunk.
            index = VectorStoreIndex.from_documents(docs)
            print("Index created successfully.")
            return index
        except RateLimitError as e:
            if attempt < max_retries - 1:
                print(f"Rate limit hit. Waiting {wait_time} seconds before retry...")
                time.sleep(wait_time)
                wait_time *= 2  # double the wait before the next attempt
            else:
                print(f"Max retries exceeded. Last error: {e}")
                raise

index = create_index_with_retry(documents, max_retries=3, initial_wait=2)
print(f"Final index created with {len(documents)} documents.")

Output:
Loaded 3 documents
Attempt 1: Creating index...
Index created successfully.
Final index created with 3 documents.
What just happened?
The code attempted to create an index from the documents. If a RateLimitError was raised (HTTP 429 from OpenAI), the function caught it, waited an exponentially increasing amount of time (2 seconds, then 4), and retried. It either returned the index on success or re-raised the error after the third failed attempt. Each retry calls from_documents() again from the start, so chunks embedded before the failure are embedded again. In this example, no rate limit was hit, so it succeeded on the first try and printed the success messages.
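The same retry logic can be pulled out into a reusable helper and exercised without network access. The sketch below is one possible shape, not a LlamaIndex or OpenAI API: retry_with_backoff, FakeRateLimitError, and flaky_index_build are hypothetical names, and the sleep function is injected so the waits can be recorded instead of actually slept.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for openai.RateLimitError so the sketch runs offline."""


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    initial_wait: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` up to `max_retries` times, doubling the wait after each failure."""
    wait_time = initial_wait
    for attempt in range(max_retries):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(wait_time)
            wait_time *= 2
    raise ValueError("max_retries must be at least 1")


calls = {"count": 0}


def flaky_index_build() -> str:
    """Fails twice with a simulated 429, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError("429 Too Many Requests")
    return "index"


waits: List[float] = []
result = retry_with_backoff(flaky_index_build, (FakeRateLimitError,), sleep=waits.append)
print(result, waits)  # index [2.0, 4.0]
```

With real code you would pass a function that calls VectorStoreIndex.from_documents(docs) as the operation and (RateLimitError,) as retry_on, leaving sleep at its default.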
Common gotcha
Developers often assume that one failed attempt means the operation is broken. In reality, a single rate limit error is temporary and expected at scale. The first gotcha is not wrapping index creation in retry logic, so your code fails unnecessarily on the first rate limit spike. The second is retrying without exponential backoff: with fixed 1-second delays, every retry lands inside the same rate limit window, so more retries fail than necessary.
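To make the difference concrete, the following sketch compares how much total time each strategy has waited by the time each retry fires. The helper names fixed_delays and exponential_delays are illustrative only.

```python
from itertools import accumulate
from typing import List


def fixed_delays(delay: float, retries: int) -> List[float]:
    """The same wait before every retry."""
    return [delay] * retries


def exponential_delays(initial: float, retries: int, factor: float = 2.0) -> List[float]:
    """Waits of initial, initial * factor, initial * factor**2, and so on."""
    return [initial * factor ** attempt for attempt in range(retries)]


fixed = fixed_delays(1.0, 3)
backoff = exponential_delays(2.0, 3)

# Each line prints the individual waits, then the cumulative time waited at each retry.
print(fixed, list(accumulate(fixed)))      # [1.0, 1.0, 1.0] [1.0, 2.0, 3.0]
print(backoff, list(accumulate(backoff)))  # [2.0, 4.0, 8.0] [2.0, 6.0, 14.0]
```

After three retries the fixed schedule has waited only 3 seconds in total, while the exponential schedule has waited 14, so its later retries are further from the burst that triggered the limit.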
Error recovery
- RateLimitError
- APIError with 'rate_limit_exceeded' message
- Index creation completes but some documents are missing
Experienced dev note
Senior developers know that rate limits aren't a bug; they're a feature that protects API providers' infrastructure. The real win is exponential backoff. It's not just about waiting longer; it's about waiting in a way that respects how the limit recovers. Fixed delays of 1 second often fail because limits are counted per minute and your quota has not yet replenished. Exponential backoff gives it time to do so. Also, if you're regularly indexing more than 10,000 documents, switch to batch embeddings or a local model (Ollama, Hugging Face) to avoid the problem entirely. Don't design your critical path around rate limit retries.
Check your understanding
Why does wrapping your index creation in a retry loop with fixed 1-second delays often fail to recover from rate limits, whereas exponential backoff (2s, 4s, 8s) usually succeeds?
Answer hint
A correct answer explains that rate limits are measured over a window (requests or tokens per minute), so your quota needs time to replenish after a 429. Fixed short delays assume the API is ready again almost immediately; exponential backoff waits longer after each failure, so later retries are more likely to land after enough quota has been restored.
Before LlamaIndex 0.10.x, ServiceContext was required to set embed_model. In 0.10.x+, use Settings.embed_model directly. The RateLimitError import from openai (not llama_index) has been stable since openai >= 1.0.0 (November 2023).