Code Beginner easy · 5 min

Rate limit errors during index creation

What you will learn

When you embed large document batches into your index, API rate limits will block you. Here's how to handle them.

Why this matters

You'll hit rate limits the first time you index a few thousand documents against a hosted API such as OpenAI. Understanding how to structure retries and batch delays saves hours of debugging and prevents production index creation from failing silently.

Skip if: you're indexing fewer than 100 small documents, or you're using a local embedding model (not an API). In both cases rate limiting is not a practical concern. You can also skip this if your provider imposes no rate limits or you've negotiated a high-tier plan.

Explanation

What it is: Rate limiting is when an API provider temporarily rejects requests because you've exceeded your allowed throughput (requests per minute or tokens per minute). With LlamaIndex, this happens during VectorStoreIndex.from_documents() when embedding many documents in quick succession.

How it works: When you call from_documents(), LlamaIndex splits each document into chunks and sends them to your configured embedding model (Settings.embed_model), not to the LLM. If 1,000 documents are submitted within a few seconds, the API sees more requests or tokens than your per-minute quota allows and returns HTTP 429 (Too Many Requests). The embedding client may retry a few times on its own (OpenAIEmbedding exposes a max_retries setting), but once those retries are exhausted the error propagates and index creation fails. You can catch the error, add delays between batches, or rely on built-in retry logic with exponential backoff.
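To make the failure mode concrete, here is a small self-contained simulation. It is a sketch, not LlamaIndex or OpenAI code: FakeEmbeddingAPI, TooManyRequests, and the fixed-window limit of 100 requests per 60 seconds are all hypothetical stand-ins chosen for illustration. Real providers use their own quota rules.

```python
class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""
    status_code = 429


class FakeEmbeddingAPI:
    """Toy API that accepts `limit` requests per `window` seconds (fixed window)."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.window_start = 0.0
        self.count = 0

    def embed(self, text, now):
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        # Reject anything beyond the quota for this window.
        if self.count >= self.limit:
            raise TooManyRequests("429 Too Many Requests")
        self.count += 1
        return [float(len(text))]  # dummy one-dimensional "embedding"


def send_burst(api, texts, now=0.0):
    """Send every text at the same instant; return (accepted, rejected)."""
    accepted, rejected = 0, 0
    for text in texts:
        try:
            api.embed(text, now)
            accepted += 1
        except TooManyRequests:
            rejected += 1
    return accepted, rejected


api = FakeEmbeddingAPI(limit=100, window=60.0)
chunks = [f"chunk {i}" for i in range(1000)]
print(send_burst(api, chunks, now=0.0))        # (100, 900)
print(send_burst(api, chunks[:50], now=61.0))  # (50, 0)
```

In this toy model, the burst of 1,000 requests gets 100 through and 900 rejected, while a smaller batch sent after the window resets is accepted in full. That is the behavior retries and batch delays are designed around.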

When to use it: Any time you're indexing more than about 100 documents at once, especially if your OpenAI account is on a free or standard tier (not enterprise). Always assume rate limits exist and design for them.

Analogy

Rate limiting is like a bouncer at a nightclub. You can get in, but only a few people per minute. If your entire group tries to rush the door at once, the bouncer stops you and says "come back in a few seconds." Exponential backoff is like waiting longer each time (first 1 second, then 2, then 4) until the bouncer lets you in.

Code

python

import time

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from openai import RateLimitError

# Configure the LLM and the embedding model.
# Building a VectorStoreIndex calls the embedding model, not the LLM.
Settings.llm = OpenAI(model="gpt-4.1")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

documents = SimpleDirectoryReader("./sample_data").load_data()

print(f"Loaded {len(documents)} documents")

def create_index_with_retry(docs, max_retries=3, initial_wait=2):
    """Build the index, retrying on RateLimitError with exponential backoff."""
    wait_time = initial_wait
    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}: Creating index...")
            # Each attempt re-runs from_documents() from scratch.
            index = VectorStoreIndex.from_documents(docs)
            print("Index created successfully.")
            return index
        except RateLimitError as e:
            if attempt < max_retries - 1:
                print(f"Rate limit hit. Waiting {wait_time} seconds before retry...")
                time.sleep(wait_time)
                wait_time *= 2  # double the wait each time: 2s, 4s, 8s, ...
            else:
                # Out of retries: report the last error and re-raise it.
                print(f"Max retries exceeded. Last error: {e}")
                raise

index = create_index_with_retry(documents, max_retries=3, initial_wait=2)
print(f"Final index created with {len(documents)} documents.")

Output

Loaded 3 documents
Attempt 1: Creating index...
Index created successfully.
Final index created with 3 documents.

What just happened?

The code attempts to create an index from the documents. If a <code>RateLimitError</code> is raised (HTTP 429 from OpenAI), the function catches it, waits an exponentially increasing amount of time, and retries. It either returns the index on success or re-raises the error after 3 failed attempts. In this example, no rate limit was hit, so it succeeded on the first try and printed the success messages.
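Retrying is one half of the picture. The other is pacing work so the limit is hit less often. The sketch below shows the batch-delay idea in plain Python. process_in_batches and handle_batch are hypothetical names, and the batch size and delay are illustrative values rather than recommendations. With LlamaIndex, handle_batch could plausibly build the index from the first batch and add later documents with index.insert(), but treat that wiring as an assumption to check against your installed version.

```python
import time


def process_in_batches(items, handle_batch, batch_size=50, delay_seconds=1.0,
                       sleep=time.sleep):
    """Call handle_batch on fixed-size slices of items, pausing between slices.

    handle_batch is a hypothetical callable: it takes a list and returns a list.
    sleep is injectable so the pacing can be checked without real waiting.
    """
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(handle_batch(batch))
        # Pause only if another batch is still to come.
        if start + batch_size < len(items):
            sleep(delay_seconds)
    return results


# Demo with a stand-in handler and a recorded (not real) sleep.
pauses = []
fake_chunks = [f"chunk {i}" for i in range(120)]
embedded = process_in_batches(
    fake_chunks,
    handle_batch=lambda batch: [len(text) for text in batch],
    batch_size=50,
    delay_seconds=1.0,
    sleep=pauses.append,
)
print(len(embedded), pauses)  # 120 [1.0, 1.0]
```

Here 120 items become three batches with two pauses between them. In practice you would size the batch and the delay from your account's requests-per-minute and tokens-per-minute limits.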

Common gotcha

Developers often assume that one failed attempt means the operation is broken. In reality, a single rate limit error is temporary and expected at scale. The first gotcha is not wrapping index creation in retry logic, so your code fails unnecessarily on the first rate limit spike. The second gotcha is not using exponential backoff. If you retry with fixed 1-second delays, your retries land while you are still inside the rate limit window, and more of them fail than necessary.
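The difference between the two schedules is easy to see in numbers. The helpers below are an illustrative sketch. fixed_delays and exponential_delays are hypothetical names, and the cap parameter is an added assumption to keep waits bounded, not something the lesson's code uses.

```python
def fixed_delays(delay, retries):
    """Same wait before every retry."""
    return [delay] * retries


def exponential_delays(initial, retries, factor=2, cap=60):
    """Wait initial, initial*factor, initial*factor**2, ... up to cap seconds."""
    return [min(initial * factor ** attempt, cap) for attempt in range(retries)]


fixed = fixed_delays(1, 5)
backoff = exponential_delays(2, 5)
print(fixed, sum(fixed))      # [1, 1, 1, 1, 1] 5
print(backoff, sum(backoff))  # [2, 4, 8, 16, 32] 62
```

Assuming a per-minute quota window, five fixed 1-second retries cover only 5 seconds in total, so every one of them can fall inside the same exhausted window. Five exponential retries starting at 2 seconds cover 62 seconds, which is long enough to outlast it.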

Error recovery

RateLimitError

Raised when the OpenAI API returns 429 (Too Many Requests). Fix: wrap index creation in a retry loop with exponential backoff (wait 2s, then 4s, then 8s, etc.). Import the exception with <code>from openai import RateLimitError</code> and catch it in the loop.

APIError with 'rate_limit_exceeded' message

Some API errors carry rate limit information only in the message text. Fix: check whether 'rate_limit' appears in the error message and, if it does, apply the same exponential backoff retry pattern.
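A check like this can live in one small helper. The sketch below is an assumption-laden illustration: is_rate_limit_error and FakeAPIError are hypothetical names, and the status_code attribute is read with getattr because not every exception type is guaranteed to carry one.

```python
def is_rate_limit_error(exc):
    """Return True if an exception looks like a rate limit failure."""
    # Prefer a structured signal when the exception exposes one.
    if getattr(exc, "status_code", None) == 429:
        return True
    # Fall back to inspecting the message text.
    message = str(exc).lower()
    return "rate_limit" in message or "rate limit" in message


class FakeAPIError(Exception):
    """Hypothetical stand-in for a provider error, used only for this demo."""

    def __init__(self, message, status_code=None):
        super().__init__(message)
        self.status_code = status_code


print(is_rate_limit_error(FakeAPIError("Error code: rate_limit_exceeded")))  # True
print(is_rate_limit_error(FakeAPIError("Too Many Requests", 429)))           # True
print(is_rate_limit_error(FakeAPIError("Invalid API key", 401)))             # False
```

In a retry loop, you would catch the broader error type, retry when the helper returns True, and re-raise otherwise so that unrelated failures such as authentication errors are not retried.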

Index creation completes but some documents are missing

This indicates partial failure during batch processing. Fix: confirm that your retry logic wraps the whole from_documents() call and that nothing inside it swallows errors for individual documents. If individual embeddings fail silently, increase max_retries or reduce the embedding batch size. Passing show_progress=True, as in <code>VectorStoreIndex.from_documents(documents, show_progress=True)</code>, displays a progress bar so you can see how far indexing gets.
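One way to confirm whether documents are missing is to compare the IDs you loaded against the IDs the index reports. The helper below is a plain-Python sketch, and find_missing_doc_ids is a hypothetical name. With LlamaIndex, the expected IDs might come from each document's doc_id and the indexed IDs from the keys of index.ref_doc_info, but both of those are assumptions to verify for your version and vector store.

```python
def find_missing_doc_ids(expected_ids, indexed_ids):
    """Return expected IDs that are absent from the index, in original order."""
    indexed = set(indexed_ids)
    return [doc_id for doc_id in expected_ids if doc_id not in indexed]


# Demo with made-up IDs; in practice these come from your loader and index.
expected = ["doc-1", "doc-2", "doc-3", "doc-4"]
indexed = ["doc-1", "doc-3"]
print(find_missing_doc_ids(expected, indexed))  # ['doc-2', 'doc-4']
```

Any IDs this returns can be re-submitted in a smaller batch instead of rebuilding the whole index.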

Experienced dev note

Senior developers know that rate limits aren't a bug. They're a feature that protects the API provider's infrastructure. The real win is exponential backoff, which is not just about waiting longer but about waiting in a way that respects the API's recovery time. Fixed delays of 1 second often fail because the server is still overloaded or your quota window has not yet reset. Exponential backoff gives the server breathing room. Also, if you're regularly indexing more than 10,000 documents, switch to batch embeddings or a local model (Ollama, Hugging Face) to avoid the problem entirely. Don't design your critical path around rate limit retries.

Check your understanding

Why does wrapping your index creation in a retry loop with fixed 1-second delays often fail to recover from rate limits, whereas exponential backoff (2s, 4s, 8s) usually succeeds?

Answer hint

A correct answer explains that the API server itself needs time to recover and deprioritize your client after a rate limit. Fixed delays assume the server is ready immediately; exponential backoff acknowledges that you need to wait longer as backpressure increases, giving the server time to drain the queue and accept new requests.

VERSION: In llama-index releases before 0.10, a ServiceContext was required to set embed_model. From 0.10 onward, set Settings.embed_model directly. RateLimitError has been importable from openai (not llama_index) since openai 1.0.0 (released November 2023).

Once you handle rate limits, learn how to customize which fields get embedded and which are chunked, since this directly reduces the number of API calls needed during index creation.
