Code Advanced hard · 8 min

Callback manager: custom event hooks

What you will learn
Hook into internal LlamaIndex events to monitor, log, and measure what happens during indexing, retrieval, and LLM calls.

Why this matters

In production RAG systems, you need visibility into what the framework is doing under the hood (token usage, latency, retrieval decisions, LLM input and output) without modifying your core indexing or query logic. Callbacks let you attach logging, tracing, metrics, and usage accounting at the right moment.

Skip if: Do NOT use callbacks for synchronous data transformation. If you need to modify documents before indexing, use preprocessing in your pipeline instead. Do NOT use callbacks as a substitute for proper error handling; handlers observe events, they do not replace or recover them. Do NOT create callbacks that perform heavy I/O (database writes, API calls) on every event; batch or debounce instead.

Explanation

What it is: The CallbackManager in llama-index-core is a pub-sub system that fires events at specific lifecycle points: before/after LLM calls, before/after embedding calls, before/after node retrieval, and during chat operations. You register custom handlers that execute when those events occur.

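How it works mechanically: Instrumented components (LLMs, embedding models, retrievers, query engines) hold a reference to a CallbackManager, either passed in through a callback_manager parameter where the constructor accepts one or taken from the global Settings.callback_manager. When an operation reaches an instrumented point, it calls callback_manager.on_event_start() and later callback_manager.on_event_end() with an event type, a payload dict of context (prompt, response, retrieved nodes, etc.), and an event ID that links the two calls. The manager forwards each call to every registered handler in order, skipping handlers that have opted out of that event type. You register handlers by passing them to CallbackManager(handlers=[...]) or by calling add_handler(). Each handler subclasses BaseCallbackHandler and implements on_event_start(), on_event_end(), start_trace(), and end_trace().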
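To make the dispatch order concrete, here is a toy, stdlib-only model of the pattern. It is a sketch, not the LlamaIndex implementation: ToyManager, ToyHandler, fake_llm_call, and the "llm" event name are hypothetical stand-ins chosen only to mirror the start/end pairing and the in-order fan-out described above.

```python
import uuid
from typing import Any, Dict, List, Optional


class ToyHandler:
    """Hypothetical handler: records which lifecycle edge it saw."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def on_event_start(self, event_type: str, payload: Optional[Dict[str, Any]], event_id: str) -> None:
        self.log.append(f"{self.name}:start:{event_type}")

    def on_event_end(self, event_type: str, payload: Optional[Dict[str, Any]], event_id: str) -> None:
        self.log.append(f"{self.name}:end:{event_type}")


class ToyManager:
    """Hypothetical manager: forwards each event to handlers in registration order."""

    def __init__(self, handlers: List[ToyHandler]) -> None:
        self.handlers = list(handlers)

    def on_event_start(self, event_type: str, payload: Optional[Dict[str, Any]] = None) -> str:
        event_id = str(uuid.uuid4())  # links the start call to its matching end call
        for handler in self.handlers:
            handler.on_event_start(event_type, payload, event_id)
        return event_id

    def on_event_end(self, event_type: str, payload: Optional[Dict[str, Any]], event_id: str) -> None:
        for handler in self.handlers:
            handler.on_event_end(event_type, payload, event_id)


def fake_llm_call(manager: ToyManager, prompt: str) -> str:
    """An 'instrumented' operation: it publishes start and end around its own work."""
    event_id = manager.on_event_start("llm", {"prompt": prompt})
    response = prompt.upper()  # stand-in for the real model call
    manager.on_event_end("llm", {"response": response}, event_id)
    return response


log: List[str] = []
manager = ToyManager([ToyHandler("a", log), ToyHandler("b", log)])
result = fake_llm_call(manager, "hello")
print(log)
```

Note that the handlers never return anything the operation uses: fake_llm_call produces the same result whether zero or ten handlers are registered.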

When to use it: Use callbacks when you need to: (1) measure latency and token usage across retrieval and LLM calls, (2) implement custom logging/tracing for debugging, (3) track usage against rate limits or quotas, (4) record prompts and responses for auditing, (5) flag suspicious results for review. To change what actually flows downstream (filtering nodes, rewriting text), use a custom retriever or node postprocessor instead.

Analogy

Think of it like middleware in Express.js or FastAPI: requests flow through your middleware stack, and each layer can log what passes before handing off to the next. Here, the 'requests' are internal LlamaIndex operations, and you're tapping into them without rewriting the core functions. The analogy has one limit: middleware can modify or reject a request, whereas callback handlers only observe.

Code

Illustrative only - not runnable without a valid API key
python
import time
from typing import Any, Dict, List, Optional

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import (
    CallbackManager,
    CBEventType,
    EventPayload,
    TokenCountingHandler,
)
from llama_index.core.callbacks.base_handler import BaseCallbackHandler
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Time only these event types; LatencyTracker ignores every other type.
TRACKED = {CBEventType.LLM, CBEventType.EMBEDDING, CBEventType.RETRIEVE}
IGNORED = [t for t in CBEventType if t not in TRACKED]


class ObserverHandler(BaseCallbackHandler):
    """No-op defaults for the abstract methods, so subclasses override only what they need."""

    def __init__(self, ignored: Optional[List[CBEventType]] = None) -> None:
        ignored = ignored or []
        super().__init__(event_starts_to_ignore=ignored, event_ends_to_ignore=ignored)

    def on_event_start(self, event_type, payload=None, event_id="", parent_id="", **kwargs) -> str:
        return event_id

    def on_event_end(self, event_type, payload=None, event_id="", **kwargs) -> None:
        pass

    def start_trace(self, trace_id=None) -> None:
        pass

    def end_trace(self, trace_id=None, trace_map=None) -> None:
        pass


class LatencyTracker(ObserverHandler):
    """Track latency of LLM, embedding, and retrieval events."""

    def __init__(self) -> None:
        super().__init__(ignored=IGNORED)
        self.events: List[Dict[str, Any]] = []
        self.start_times: Dict[str, float] = {}

    def on_event_start(self, event_type, payload=None, event_id="", parent_id="", **kwargs) -> str:
        # The same event_id is passed to the matching on_event_end call.
        self.start_times[event_id] = time.perf_counter()
        print(f"[START] {event_type.value}")
        return event_id

    def on_event_end(self, event_type, payload=None, event_id="", **kwargs) -> None:
        started = self.start_times.pop(event_id, None)
        elapsed = 0.0 if started is None else time.perf_counter() - started
        print(f"[END] {event_type.value}: {elapsed:.2f}s")
        self.events.append({"event_type": event_type.value, "elapsed": elapsed})


class FilterProfanity(ObserverHandler):
    """Inspect LLM responses for profanity. Observes only; the response is not changed."""

    def on_event_end(self, event_type, payload=None, event_id="", **kwargs) -> None:
        if event_type != CBEventType.LLM or not payload:
            return
        response = payload.get(EventPayload.RESPONSE)
        if response is not None and hasattr(response, "message"):
            text = response.message.content or ""
            cleaned = text.replace("bad_word", "***")
            print(f"[FILTER] Original length: {len(text)}, cleaned length: {len(cleaned)}")


# Token counts are not a payload key; the built-in TokenCountingHandler tallies them.
token_counter = TokenCountingHandler()
callback_manager = CallbackManager(
    handlers=[LatencyTracker(), FilterProfanity(), token_counter]
)

Settings.llm = OpenAI(model="gpt-4.1")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Set globally so the index, retriever, and query engine all report to it.
Settings.callback_manager = callback_manager

print("\n=== Creating index with callbacks ===")
documents = SimpleDirectoryReader(input_files=["sample.txt"]).load_data()
index = VectorStoreIndex.from_documents(documents)

print("\n=== Querying with callbacks ===")
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

print(
    f"\nLLM tokens: {token_counter.total_llm_token_count}, "
    f"embedding tokens: {token_counter.total_embedding_token_count}"
)
print(f"Final response: {response}")
Output
Illustrative; event order, timings, and token counts will vary.

=== Creating index with callbacks ===
[START] embedding
[END] embedding: 1.20s

=== Querying with callbacks ===
[START] retrieve
[START] embedding
[END] embedding: 0.32s
[END] retrieve: 0.35s
[START] llm
[END] llm: 0.92s
[FILTER] Original length: 312, cleaned length: 312

LLM tokens: 340, embedding tokens: 512
Final response: The document discusses machine learning fundamentals...

What just happened?

We registered our custom handlers (LatencyTracker and FilterProfanity) with a CallbackManager. As we indexed documents and ran a query, the framework fired events at each instrumented point (embedding, retrieval, LLM call). Each handler's on_event_start() ran when an operation began, logging the event type and recording the start time. When the operation completed, on_event_end() ran, computing the elapsed time and logging the result. The FilterProfanity handler also inspected the LLM response text carried in the event payload, without changing the response the caller receives.

Common gotcha

The most common mistake is assuming your handler receives an event object with attributes like event.tokens or event.response. It doesn't. on_event_start() and on_event_end() receive an event_type, a payload dict (which can be None), and an event_id. Read values with payload.get(key), using EventPayload members as keys, and expect the keys to vary by event type: an LLM end event typically carries EventPayload.RESPONSE, while embedding and retrieval events carry different keys. Token counts are not a payload key; use the built-in TokenCountingHandler for those. Print the payload during development to discover what's actually available. Also, if you modify the payload in a handler, those changes do NOT propagate downstream: callbacks observe and log, they don't transform the actual execution.
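A defensive way to read a payload can be sketched as follows. This is a stdlib-only illustration under stated assumptions: EventType and Key are hypothetical stand-ins for CBEventType and EventPayload, and the fake response object only imitates the message.content shape of a chat response.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Any, Dict, Optional


class EventType(str, Enum):  # hypothetical stand-in for CBEventType
    LLM = "llm"
    RETRIEVE = "retrieve"


class Key(str, Enum):  # hypothetical stand-in for EventPayload
    RESPONSE = "response"
    NODES = "nodes"


def response_text(event_type: EventType, payload: Optional[Dict[Any, Any]]) -> Optional[str]:
    """Return the LLM response text, or None when this event does not carry one."""
    if event_type != EventType.LLM or not payload:
        return None  # wrong event type, or payload is None/empty
    response = payload.get(Key.RESPONSE)  # .get() returns None instead of raising KeyError
    message = getattr(response, "message", None)
    content = getattr(message, "content", None)
    return content if isinstance(content, str) else None


chat_response = SimpleNamespace(message=SimpleNamespace(content="hello"))

llm_text = response_text(EventType.LLM, {Key.RESPONSE: chat_response})
retrieve_text = response_text(EventType.RETRIEVE, {Key.NODES: ["node-1"]})
missing_text = response_text(EventType.LLM, None)
dict_text = response_text(EventType.LLM, {Key.RESPONSE: {"text": "not a chat response"}})
print(llm_text, retrieve_text, missing_text, dict_text)
```

Each guard corresponds to one failure mode: the wrong event type, a missing payload, a missing key, and a value whose shape is not what the handler expected.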

Error recovery

KeyError (or None) when reading the payload
The event type you're inspecting doesn't contain that key, or the payload itself is None. Indexing with payload[key] raises KeyError; payload.get(key) returns None instead. Filter on the event_type argument so each handler only processes the events it understands, for example with an if event_type == CBEventType.LLM: guard.
CallbackManager not firing events
Set Settings.callback_manager before you build the index and the query engine, so every component picks it up. VectorStoreIndex.from_documents() also accepts a callback_manager argument. Settings.callback_manager is global, but a component constructed with its own manager reports to that one instead.
AttributeError: 'dict' object has no attribute 'message'
The structure of a payload value differs by event type and call path. For a chat-style LLM call, the response exposes response.message.content. For retrieval, the payload holds a list of nodes. Check the event type and inspect type(response) first.
Handler runs but changes don't take effect
Callbacks are read-only observers. If you need to modify data (filter nodes, transform text), do it in a custom retriever or node postprocessor, not in a callback.

Experienced dev note

In production, you'll want to attach callbacks to measure real token usage and latency, but beware: handlers run synchronously, inline with the operation that fired the event, so a slow handler (especially a slow on_event_end on a hot path) becomes the bottleneck. For high-volume systems, sample events (for example, every 10th call) or hand records to a background worker that writes them in batches. For the same reason, never put blocking I/O in a handler if you're using async queries, because it stalls the event loop. Finally, the event payload is mutable but ephemeral; if you need to persist context across callbacks, store it in instance variables on the handler object, not in the event payload.
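Sampling with state kept on the instance might look like the sketch below. SampledLatencyLog is a hypothetical helper, not part of LlamaIndex; in a real handler, start() and end() would be called from on_event_start() and on_event_end() with the event_id the framework supplies.

```python
import time
from typing import Any, Dict, List


class SampledLatencyLog:
    """Hypothetical helper: time every event, but keep only every Nth record."""

    def __init__(self, sample_every: int = 10) -> None:
        self.sample_every = sample_every
        self.seen = 0  # completed events observed so far
        self.kept: List[Dict[str, Any]] = []  # sampled records only
        self._starts: Dict[str, float] = {}  # event_id -> start time

    def start(self, event_id: str) -> None:
        self._starts[event_id] = time.perf_counter()

    def end(self, event_id: str, event_type: str) -> None:
        started = self._starts.pop(event_id, None)
        if started is None:
            return  # end without a matching start; nothing to time
        self.seen += 1
        if self.seen % self.sample_every == 0:
            elapsed = time.perf_counter() - started
            self.kept.append({"event_type": event_type, "elapsed": elapsed})


sampler = SampledLatencyLog(sample_every=10)
for i in range(25):
    sampler.start(f"evt-{i}")
    sampler.end(f"evt-{i}", "llm")
print(sampler.seen, len(sampler.kept))
```

Popping the start time on end keeps the dictionary from growing without bound, which matters for a handler that lives as long as the process.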

Check your understanding

You have a callback that logs all LLM predictions to a database on every on_event_end() call. Your query system now feels slow. You're confident the LLM and index are fast. What's the most likely culprit, and what's the fix?

Answer hint

A correct answer identifies that the callback's on_event_end() is probably doing synchronous I/O (database write) on the event thread, blocking the next operation. The fix is to either (1) make the I/O async, (2) queue events to an async worker, or (3) batch writes instead of one per event.
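Options (2) and (3) can be combined in a small queue-backed writer. This is a minimal stdlib sketch under stated assumptions: BatchedEventWriter and its write_batch callable are hypothetical, and write_batch stands in for whatever database call you would make. The handler's on_event_end() would call submit(), which only enqueues.

```python
import queue
import threading
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class BatchedEventWriter:
    """Hypothetical sink: callers enqueue records; a worker thread writes them in batches."""

    def __init__(self, write_batch: Callable[[List[Record]], None], batch_size: int = 10) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._write_batch = write_batch
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, record: Record) -> None:
        """Cheap enough to call from on_event_end: it only enqueues."""
        self._queue.put(record)

    def close(self) -> None:
        """Signal shutdown and wait until remaining records are flushed."""
        self._queue.put(None)  # sentinel
        self._worker.join()

    def _run(self) -> None:
        batch: List[Record] = []
        while True:
            record = self._queue.get()
            if record is None:
                break
            batch.append(record)
            if len(batch) >= self._batch_size:
                self._write_batch(batch)
                batch = []
        if batch:
            self._write_batch(batch)  # flush the final partial batch


written: List[List[Record]] = []
writer = BatchedEventWriter(write_batch=written.append, batch_size=3)
for i in range(7):
    writer.submit({"event_type": "llm", "elapsed": 0.1 * i})
writer.close()
print([len(batch) for batch in written])
```

The slow write now happens on the worker thread, so the query path pays only for a queue insert. Call close() at shutdown so the last partial batch is not lost.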

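VERSION Before llama-index 0.10.0, the callback classes were imported from llama_index.callbacks; since the 0.10.0 package split they live in llama_index.core.callbacks. Handlers subclass BaseCallbackHandler and receive an event type plus a payload dict. This lesson targets llama-index-core 0.12.x (April 2026); check the CBEventType enum in your installed version, since its members can differ between releases.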
NEXT

Now that you can observe LlamaIndex operations via callbacks, the next advanced pattern is implementing custom node postprocessors to filter or rerank retrieved nodes before they reach the LLM, combining callbacks with retrieval customization.
