Core concepts: index, query engine, retriever
Why this matters
These three components are the foundation of every RAG system you'll build with LlamaIndex. Understanding how they fit together is essential before you can customize retrieval, add filters, or optimize for production.
Explanation
What it is: LlamaIndex provides three core abstractions. An Index is a data structure that organizes and stores your documents for efficient retrieval. A Retriever fetches the chunks of that index that are most relevant to a query. A QueryEngine combines a retriever with an LLM: it retrieves relevant context and then has the LLM synthesize an answer from it.
How it works mechanically: You feed documents into an Index (typically a VectorStoreIndex), which splits them into chunks (called nodes) and stores an embedding for each one. Calling .as_retriever() on that index gives you a retriever that finds the chunks semantically closest to a query. Calling .as_query_engine() makes LlamaIndex create a retriever internally and pair it with the LLM, so a single .query() call performs retrieval followed by LLM synthesis.
When to use each: Use a retriever directly when you only need ranked search results. Use a query engine when you need the LLM to reason over those results and produce prose answers. Think of the retriever as the "search" layer and the query engine as the "reasoning" layer.
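The division of labor between the three roles can be modeled without any external services. The following is a minimal, dependency-free sketch, not LlamaIndex code: ToyIndex, ToyRetriever, ToyQueryEngine, the word-count embed() function and fake_llm are hypothetical stand-ins for a vector index, an embedding model and a real LLM. Only the method names as_retriever()/as_query_engine() and the similarity_top_k parameter echo the real API.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Index: stores each chunk together with its vector."""

    def __init__(self, chunks: List[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def as_retriever(self, similarity_top_k: int = 2) -> "ToyRetriever":
        return ToyRetriever(self, similarity_top_k)

    def as_query_engine(self, llm: Callable[[str], str], similarity_top_k: int = 2) -> "ToyQueryEngine":
        # The query engine builds its own retriever over the same index
        return ToyQueryEngine(self.as_retriever(similarity_top_k), llm)


class ToyRetriever:
    """Retriever ("search" layer): ranks stored chunks by similarity. No LLM involved."""

    def __init__(self, index: ToyIndex, similarity_top_k: int):
        self.index = index
        self.top_k = similarity_top_k

    def retrieve(self, query: str) -> List[Tuple[str, float]]:
        q = embed(query)
        scored = [(chunk, cosine(q, vec)) for chunk, vec in self.index.entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[: self.top_k]


class ToyQueryEngine:
    """Query engine ("reasoning" layer): retrieves context, then asks the LLM to answer."""

    def __init__(self, retriever: ToyRetriever, llm: Callable[[str], str]):
        self.retriever = retriever
        self.llm = llm

    def query(self, question: str) -> str:
        context = "\n".join(chunk for chunk, _ in self.retriever.retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        return self.llm(prompt)


def fake_llm(prompt: str) -> str:
    # Hypothetical LLM: only reports how many context lines it was given
    context = prompt.split("Context:\n", 1)[1].split("\n\nQuestion:", 1)[0]
    return f"(answer synthesized from {len(context.splitlines())} chunks)"


index = ToyIndex([
    "Machine learning is a subset of artificial intelligence.",
    "Supervised learning uses labeled data to train models.",
    "The library opens at nine on weekdays.",
])
print(index.as_retriever(similarity_top_k=2).retrieve("What is machine learning?"))
print(index.as_query_engine(fake_llm, similarity_top_k=2).query("What is machine learning?"))
```

The query engine never touches the stored vectors itself; it delegates to the retriever it constructs, which is why both interfaces see the same top-k chunks for the same query.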
Analogy
Imagine a library system. The Index is the catalog and organized shelves. The Retriever is the librarian who finds the most relevant books on your topic. The QueryEngine is the librarian who also reads those books and writes a custom report answering your specific question.
Code
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Fail fast if the OpenAI key is missing
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError('OPENAI_API_KEY environment variable not set')

# Global defaults: the LLM used for synthesis and the model used for embeddings
Settings.llm = OpenAI(api_key=api_key, model='gpt-4.1')
Settings.embed_model = OpenAIEmbedding(api_key=api_key, model='text-embedding-3-small')

# Load the files in ./data, split them into nodes, and embed the nodes into a vector index
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

# Same index, two interfaces: search only vs. search + LLM synthesis
retriever = index.as_retriever(similarity_top_k=2)
query_engine = index.as_query_engine(similarity_top_k=2)

# Retriever: returns NodeWithScore objects; no LLM call (only the query is embedded)
retrieval_result = retriever.retrieve('What is machine learning?')
print('Retriever returned nodes:')
for node in retrieval_result:
    print(f' - {node.get_content()[:60]}...')

# Query engine: retrieves the top-k chunks, then asks the LLM to write an answer
query_result = query_engine.query('What is machine learning?')
print(f'\nQuery engine response:\n{query_result}')

Example output:
Retriever returned nodes:
 - Machine learning is a subset of artificial intell...
 - Supervised learning uses labeled data to train m...

Query engine response:
Machine learning is a computational approach where...
What just happened?
The code built an index from the documents in the 'data' directory, then created two interfaces over it: a retriever, which only fetches relevant chunks, and a query engine, which retrieves chunks and passes them to the LLM. The retriever printed raw node content; the query engine printed an LLM-synthesized response built from those same retrieved chunks.
Common gotcha
Developers often assume the retriever and query engine both call the LLM. They don't: the retriever returns raw chunks with no LLM processing. Only the query engine calls the LLM. If you use a retriever expecting natural language answers, you'll get unprocessed text snippets instead.
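One way to see this is to count calls. In the sketch below, CountingEmbedder and CountingLLM are hypothetical stubs in place of real models, and word overlap stands in for vector similarity; it illustrates the call pattern rather than LlamaIndex internals. One subtlety the counters expose: the retriever does call the embedding model once per query (the query has to be embedded before it can be compared with the stored vectors), but it never calls the LLM.

```python
from typing import List


class CountingEmbedder:
    """Hypothetical embedding-model stub that counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> set:
        self.calls += 1
        return set(text.lower().split())  # a word set stands in for a vector


class CountingLLM:
    """Hypothetical LLM stub that counts calls and returns a canned answer."""

    def __init__(self) -> None:
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return "synthesized answer"


class Retriever:
    def __init__(self, chunks: List[str], embedder: CountingEmbedder, top_k: int = 2):
        self.embedder = embedder
        self.top_k = top_k
        # Chunks are embedded once, at index-build time
        self.vectors = [(c, embedder.embed(c)) for c in chunks]

    def retrieve(self, query: str) -> List[str]:
        q = self.embedder.embed(query)  # one embedding call per query
        ranked = sorted(self.vectors, key=lambda cv: len(q & cv[1]), reverse=True)
        return [c for c, _ in ranked[: self.top_k]]


class QueryEngine:
    def __init__(self, retriever: Retriever, llm: CountingLLM):
        self.retriever = retriever
        self.llm = llm

    def query(self, question: str) -> str:
        context = "\n".join(self.retriever.retrieve(question))
        return self.llm.complete(f"{context}\n\nQ: {question}")  # the only LLM call


embedder, llm = CountingEmbedder(), CountingLLM()
retriever = Retriever(["ml is a subset of ai", "supervised learning uses labels", "opening hours"], embedder)
build_calls = embedder.calls  # one embedding call per chunk

chunks = retriever.retrieve("what is ml")
print(chunks, "| embed calls:", embedder.calls - build_calls, "| LLM calls:", llm.calls)

answer = QueryEngine(retriever, llm).query("what is ml")
print(answer, "| LLM calls:", llm.calls)
```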
Error recovery
 - ImportError: cannot import name 'VectorStoreIndex'
 - FileNotFoundError: [Errno 2] No such file or directory: 'data'
 - RateLimitError from OpenAI
Experienced dev note
The query engine is the convenience layer, but it hides the two steps (retrieval, then synthesis) behind a single call. For production systems, make the retriever and LLM calls separately and explicitly: this gives you finer control over cost (you might not need to call the LLM every time), error handling, and observability. Also, similarity_top_k=2 is tiny for real data; start with 5 to 10 and measure answer quality against cost.
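A minimal sketch of what calling the two stages yourself might look like, in plain Python. The function answer, the MIN_SCORE cutoff and the fake_retrieve/fake_llm callables are hypothetical, not a LlamaIndex API; in a real pipeline the retrieve callable would wrap something like retriever.retrieve(...) and llm would wrap your own model call. The guard skips the LLM entirely when nothing relevant comes back, which is where the cost savings come from, and each stage gets its own log line.

```python
import logging
from typing import Callable, List, Optional, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag")

MIN_SCORE = 0.3  # hypothetical relevance cutoff; tune it on your own data


def answer(
    question: str,
    retrieve: Callable[[str], List[Tuple[str, float]]],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Retrieve and synthesize as two explicit, separately observable steps."""
    hits = retrieve(question)
    log.info("retrieved %d chunks, scores=%s", len(hits), [round(s, 2) for _, s in hits])

    relevant = [chunk for chunk, score in hits if score >= MIN_SCORE]
    if not relevant:
        log.info("no chunk scored >= %.2f; skipping the LLM call", MIN_SCORE)
        return None  # no LLM cost spent on a question the data cannot answer

    prompt = "Answer using only this context:\n" + "\n".join(relevant) + f"\n\nQuestion: {question}"
    try:
        return llm(prompt)
    except Exception:  # e.g. a provider rate-limit error
        log.exception("LLM call failed")
        raise


# Usage with hypothetical stand-in callables
def fake_retrieve(q: str) -> List[Tuple[str, float]]:
    return [("Machine learning is a subset of AI.", 0.82), ("Opening hours: 9-5.", 0.05)]


llm_calls: List[str] = []


def fake_llm(prompt: str) -> str:
    llm_calls.append(prompt)
    return "ML is a subset of AI."


print(answer("What is machine learning?", fake_retrieve, fake_llm))
print(answer("Unrelated?", lambda q: [("Opening hours: 9-5.", 0.05)], fake_llm))
```

Because retrieval and synthesis are separate calls, you can also evaluate retrieval quality on its own (for example, while raising similarity_top_k) without paying for an LLM call on every test query.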