Creating an index from documents
Why this matters
Indexing is the foundation of retrieval-augmented generation (RAG). Without it, you're scanning raw documents on every query, which is slow, expensive, and unworkable at scale. An index lets you search by semantic meaning rather than exact keywords.
Explanation
What it is: VectorStoreIndex converts your raw documents into embeddings (numerical vectors representing meaning) and stores them in a searchable format. When you query later, your question gets embedded too, and the index finds documents with similar embeddings.
How it works: The index pipeline reads documents, chunks them into manageable pieces, generates an embedding for each chunk using an embedding model (a separate model from the LLM that writes answers), and stores those vectors with metadata. Under the hood, it uses a vector store (in-memory by default) that compares your query embedding against the stored embeddings using cosine similarity to find the closest matches.
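To make the pipeline concrete, here is a minimal, dependency-free sketch of the chunk, embed, store, and compare steps. It is not how LlamaIndex implements them: `chunk_text`, `toy_embed`, and `top_k` are hypothetical names, and the word-count "embedding" is only a stand-in for a real embedding model, chosen so the cosine-similarity search is easy to follow.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=8):
    """Split text into chunks of at most `chunk_size` words (toy chunker)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(word.strip(".,?!").lower() for word in text.split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[key] * b.get(key, 0) for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, store, k=2):
    """Return the k stored chunks most similar to the query, best first."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

documents = [
    "Cats sleep most of the day. Cats purr when they are content.",
    "Vector indexes store embeddings. A query embedding is compared to stored embeddings.",
]
# "Index" step: chunk every document and store (chunk, vector) pairs.
store = [(chunk, toy_embed(chunk)) for doc in documents for chunk in chunk_text(doc)]

# "Query" step: embed the question the same way and rank chunks by similarity.
results = top_k("How is a query compared to stored embeddings?", store, k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

A real embedding model captures meaning rather than shared words, but the ranking logic is the same: embed the query the same way as the chunks, score every stored vector, and keep the best few.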
When to use it: Use VectorStoreIndex whenever you need semantic search over documents: customer support FAQs, documentation Q&A, research paper search, or any RAG application. It's the most common starting point in LlamaIndex because it handles the entire pipeline for you.
Analogy
Think of it like creating a library card catalog. Documents are books. Embeddings are the metadata tags (genre, topic, sentiment) you assign to each book. When someone asks a question, you don't read every book: you look up the tags that match the question and return the most similar books.
Code
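from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
import os
# Placeholder key: in real projects, set OPENAI_API_KEY in your shell instead of hardcoding it.
os.environ['OPENAI_API_KEY'] = 'sk-your-key-here'
# LLM that writes the final answer (embeddings use a separate model).
Settings.llm = OpenAI(model='gpt-4o-mini')
# Load the files in ./documents into Document objects.
documents = SimpleDirectoryReader('./documents').load_data()
# Chunk, embed, and store the documents in an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)
# Retrieve the most similar chunks and have the LLM answer from them.
query_engine = index.as_query_engine()
response = query_engine.query('What is machine learning?')
print(response)
Output:
Machine learning is a branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It involves training algorithms on data to identify patterns and make predictions or decisions based on those patterns.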
What just happened?
SimpleDirectoryReader loaded the supported files from ./documents/ into Document objects. VectorStoreIndex.from_documents() split each document into chunks, called OpenAI's embedding API to convert each chunk into a vector, and stored those vectors in an in-memory vector store with pointers back to the original text. When you called query(), the query engine embedded your question, found the 2 most similar chunks by cosine similarity (2 is the default top-k), and passed them to gpt-4o-mini along with your question to generate the response.
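The final step, handing the retrieved chunks and the question to the LLM, amounts to assembling a prompt. The sketch below illustrates the idea only: `build_prompt` and its template are hypothetical and are not the prompt LlamaIndex actually sends.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number each chunk so the model (and a human reader) can tell them apart.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Two chunks standing in for what the retriever returned.
retrieved = [
    "Machine learning is a branch of artificial intelligence.",
    "It trains algorithms on data to find patterns.",
]
prompt = build_prompt("What is machine learning?", retrieved)
print(prompt)
```

Because the model only sees what is placed in this context, the answer can be no better than the chunks that were retrieved.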
Common gotcha
Developers often assume that more documents = better answers. But if your documents don't contain the answer, indexing won't magically find it. Also, the index is in-memory by default: close your Python process and it's gone, and rebuilding it means paying for the embeddings again. To keep it, either persist the index to disk (for example with index.storage_context.persist()) or configure a dedicated vector store such as FAISS or an external vector database.
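One way to keep the default index across restarts is to write it to disk and reload it later. The sketch below assumes the llama-index package used in the example above. The helper names `save_index` and `load_index` and the `./storage` directory are illustrative choices; `storage_context.persist`, `StorageContext.from_defaults`, and `load_index_from_storage` are the LlamaIndex calls doing the work.

```python
def save_index(index, persist_dir="./storage"):
    """Write the index's vectors, text chunks, and metadata to disk."""
    index.storage_context.persist(persist_dir=persist_dir)

def load_index(persist_dir="./storage"):
    """Rebuild a previously saved index without re-embedding the documents."""
    # Imported lazily so this snippet can be loaded without llama-index installed.
    from llama_index.core import StorageContext, load_index_from_storage
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
```

Call `save_index(index)` once after building the index; in later sessions, `index = load_index()` takes the place of `VectorStoreIndex.from_documents(documents)`. Queries still embed the question, so the same embedding model needs to be configured when you reload.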
Error recovery
FileNotFoundError: [Errno 2] No such file or directory: './documents'
AuthenticationError: Invalid API key provided
ValueError: No documents found
ImportError: cannot import name 'VectorStoreIndex'
Experienced dev note
The embedding model choice matters more than you think. By default, LlamaIndex uses an OpenAI embedding model (text-embedding-ada-002 unless you set Settings.embed_model); a cheaper hosted option such as text-embedding-3-small costs $0.02 per 1M tokens. But if you're indexing large corpora, switch to a local embedding model (sentence-transformers via HuggingFaceEmbedding) to avoid per-token embedding costs and network latency. Also: chunk size affects quality. The default of 1024 tokens works for most cases, but dense technical docs often retrieve better with smaller chunks (512), and sparse narrative docs can handle larger ones (2048). Finally, always test your index with a few queries before going to production: a poor fit between the embedding model and your queries raises no error, it just returns weaker matches.
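As a rough sketch of both points, the snippet below estimates hosted embedding cost at the $0.02 per 1M tokens quoted above and shows one way to switch to a local model with smaller chunks. `estimate_embedding_cost` and `use_local_embeddings` are hypothetical helper names, the `BAAI/bge-small-en-v1.5` model is only an example choice, and the local option assumes the separate llama-index-embeddings-huggingface package is installed.

```python
def estimate_embedding_cost(total_tokens, usd_per_million_tokens=0.02):
    """Estimate the one-time cost of embedding a corpus with a hosted model."""
    return total_tokens / 1_000_000 * usd_per_million_tokens

def use_local_embeddings(model_name="BAAI/bge-small-en-v1.5", chunk_size=512):
    """Point LlamaIndex at a local sentence-transformers model and smaller chunks."""
    # Imported lazily: requires llama-index and llama-index-embeddings-huggingface.
    from llama_index.core import Settings
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    Settings.embed_model = HuggingFaceEmbedding(model_name=model_name)
    Settings.chunk_size = chunk_size

# Example: 10,000 documents averaging 5,000 tokens each is 50M tokens.
cost = estimate_embedding_cost(10_000 * 5_000)
print(f"Estimated embedding cost: ${cost:.2f}")
```

If you use it, call `use_local_embeddings()` before `VectorStoreIndex.from_documents(...)` so the settings apply when chunks are created and embedded. An index should be queried with the same embedding model that built it, so changing the model means rebuilding the index.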
Check your understanding
If you indexed 100 documents but your query returns results from only 2-3 of them, and those 2-3 results are irrelevant to your question, what is the most likely root cause: the index is broken, or something about your document content or query phrasing doesn't match? How would you diagnose this without re-indexing?
Answer hint
A correct answer recognizes that the index itself is rarely what's broken: the more likely issue is a semantic mismatch between the documents and the query. Diagnosis methods that don't require re-indexing include: (1) manually checking whether the relevant docs were actually included in the index, (2) inspecting the retrieved chunks directly via retriever.retrieve(query) to see what was returned before the LLM step, (3) testing with very specific keywords from known relevant docs, and (4) confirming that the query is embedded with the same model (and therefore the same vector dimensions) that was used to build the index.
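Method (2) can be sketched as a small helper that prints what the retriever returns before any LLM is involved. `inspect_retrieval` is a hypothetical name; it assumes a retriever such as the one returned by `index.as_retriever(similarity_top_k=5)`, whose `retrieve()` results expose a `score` attribute and a `get_content()` method.

```python
def inspect_retrieval(retriever, query, preview_chars=80):
    """Print and return (score, text preview) for each chunk retrieved for `query`."""
    rows = []
    for result in retriever.retrieve(query):
        # Flatten newlines so each chunk prints on a single line.
        preview = result.get_content()[:preview_chars].replace("\n", " ")
        rows.append((result.score, preview))
        score = "n/a" if result.score is None else f"{result.score:.3f}"
        print(f"{score}  {preview}")
    return rows
```

Low scores across the board, or previews that are clearly off-topic, point to a content or phrasing mismatch rather than a broken index.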