Code Beginner easy · 5 min

Creating an index from documents

What you will learn

Build a searchable vector index from raw documents in three lines of code.

Why this matters

Indexing is the foundation of retrieval-augmented generation (RAG). Without it, every query has to scan the raw documents, which is slow, expensive, and unworkable at scale. An index lets you search by semantic meaning rather than by keywords.

Skip if: Don't use VectorStoreIndex if your documents are already indexed in an external vector database (Pinecone, Weaviate, Qdrant). Instead, use the corresponding connector class. Also skip indexing if your document set is tiny (< 100 tokens total) and latency doesn't matter.

Explanation

What it is: VectorStoreIndex converts your raw documents into embeddings (numerical vectors representing meaning) and stores them in a searchable format. When you query later, your question gets embedded too, and the index finds documents with similar embeddings.

How it works: The index pipeline reads documents, splits them into manageable chunks, generates an embedding for each chunk using an embedding model (a separate model from the LLM that writes answers), and stores those vectors with metadata. Under the hood, it uses a vector store (in-memory by default) that compares your query embedding against the stored embeddings using cosine similarity to find the closest matches.
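To make the similarity step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. It is not LlamaIndex's implementation: the chunk ids and the 3-dimensional vectors are made up for illustration, and real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings", invented for this example.
stored_vectors = {
    "chunk-ml-intro": [0.9, 0.1, 0.0],
    "chunk-cooking": [0.0, 0.2, 0.9],
    "chunk-neural-nets": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, stored_vectors, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

The two machine-learning chunks point in nearly the same direction as the query vector, so they rank above the cooking chunk. Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing embeddings.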

When to use it: Use VectorStoreIndex whenever you need semantic search over documents: customer support FAQs, documentation Q&A, research paper search, or any RAG application. It's the most common starting point in LlamaIndex because it handles the entire pipeline for you.

Analogy

Think of it like creating a library card catalog. Documents are books. Embeddings are the metadata tags (genre, topic, sentiment) you assign to each book. When someone asks a question, you don't read every book: you look up the tags that match the question and return the most similar books.

Code

Illustrative only - not runnable without a valid API key

python

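from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
import os

# Placeholder only: in real projects, export the key in your shell instead of hardcoding it.
os.environ['OPENAI_API_KEY'] = 'sk-your-key-here'

# LLM that writes the final answer; embeddings use the default embedding model.
Settings.llm = OpenAI(model='gpt-4o-mini')

# Load every supported file in ./documents into Document objects.
documents = SimpleDirectoryReader('./documents').load_data()

# Chunk, embed, and store the documents in an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Retrieve the most similar chunks and ask the LLM to answer from them.
query_engine = index.as_query_engine()
response = query_engine.query('What is machine learning?')

print(response)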

Output

Machine learning is a branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It involves training algorithms on data to identify patterns and make predictions or decisions based on those patterns.

What just happened?

SimpleDirectoryReader loaded every supported file from ./documents/ into Document objects. VectorStoreIndex.from_documents() split each document into chunks, called OpenAI's embedding API to convert each chunk into a vector, and stored those vectors in an in-memory vector store with pointers back to the original text. When you called query(), the query engine embedded your question, found the 2 most similar chunks by cosine similarity (2 is the default similarity_top_k), and passed them to gpt-4o-mini along with your question to generate the response.

Common gotcha

Developers often assume that more documents = better answers. But if your documents don't contain the answer, indexing won't magically find it. Also, the index is in-memory by default: close your Python process and it's gone. To keep it, either persist it to disk with index.storage_context.persist() and reload it later, or back the index with an external vector database.
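One way to keep an index across process restarts is to persist its storage context to disk and load it back later. The sketch below assumes llama-index-core is installed. The helper names save_index and load_index and the ./storage directory are illustrative choices, not part of the library.

```python
def save_index(index, persist_dir="./storage"):
    """Write the index's documents, metadata, and vectors to persist_dir."""
    index.storage_context.persist(persist_dir=persist_dir)


def load_index(persist_dir="./storage"):
    """Rebuild the index from disk without re-embedding the documents."""
    # Imported lazily so these helpers can be defined even where
    # llama-index is not installed; calling load_index still requires it.
    from llama_index.core import StorageContext, load_index_from_storage

    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context)
```

Call save_index(index) once after building the index, then load_index() in later runs. Queries against the reloaded index still embed the question, so the same embedding model and API key must be available at query time.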

Error recovery

FileNotFoundError: [Errno 2] No such file or directory: './documents'

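The ./documents directory doesn't exist. Create it or pass the correct path to SimpleDirectoryReader(). Example: SimpleDirectoryReader('./data').load_data(). Depending on your llama-index version, the same problem may instead surface as a ValueError saying the directory does not exist; the fix is identical.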

AuthenticationError: Invalid API key provided

Your OPENAI_API_KEY environment variable is missing, empty, or malformed. Set it before running: export OPENAI_API_KEY='sk-...' or load it from a .env file.
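A small guard at startup can turn a late authentication failure into an immediate, readable error. This is a sketch using only the standard library. The function name require_openai_key is made up for illustration, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Export it in your shell or load it "
            "from a .env file before building the index."
        )
    return key
```

Calling require_openai_key() before VectorStoreIndex.from_documents() fails fast on a missing key instead of partway through embedding.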

ValueError: No documents found

SimpleDirectoryReader couldn't find any supported files (txt, pdf, md, etc.) in the directory. Check that files exist and have readable extensions.
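Both directory-related errors above can be caught before any loading starts. The following pre-flight check is a standard-library sketch, not part of LlamaIndex. The EXPECTED_SUFFIXES set is an assumed subset chosen for illustration (SimpleDirectoryReader supports more file types), and like the reader's default behavior it does not descend into subdirectories.

```python
from pathlib import Path

# Assumption: a small illustrative subset of the extensions you expect to load.
EXPECTED_SUFFIXES = {".txt", ".md", ".pdf"}


def find_input_files(directory, suffixes=EXPECTED_SUFFIXES):
    """Return matching files in `directory`, or raise with a specific reason."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Input directory does not exist: {root}")
    files = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
    if not files:
        raise ValueError(
            f"No files with suffixes {sorted(suffixes)} found in {root}"
        )
    return files
```

Running find_input_files('./documents') first tells you whether the path is wrong or the directory simply holds no readable files, which are two different fixes.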

ImportError: cannot import name 'VectorStoreIndex'

You're using an old llama-index version or importing from the wrong module. Update to llama-index-core >= 0.12.0 and import from llama_index.core, not llama_index.

Experienced dev note

The embedding model choice matters more than you think. Unless you set Settings.embed_model, LlamaIndex falls back to its default OpenAI embedding model (text-embedding-ada-002 in recent releases, so check your installed version); a cheaper model such as text-embedding-3-small, which costs about $0.02 per 1M tokens at the time of writing, has to be configured explicitly. If you're indexing large corpora, consider a local embedding model (sentence-transformers via HuggingFaceEmbedding) to avoid per-token API costs and network latency. Chunk size also affects quality: the default of 1024 tokens works for most cases, but dense technical docs often retrieve better with smaller chunks (512), while sparse narrative docs can handle larger ones (2048). Finally, always test your index with a few queries before going to production: a poor fit between the embedding model and your content degrades retrieval quietly, with no error raised.
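To see how chunk size changes what gets embedded, here is a deliberately simplified chunker. It is a sketch under stated assumptions: it counts whitespace-separated words rather than model tokens, it ignores sentence boundaries (LlamaIndex's own splitter is more careful), and the overlap of 20 is an arbitrary illustrative value.

```python
def chunk_words(text, chunk_size=1024, overlap=20):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 5000-word document, used only to compare chunk counts.
sample = " ".join(f"w{i}" for i in range(5000))
for size in (512, 1024, 2048):
    print(size, len(chunk_words(sample, chunk_size=size, overlap=20)))
```

Smaller chunks give more, narrower vectors, so each match is more precise but carries less surrounding context. Larger chunks do the opposite. In LlamaIndex itself the corresponding knob is Settings.chunk_size.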

Check your understanding

If you indexed 100 documents but your query returns results from only 2-3 of them, and those 2-3 results are irrelevant to your question, what is the most likely root cause: the index is broken, or something about your document content or query phrasing doesn't match? How would you diagnose this without re-indexing?

Answer hint

A correct answer recognizes that a broken index is the less likely cause: indexing problems usually raise errors, so the real issue is more often a semantic mismatch between documents and query. Diagnosis methods include: (1) manually checking whether the relevant docs were actually included in the index, (2) inspecting the retrieved chunks directly via retriever.retrieve(query) to see what was returned before the LLM saw it, (3) testing with very specific keywords from known relevant docs, and (4) checking embedding dimensions and model compatibility.
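Diagnosis method (2) can be sketched as a small helper that prints what the retriever returns before any LLM is involved. It assumes an index built with VectorStoreIndex, as in the code above. The helper name show_retrieved and its preview_chars parameter are illustrative, not library API.

```python
def show_retrieved(index, query, top_k=5, preview_chars=200):
    """Print score, metadata, and a text preview for each retrieved chunk."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    results = retriever.retrieve(query)
    for rank, hit in enumerate(results, start=1):
        text = hit.node.get_content()
        print(f"#{rank} score={hit.score} metadata={hit.node.metadata}")
        print(f"    {text[:preview_chars]!r}")
    return results
```

If the chunks printed here are irrelevant, the problem lies in retrieval (document content, chunking, or query phrasing) rather than in the LLM's answer, and nothing needs to be re-indexed to find that out.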

Version note: In llama-index < 0.10.0, the import was "from llama_index import GPTVectorStoreIndex" and the index required a ServiceContext. Both are removed in 0.12.x. Use VectorStoreIndex and Settings instead.

Once indexing works, learn how to customize retrieval by adjusting similarity_top_k and filtering results with metadata: this is how you shape what documents the index actually returns to your queries.
