What metadata gets attached to documents
Why this matters
When you build a RAG system, you'll want to filter search results by source file, creation date, or custom properties, and metadata is how you do that. If you don't know what metadata is available, you'll miss filtering opportunities that make your system useful in production.
Explanation
Metadata in LlamaIndex is structured information attached to each Document object that describes the document itself, rather than its content. This includes automatic fields like the source filename, page numbers, and creation date, plus any custom fields you add.

Mechanically, when you load documents using SimpleDirectoryReader or create Document objects manually, LlamaIndex extracts available metadata from the file system and document structure, storing it in the document's metadata dictionary. You can then query this metadata during retrieval, for example by filtering results to only documents from a specific folder or excluding results older than a certain date.

When to use it: Always capture metadata at ingestion time, because it's nearly free and becomes invaluable when your system scales to multiple document sources or when stakeholders ask "which file did that answer come from?"
Analogy
Think of metadata like the label on a physical file folder. The label tells you the folder's origin (who created it), date (when), category (project name), and other properties, but not what's inside. When you need to find something, you can quickly narrow down which folders to search based on the label before even opening them.
Code
from pathlib import Path
import tempfile

from llama_index.core import SimpleDirectoryReader, Document

# Create a temporary directory with a sample file
with tempfile.TemporaryDirectory() as tmpdir:
    sample_file = Path(tmpdir) / "sample.txt"
    sample_file.write_text("This is a sample document about AI.")

    # Load documents; the reader attaches filesystem metadata automatically
    reader = SimpleDirectoryReader(tmpdir)
    documents = reader.load_data()

# Inspect metadata from the loaded document
doc = documents[0]
print("=== Automatic Metadata ===")
print(f"Document ID: {doc.id_}")
print(f"Metadata keys: {list(doc.metadata.keys())}")
print(f"Full metadata: {doc.metadata}")
print()

# Create a document with custom metadata
print("=== Custom Metadata ===")
custom_doc = Document(
    text="This is a research paper on transformers.",
    metadata={
        "author": "John Doe",
        "publication_date": "2024-01-15",
        "source_url": "https://arxiv.org/abs/2406.12345",
        "category": "research",
        "confidence_score": 0.95,
    },
)
print(f"Custom document metadata: {custom_doc.metadata}")
print()

# Access specific fields with .get() so a missing key returns a default
# instead of raising KeyError
print("=== Accessing Metadata Fields ===")
print(f"Author: {custom_doc.metadata.get('author', 'Unknown')}")
print(f"Category: {custom_doc.metadata.get('category', 'Uncategorized')}")
print(f"Confidence: {custom_doc.metadata.get('confidence_score', 'N/A')}")

Example output (IDs, paths, dates, and the exact set of keys vary by machine and LlamaIndex version):
=== Automatic Metadata ===
Document ID: 16f8e7c7-1234-5678-abcd-ef1234567890
Metadata keys: ['file_name', 'file_path', 'file_size', 'creation_date', 'last_modified_date']
Full metadata: {'file_name': 'sample.txt', 'file_path': '/tmp/tmpabcd1234/sample.txt', 'file_size': 35, 'creation_date': '2026-04-15', 'last_modified_date': '2026-04-15'}

=== Custom Metadata ===
Custom document metadata: {'author': 'John Doe', 'publication_date': '2024-01-15', 'source_url': 'https://arxiv.org/abs/2406.12345', 'category': 'research', 'confidence_score': 0.95}

=== Accessing Metadata Fields ===
Author: John Doe
Category: research
Confidence: 0.95
What just happened?
The code loaded documents from a file system and showed that LlamaIndex automatically attached metadata like filename, file path, and file size. Then it created a custom Document with user-defined metadata fields (author, publication_date, source_url, category, confidence_score), and demonstrated retrieving specific fields from that metadata dictionary. The metadata is stored as a plain Python dict on each Document object and is accessible at any point in your pipeline.
Common gotcha
Developers often assume metadata persists through every transformation and indexing step, but if you build your index or create custom nodes carelessly, metadata can be dropped along the way. Verify that the metadata you care about is actually present in your retrieval results rather than assuming it is. Also, SimpleDirectoryReader extracts only filesystem metadata by default; custom fields must be added manually, through the reader's file_metadata callback, or through a custom loader.
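One way to act on this advice is a small check that runs over whatever your retriever returns and reports the items that are missing fields you rely on. The sketch below is plain Python and assumes only that each result exposes a metadata dict, as LlamaIndex Document objects do. REQUIRED_KEYS, FakeResult, and find_missing_metadata are hypothetical names used for illustration, not part of LlamaIndex.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical fields this pipeline relies on at answer time
REQUIRED_KEYS = ("file_name", "category")


@dataclass
class FakeResult:
    """Stand-in for a retrieved item; only the `metadata` attribute matters here."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def find_missing_metadata(results, required=REQUIRED_KEYS):
    """Return {result index: [missing keys]} for results lacking required fields."""
    problems = {}
    for i, result in enumerate(results):
        # Treat a missing attribute, None, or {} as "no metadata"
        metadata = getattr(result, "metadata", None) or {}
        missing = [key for key in required if key not in metadata]
        if missing:
            problems[i] = missing
    return problems


results = [
    FakeResult("Intact chunk", {"file_name": "a.txt", "category": "research"}),
    FakeResult("Chunk that lost a field", {"file_name": "b.txt"}),
    FakeResult("Chunk with no metadata at all"),
]

problems = find_missing_metadata(results)
print(problems)  # {1: ['category'], 2: ['file_name', 'category']}
```

Running a check like this in a test or during ingestion surfaces dropped metadata early, before a user asks for a source you can no longer trace.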
Error recovery
KeyError when accessing metadata
Metadata is None or empty dict
Metadata lost after retrieval
Experienced dev note
The real win with metadata is filtering inside the retriever, before scoring, rather than trimming results after retrieval. Senior teams attach filters to the retriever itself, e.g. by building a MetadataFilters object that matches category == 'research' and passing it to index.as_retriever(filters=...), which can reduce hallucination by narrowing the search space before any similarity scoring happens. Also, add a 'source' or 'doc_id' field to every document at ingestion time, even if it seems redundant. Future you will be grateful when you need to trace where an answer came from.
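To make the pre-filtering idea concrete, here is a library-free sketch: candidates are first narrowed by exact-match metadata filters, and only the survivors are scored. The keyword-overlap scorer and the names Chunk, matches, score, and retrieve are illustrative assumptions, not LlamaIndex APIs; a real retriever would use embeddings and its own filter objects.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Minimal stand-in for an indexed chunk with attached metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def matches(metadata: dict, filters: dict) -> bool:
    """True when every filter key is present with exactly the given value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    text_words = set(text.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in text_words)


def retrieve(query, chunks, filters=None, top_k=2):
    # Step 1: narrow the search space using metadata only
    candidates = [c for c in chunks if matches(c.metadata, filters or {})]
    # Step 2: score and rank just the surviving candidates
    ranked = sorted(candidates, key=lambda c: score(query, c.text), reverse=True)
    return ranked[:top_k]


chunks = [
    Chunk("transformers use attention layers", {"category": "research"}),
    Chunk("quarterly report on power transformers", {"category": "reports"}),
    Chunk("attention is all you need", {"category": "research"}),
]

hits = retrieve("attention in transformers", chunks, filters={"category": "research"})
for hit in hits:
    print(hit.metadata["category"], "|", hit.text)
```

Without the filter, the report about power transformers ties on keyword overlap and takes the second slot; with the filter, it is never scored at all.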
Check your understanding
If you loaded documents from three different folders (reports, emails, research papers) and later a user asks "where did this fact come from?", how would you know which folder it came from, and what metadata would you need to have captured at load time to answer that question?
Answer hint
A correct answer explains that you'd need a 'folder' or 'source_category' field in metadata captured during load, and demonstrates either using a custom loader to add that field or manually adding it to each Document before indexing.
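One way to capture the folder at load time is a metadata function that derives a field from each file's path. SimpleDirectoryReader accepts a file_metadata callable that receives a file path and returns a dict; in current versions, supplying one appears to replace the default filesystem fields, so this sketch re-adds the ones it wants. The function name folder_metadata and the source_category key are assumptions for illustration, and the reader call appears only as a comment so the block runs without LlamaIndex installed.

```python
from pathlib import Path
import tempfile


def folder_metadata(file_path: str) -> dict:
    """Build metadata for one file, tagging it with its parent folder name."""
    path = Path(file_path)
    return {
        "file_name": path.name,
        "file_path": str(path),
        "source_category": path.parent.name,  # e.g. "reports" or "emails"
    }


with tempfile.TemporaryDirectory() as tmpdir:
    # Simulate three source folders, each holding one file
    for folder in ("reports", "emails", "research_papers"):
        target = Path(tmpdir) / folder
        target.mkdir()
        (target / "note.txt").write_text(f"A file from {folder}.")

    files = sorted(Path(tmpdir).rglob("*.txt"))
    categories = [folder_metadata(str(f))["source_category"] for f in files]

print(categories)  # ['emails', 'reports', 'research_papers']

# With LlamaIndex installed, the same function could be handed to the reader
# (hypothetical "docs" directory):
# documents = SimpleDirectoryReader(
#     input_dir="docs", recursive=True, file_metadata=folder_metadata
# ).load_data()
```

Because the folder name travels with each document from ingestion onward, a later "where did this fact come from?" question can be answered from the retrieved item's metadata alone.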
The metadata dict is the standard pattern. If you are upgrading from much older versions, update any code that relied on extra_info (the old field name).