Code Beginner easy · 4 min

What metadata gets attached to documents

What you will learn

Every LlamaIndex Document carries a metadata dictionary. File loaders fill in fields such as filename, path, and dates automatically, and you can add custom fields that your retrieval queries can later filter on.

Why this matters

When you build a RAG system, you'll want to filter search results by source file, creation date, or custom properties. Metadata is how you do that. If you don't know what metadata is available, you'll miss filtering opportunities that make your system useful in production.

Skip if: You don't need to think about metadata if you're building a prototype with a single small document or a demo where all results are equally valid. You also shouldn't over-engineer metadata extraction early: start simple and add it when filtering requirements appear.

Explanation

Metadata in LlamaIndex is structured information attached to each Document object that describes the document itself rather than its content. It includes fields a loader fills in automatically, such as the source filename, file path, and creation date (and, for some file types such as PDFs, a page label), plus any custom fields you add.

Mechanically, when you load files with SimpleDirectoryReader, LlamaIndex reads the available metadata from the file system and stores it in each document's metadata dictionary. When you create Document objects manually, nothing is extracted for you: the dictionary contains only what you pass in. You can then use this metadata during retrieval, for example to keep only documents from a specific folder or to exclude results older than a certain date.

When to use it: capture metadata at ingestion time whenever you can. It's nearly free, and it becomes invaluable when your system scales to multiple document sources or when stakeholders ask "which file did that answer come from?"
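To make the folder and date filtering concrete, here is a minimal sketch in plain Python. It works on stand-in metadata dicts shaped like the ones SimpleDirectoryReader produces in this lesson (file_path, and last_modified_date as a 'YYYY-MM-DD' string). The paths and the helper names in_folder and modified_on_or_after are made up for illustration; with real documents you would read the same dicts from doc.metadata.

```python
from datetime import date
from pathlib import PurePosixPath

# Stand-in metadata dicts (hypothetical paths and dates).
# With LlamaIndex you would collect these from doc.metadata instead.
docs_metadata = [
    {"file_path": "/data/reports/q1.txt", "last_modified_date": "2026-04-15"},
    {"file_path": "/data/emails/hello.txt", "last_modified_date": "2023-01-02"},
    {"file_path": "/data/reports/old.txt", "last_modified_date": "2021-06-30"},
]


def in_folder(metadata: dict, folder: str) -> bool:
    """True if the file sits anywhere under `folder`."""
    path = metadata.get("file_path")
    if path is None:
        return False
    return PurePosixPath(folder) in PurePosixPath(path).parents


def modified_on_or_after(metadata: dict, cutoff: date) -> bool:
    """True if last_modified_date (ISO 'YYYY-MM-DD') is not older than `cutoff`."""
    raw = metadata.get("last_modified_date")
    if raw is None:
        return False
    return date.fromisoformat(raw) >= cutoff


# Keep only reports modified in 2024 or later.
recent_reports = [
    m for m in docs_metadata
    if in_folder(m, "/data/reports") and modified_on_or_after(m, date(2024, 1, 1))
]
print([m["file_path"] for m in recent_reports])
```

Both helpers treat a missing key as "no match" rather than raising, which is usually the safer default when documents come from mixed sources.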

Analogy

Think of metadata like the label on a physical file folder. The label tells you the folder's origin (who created it), date (when), category (project name), and other properties, but not what's inside. When you need to find something, you can quickly narrow down which folders to search based on the label before even opening them.

Code

python

from pathlib import Path
import tempfile

from llama_index.core import SimpleDirectoryReader, Document

# Create temporary directory with sample files
with tempfile.TemporaryDirectory() as tmpdir:
    # Write sample files
    sample_file = Path(tmpdir) / "sample.txt"
    sample_file.write_text("This is a sample document about AI.")
    
    # Load documents with automatic metadata extraction
    reader = SimpleDirectoryReader(tmpdir)
    documents = reader.load_data()
    
    # Inspect metadata from the loaded document
    doc = documents[0]
    print("=== Automatic Metadata ===")
    print(f"Document ID: {doc.id_}")
    print(f"Metadata keys: {list(doc.metadata.keys())}")
    print(f"Full metadata: {doc.metadata}")
    print()
    
    # Create a document with custom metadata
    print("=== Custom Metadata ===")
    custom_doc = Document(
        text="This is a research paper on transformers.",
        metadata={
            "author": "John Doe",
            "publication_date": "2024-01-15",
            "source_url": "https://arxiv.org/abs/2406.12345",
            "category": "research",
            "confidence_score": 0.95
        }
    )
    print(f"Custom document metadata: {custom_doc.metadata}")
    print()
    
    # Demonstrate accessing specific metadata fields
    print("=== Accessing Metadata Fields ===")
    print(f"Author: {custom_doc.metadata.get('author', 'Unknown')}")
    print(f"Category: {custom_doc.metadata.get('category', 'Uncategorized')}")
    print(f"Confidence: {custom_doc.metadata.get('confidence_score', 'N/A')}")

Output

=== Automatic Metadata ===
Document ID: 16f8e7c7-1234-5678-abcd-ef1234567890
Metadata keys: ['file_name', 'file_path', 'file_size', 'creation_date', 'last_modified_date']
Full metadata: {'file_name': 'sample.txt', 'file_path': '/tmp/tmpabcd1234/sample.txt', 'file_size': 35, 'creation_date': '2026-04-15', 'last_modified_date': '2026-04-15'}

=== Custom Metadata ===
Custom document metadata: {'author': 'John Doe', 'publication_date': '2024-01-15', 'source_url': 'https://arxiv.org/abs/2406.12345', 'category': 'research', 'confidence_score': 0.95}

=== Accessing Metadata Fields ===
Author: John Doe
Category: research
Confidence: 0.95

What just happened?

The code loaded a document from the file system and showed that LlamaIndex automatically attached metadata like filename, file path, and file size. Your document ID, temporary path, and dates will differ from the sample output, and the exact set of automatic keys depends on your llama-index version (recent versions also report a file_type, for example). Then the code created a custom Document with user-defined metadata fields (author, publication_date, source_url, category, confidence_score) and retrieved specific fields from that metadata dictionary. The metadata is stored as a plain Python dict on each Document object and is accessible at any point in your pipeline.

Common gotcha

Developers often assume metadata persists through every transformation and indexing step. Nodes produced by LlamaIndex's built-in splitters normally inherit their source document's metadata, but nodes you construct yourself, or custom transformations, can drop it. Always verify that the metadata you care about is present in your retrieval results instead of assuming it's there. Also, SimpleDirectoryReader only extracts filesystem metadata by default; custom fields must be added manually, through a file_metadata function passed to the reader, or through a custom loader.
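One way to add custom fields at load time is the file_metadata argument of SimpleDirectoryReader, which takes a function that receives a file path string and returns a metadata dict. The sketch below shows such a function on its own so it runs without the library installed. The source_category field name and the corpus folder are hypothetical. As far as I know, supplying file_metadata replaces the default filesystem fields rather than adding to them, so the function re-creates the ones it wants to keep; check this against your installed version.

```python
from pathlib import Path


def folder_metadata(file_path: str) -> dict:
    """Build the metadata dict for one file (field names are this sketch's choice)."""
    path = Path(file_path)
    return {
        "file_name": path.name,
        "file_path": file_path,
        "source_category": path.parent.name,  # hypothetical custom field: the containing folder
    }


print(folder_metadata("corpus/reports/q1.txt"))

# Hooking it into the loader (requires llama-index, so left as a comment):
# from llama_index.core import SimpleDirectoryReader
# reader = SimpleDirectoryReader("corpus", recursive=True, file_metadata=folder_metadata)
# documents = reader.load_data()
```

Deriving the category from the parent folder name keeps ingestion code free of per-file configuration, at the cost of tying the field to your directory layout.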

Error recovery

KeyError when accessing metadata

You tried to access a metadata field that doesn't exist (e.g., doc.metadata['author'] when 'author' wasn't set). Use doc.metadata.get('author', 'default_value') instead to safely handle missing keys.

Metadata is None or empty dict

The loader you used doesn't extract metadata automatically, or you created the Document without a metadata argument (it then defaults to an empty dict). Check that you're using SimpleDirectoryReader or another loader that supports metadata extraction. For custom documents, pass a metadata dict to the Document constructor.

Metadata lost after retrieval

Custom transformations, or nodes you build by hand, can drop metadata. When building a retrieval pipeline, inspect the metadata field on retrieved nodes, for example with print(retrieved_node.metadata), to confirm it's still there.
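The check can be automated with a small helper. This is a sketch, not a LlamaIndex API: missing_metadata and REQUIRED_KEYS are hypothetical names, and the SimpleNamespace objects stand in for retrieved nodes. The helper only assumes each result exposes a metadata attribute, as retrieved nodes do.

```python
from types import SimpleNamespace

REQUIRED_KEYS = {"file_name", "category"}  # hypothetical: the fields your app relies on


def missing_metadata(nodes, required=REQUIRED_KEYS):
    """Return {position: sorted missing keys} for nodes lacking required metadata."""
    problems = {}
    for i, node in enumerate(nodes):
        metadata = getattr(node, "metadata", None) or {}
        missing = sorted(set(required) - set(metadata))
        if missing:
            problems[i] = missing
    return problems


# Stand-ins for retrieved nodes; in a real pipeline pass the retriever's results.
retrieved = [
    SimpleNamespace(metadata={"file_name": "a.txt", "category": "research"}),
    SimpleNamespace(metadata={"file_name": "b.txt"}),
    SimpleNamespace(metadata=None),
]
print(missing_metadata(retrieved))
```

Running a check like this in a test or a debug path catches dropped fields early, before a filter silently starts matching nothing.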

Experienced dev note

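The real win with metadata is filtering inside the retriever, not after it. Senior teams attach metadata filters to the retriever itself, for example index.as_retriever(filters=MetadataFilters(filters=[ExactMatchFilter(key='category', value='research')])) with both classes imported from llama_index.core.vector_stores, so the search space is narrowed before anything is scored. (retrieve() itself takes only the query; the filters belong to the retriever, and support depends on your vector store.) This can reduce hallucination by keeping off-topic chunks out of the context. Also, add a 'source' or 'doc_id' field to every document at ingestion time, even if it seems redundant. Future you will be grateful when you need to trace where an answer came from.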
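The idea of narrowing the search space before scoring can be illustrated with a toy retriever in plain Python. This is not how LlamaIndex implements filtering; the retrieve function, the word-overlap score, and the sample records are all invented for the sketch. It only shows the order of operations: filter on metadata first, score what remains.

```python
def retrieve(query, records, filters=None, top_k=2):
    """Toy retriever: apply exact-match metadata filters, then rank by word overlap."""
    filters = filters or {}
    # Step 1: drop every record whose metadata does not match all filters.
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in filters.items())
    ]
    # Step 2: score only the surviving candidates.
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(r["text"].lower().split())), r)
        for r in candidates
    ]
    scored = [(score, r) for score, r in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_k]]


records = [
    {"text": "transformers use attention layers",
     "metadata": {"category": "research", "doc_id": "paper-1"}},
    {"text": "quarterly report on attention metrics",
     "metadata": {"category": "reports", "doc_id": "report-7"}},
    {"text": "attention is all you need",
     "metadata": {"category": "research", "doc_id": "paper-2"}},
]

hits = retrieve("attention layers", records, filters={"category": "research"})
print([h["metadata"]["doc_id"] for h in hits])
```

Because each record carries a doc_id, every hit can be traced back to its source, which is the second habit the note recommends.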

Check your understanding

If you loaded documents from three different folders (reports, emails, research papers) and later a user asks "where did this fact come from?", how would you know which folder it came from, and what metadata would you need to have captured at load time to answer that question?

Answer hint

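A correct answer explains that the folder must be recoverable from metadata captured during load: either from the file_path that SimpleDirectoryReader records automatically, or from an explicit 'folder' or 'source_category' field. It also shows how to add that field, either with a custom loader or by setting it manually on each Document before indexing.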

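VERSION In llama-index < 0.9.0, metadata was accessed differently and some loaders behaved differently. From 0.9.0 through 0.12.x, the metadata dict is the standard pattern. If you are upgrading from a much older version, update any code that relies on extra_info (the old field name). The llama_index.core import path used in the code above requires 0.10.0 or later; earlier releases import the same names from llama_index directly.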

Next, you'll learn how to filter and query documents using this metadata during retrieval to get more relevant results.
