Code Beginner easy · 5 min

Loading PDFs, text files, and web pages

What you will learn

Use LlamaIndex's document loaders to ingest PDFs, text files, and web content into in-memory Document objects.

Why this matters

Every RAG pipeline starts with loading unstructured data. If you can't reliably ingest your source material, retrieval fails at the foundation. Choosing the right loader for each format prevents downstream data-quality issues.

Skip if: You need real-time document updates or streaming ingestion. SimpleDirectoryReader loads a one-time snapshot, so use readers with polling logic or live API connectors instead. Also avoid the web loaders for authenticated or paywalled content unless you handle credentials properly.

Explanation

Document loaders in LlamaIndex are readers that convert raw files (PDFs, .txt, HTML) into standardized Document objects that the framework understands. Mechanically, a loader reads bytes from disk or the network, parses the content (for example, extracting the text of each PDF page), and attaches metadata such as the file name or page number. Loaders do not chunk; splitting documents into smaller pieces is a later step. SimpleDirectoryReader is the all-purpose loader: it picks a parser from each file's extension and falls back to reading the file as plain text. PDF parsing relies on an optional dependency (pypdf). For web content, SimpleWebPageReader fetches a URL; by default it returns the raw HTML, and with html_to_text=True it converts the page to text (this needs the html2text package). Once loaded, documents are in-memory Python objects ready for embedding and indexing.
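The read, parse, wrap sequence described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LlamaIndex's implementation: SketchDocument and load_text_file are hypothetical names, and the real Document class carries more fields and behavior.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> SketchDocument:
    """Read bytes, decode them to text, and wrap the result with metadata."""
    with open(path, "rb") as f:
        raw = f.read()                                  # 1. read bytes from disk
    text = raw.decode("utf-8")                          # 2. parse (plain text only needs decoding)
    metadata = {"file_name": os.path.basename(path)}    # 3. attach metadata
    return SketchDocument(text=text, metadata=metadata)


# Usage: write a throwaway file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "example.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("hello loader")
    sketch_doc = load_text_file(sample_path)

print(sketch_doc.metadata["file_name"], len(sketch_doc.text))
```

A PDF or HTML loader would differ only in step 2, where decoding is replaced by a format-specific parser.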

Analogy

Think of loaders as postal sorters: raw mail (files) arrives, gets sorted by type (PDF vs text vs web), processed appropriately (extract text, remove formatting), and placed in standardized envelopes (Document objects) that the rest of the mail system (embedding, storage) knows how to handle.

Code

python

import os
from llama_index.core import SimpleDirectoryReader       # pip install llama-index-core
from llama_index.readers.web import SimpleWebPageReader  # pip install llama-index-readers-web

# Create sample files for demonstration
os.makedirs('sample_docs', exist_ok=True)
with open('sample_docs/example.txt', 'w') as f:
    f.write('This is a sample text file.\nIt contains multiple lines.\nLlamaIndex will parse it.')

with open('sample_docs/notes.txt', 'w') as f:
    f.write('Project notes:\n- Task 1: Complete\n- Task 2: In progress\n- Task 3: Pending')

# Load all text files from a directory
print('=== Loading text files from directory ===')
reader = SimpleDirectoryReader(input_dir='sample_docs')
documents = reader.load_data()
print(f'Loaded {len(documents)} documents')
for doc in documents:
    print(f'  - {doc.metadata.get("file_name")}: {len(doc.get_content())} characters')

# Load a single web page
print('\n=== Loading web page ===')
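# html_to_text=True converts the HTML to text (requires the html2text package);
# the default returns raw HTML
web_reader = SimpleWebPageReader(html_to_text=True)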
web_docs = web_reader.load_data(urls=['https://en.wikipedia.org/wiki/Artificial_intelligence'])
print(f'Loaded {len(web_docs)} web document(s)')
for doc in web_docs:
    content_preview = doc.get_content()[:100].replace('\n', ' ')
    print(f'  - Content preview: {content_preview}...')

# Inspect document structure
print('\n=== Document structure ===')
if documents:
    doc = documents[0]
    print(f'Type: {type(doc)}')
    print(f'Metadata keys: {list(doc.metadata.keys())}')
    print(f'Content length: {len(doc.get_content())} characters')
    print(f'First 80 chars: {doc.get_content()[:80]}')

# Cleanup
import shutil
shutil.rmtree('sample_docs')

Output

=== Loading text files from directory ===
Loaded 2 documents
  - example.txt: 81 characters
  - notes.txt: 73 characters

=== Loading web page ===
Loaded 1 web document(s)
  - Content preview: (first 100 characters of the converted page text; varies with the live page)...

=== Document structure ===
Type: <class 'llama_index.core.schema.Document'>
Metadata keys: ['file_path', 'file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date']
Content length: 81 characters
First 80 chars: This is a sample text file.
It contains multiple lines.
LlamaIndex will parse it

What just happened?

SimpleDirectoryReader scanned 'sample_docs/', found two .txt files, read their bytes, parsed them as plain text, created Document objects with file metadata (name, path, type, size, dates), and returned them as a list. SimpleWebPageReader fetched the Wikipedia URL, converted the HTML to text, and returned one Document for the page, using the URL as the document ID (recent versions also store it in metadata). The final inspection showed the internal Document structure: a wrapper containing a content string, a metadata dict, and helper methods.

Common gotcha

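SimpleDirectoryReader tries to load every non-hidden file in a directory, not just the ones you care about: if PDFs, .docx, and .txt files are mixed together, it parses them all, and files with unrecognized extensions are read as plain text. By default, a parsing error on one file does not stop the run; the file is skipped with only a printed message, which is easy to miss. Document count alone is also a weak signal, because some formats (PDFs, for example) yield one Document per page. Validate every load by comparing the file names in the returned metadata against the files on disk and by sampling content. Pass required_exts to restrict the file types, or raise_on_error=True (available in recent releases) to fail fast.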
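One way to run that check is to compare file names on disk with the names reported in document metadata. The sketch below is a minimal illustration: find_unloaded_files is a hypothetical helper, it only looks at the top level of the directory, and the demo fakes the loader's result so it runs without LlamaIndex installed.

```python
import os
import tempfile


def find_unloaded_files(input_dir: str, loaded_names: set) -> list:
    """Return names of regular, non-hidden files in input_dir missing from loaded_names."""
    on_disk = {
        name
        for name in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, name)) and not name.startswith(".")
    }
    return sorted(on_disk - loaded_names)


# Demo with a stand-in for the loader's result. With real documents the set would be:
#   {doc.metadata.get("file_name") for doc in documents}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "broken.pdf"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("placeholder")
    pretend_loaded = {"a.txt", "b.txt"}  # assumption: broken.pdf failed to parse
    missing = find_unloaded_files(tmp, pretend_loaded)

print(missing)
```

Comparing names rather than counts keeps the check valid even when one file produces several Document objects.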

Error recovery

ValueError: Directory ... does not exist

The input directory path doesn't exist. Recent releases raise ValueError here rather than FileNotFoundError. Verify that the path is correct relative to your current working directory; os.path.abspath() shows the actual path being used.
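A small pre-flight check makes the resolved path visible before the reader is constructed. This is a sketch only; resolve_input_dir is a hypothetical helper and not part of LlamaIndex.

```python
import os


def resolve_input_dir(path: str) -> str:
    """Return the absolute form of path, raising early if it is not a directory."""
    absolute = os.path.abspath(path)
    if not os.path.isdir(absolute):
        # Include the working directory so relative-path mistakes are obvious.
        raise FileNotFoundError(f"Not a directory: {absolute} (cwd: {os.getcwd()})")
    return absolute


checked_dir = resolve_input_dir(".")  # the current directory always exists
print(checked_dir)
```

The returned absolute path can then be passed as input_dir.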

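ConnectionError or timeout on web load

The network request failed or the page is unreachable. SimpleWebPageReader fetches pages with the requests library, so failures usually surface as requests exceptions rather than urllib's URLError. Check internet connectivity, verify the URL is correct, and wrap the call in timeout and retry logic for production use.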
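Retry logic can live in a small wrapper around the load call. The sketch below is illustrative: with_retries and flaky_fetch are hypothetical names, the backoff is a simple linear delay, and the demo simulates failures so it runs offline.

```python
import time


def with_retries(fetch, attempts: int = 3, delay_seconds: float = 1.0):
    """Call fetch() until it succeeds or attempts run out, then re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # in real code, catch only the network errors you expect
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds * attempt)  # linear backoff: 1x, 2x, ...
    raise last_error


# Demo with a stand-in that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return ["page text"]


result = with_retries(flaky_fetch, attempts=3, delay_seconds=0.01)
print(result, calls["count"])
```

With the reader from this lesson, the wrapped call would look like with_retries(lambda: web_reader.load_data(urls=[url])).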

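Empty documents list

Every file in the directory failed to parse or was filtered out. (A directory with no loadable files at all typically raises ValueError: No files found instead of returning an empty list.) Check file extensions (for example .txt, .pdf, .md), any required_exts filter, and that file permissions allow reading.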

Experienced dev note

The metadata dict on each Document is your friend for production RAG: file_name and url are auto-populated, but you can add custom metadata before indexing (doc.metadata['source_system'] = 'crm'). This metadata survives all the way through retrieval, so you can trace which source document a retrieved chunk came from. Do this early; it's harder to add metadata retroactively after indexing.
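Custom metadata can also be attached at load time through a per-file function. The sketch below defines such a function in plain Python; crm_file_metadata and the source_system field are hypothetical, and the wiring shown in the final comment assumes the file_metadata parameter of SimpleDirectoryReader found in recent releases.

```python
import os


def crm_file_metadata(file_path: str) -> dict:
    """Build per-file metadata; 'source_system' is a hypothetical custom field."""
    return {
        # Keep the file name explicitly: a custom function replaces the default metadata.
        "file_name": os.path.basename(file_path),
        "source_system": "crm",
    }


print(crm_file_metadata(os.path.join("sample_docs", "notes.txt")))

# Assumed wiring (not executed here):
# reader = SimpleDirectoryReader(input_dir="sample_docs", file_metadata=crm_file_metadata)
```

Setting the fields at load time means every Document carries them before any chunking or indexing happens.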

Check your understanding

If you loaded 50 files but only got 30 documents back, what likely happened and how would you diagnose it?

Answer hint

A correct answer should mention that SimpleDirectoryReader skips files it cannot parse without raising an error (corrupted files, unsupported formats, missing parser dependencies). Diagnosis involves comparing the file names in the returned metadata against the files on disk to see which ones are missing, checking the console for skip messages, or reducing the directory to a single test file.

VERSION The llama_index.core namespace arrived in llama-index 0.10.0. Before that, the reader was imported as 'from llama_index import SimpleDirectoryReader'. From 0.10.0 on it lives in llama_index.core, and SimpleWebPageReader ships in the separate llama-index-readers-web package (imported from llama_index.readers.web). Avoid pre-0.10 import patterns in new code.

Now that you can load documents, learn how to split them into chunks: documents often exceed context windows, so chunking strategies determine what actually gets embedded and retrieved.
