Loading PDFs, text files, and web pages
Why this matters
Every RAG pipeline starts with loading unstructured data. If you can't reliably ingest your source material, retrieval fails at the foundation. Choosing the right loader for each format prevents data quality issues downstream.
Explanation
Document loaders in LlamaIndex are specialized readers that convert raw files (PDFs, .txt, HTML) into standardized Document objects that the framework understands. Mechanically, a loader reads bytes from disk or the network, parses the content (for example, extracting text from each PDF page or converting HTML to text), and attaches metadata such as the file name, page number, or URL to each piece. SimpleDirectoryReader is the all-purpose loader: it detects file types by extension and routes each file to the matching parser. For web content, SimpleWebPageReader fetches a page over HTTP; pass html_to_text=True if you want readable text rather than raw HTML. Once loaded, documents are in-memory Python objects ready for embedding and indexing.
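To make the read, parse, wrap sequence concrete, here is a minimal standard-library sketch. It is not how LlamaIndex implements its readers: ToyDocument, PARSERS, and toy_load are hypothetical names invented for illustration, and the HTML handling is deliberately naive.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from pathlib import Path


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def parse_text(raw: bytes) -> str:
    return raw.decode("utf-8")


def parse_html(raw: bytes) -> str:
    extractor = _TextExtractor()
    extractor.feed(raw.decode("utf-8"))
    extractor.close()
    return " ".join(extractor.parts)


# Extension -> parser routing table (illustrative; real readers cover far more formats).
PARSERS = {".txt": parse_text, ".html": parse_html}


def toy_load(path: str) -> ToyDocument:
    """Read bytes, pick a parser by file extension, wrap the text with metadata."""
    file_path = Path(path)
    parser = PARSERS.get(file_path.suffix.lower(), parse_text)  # fall back to plain text
    raw = file_path.read_bytes()
    return ToyDocument(text=parser(raw), metadata={"file_name": file_path.name})
```

The routing table is the part worth noticing: adding a format means registering one more parser, which is roughly the role the per-format readers play behind SimpleDirectoryReader.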
Analogy
Think of loaders as postal sorters: raw mail (files) arrives, gets sorted by type (PDF vs text vs web), processed appropriately (extract text, remove formatting), and placed in standardized envelopes (Document objects) that the rest of the mail system (embedding, storage) knows how to handle.
Code
import os
import shutil

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.web import SimpleWebPageReader

# Create sample files for demonstration
os.makedirs('sample_docs', exist_ok=True)
with open('sample_docs/example.txt', 'w') as f:
    f.write('This is a sample text file.\nIt contains multiple lines.\nLlamaIndex will parse it.')
with open('sample_docs/notes.txt', 'w') as f:
    f.write('Project notes:\n- Task 1: Complete\n- Task 2: In progress\n- Task 3: Pending')

# Load all text files from a directory
print('=== Loading text files from directory ===')
reader = SimpleDirectoryReader(input_dir='sample_docs')
documents = reader.load_data()
print(f'Loaded {len(documents)} documents')
for doc in documents:
    print(f' - {doc.metadata.get("file_name")}: {len(doc.get_content())} characters')

# Load a single web page. html_to_text=True converts the HTML to plain text
# (requires the html2text package); the default returns raw HTML.
print('\n=== Loading web page ===')
web_reader = SimpleWebPageReader(html_to_text=True)
web_docs = web_reader.load_data(urls=['https://en.wikipedia.org/wiki/Artificial_intelligence'])
print(f'Loaded {len(web_docs)} web document(s)')
for doc in web_docs:
    content_preview = doc.get_content()[:100].replace('\n', ' ')
    print(f' - Content preview: {content_preview}...')

# Inspect document structure
print('\n=== Document structure ===')
if documents:
    doc = documents[0]
    print(f'Type: {type(doc)}')
    print(f'Metadata keys: {list(doc.metadata.keys())}')
    print(f'Content length: {len(doc.get_content())} characters')
    print(f'First 80 chars: {doc.get_content()[:80]}')

# Cleanup
shutil.rmtree('sample_docs')
Output (representative: the web preview depends on the live page, and the metadata keys on your LlamaIndex version)
=== Loading text files from directory ===
Loaded 2 documents
 - example.txt: 81 characters
 - notes.txt: 73 characters

=== Loading web page ===
Loaded 1 web document(s)
 - Content preview: Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the natural intell...

=== Document structure ===
Type: <class 'llama_index.core.schema.Document'>
Metadata keys: ['file_path', 'file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date']
Content length: 81 characters
First 80 chars: This is a sample text file.
It contains multiple lines.
LlamaIndex will parse it
What just happened?
SimpleDirectoryReader scanned sample_docs/, found two .txt files, read them as plain text, wrapped each one in a Document with file metadata (including file_name), and returned them as a list. SimpleWebPageReader fetched the Wikipedia URL and produced one Document for it. Converting the HTML to readable text requires html_to_text=True, and depending on the version the URL is recorded as the document ID, in the metadata, or both. The final inspection showed the Document structure: a wrapper around a content string, a metadata dict, and helper methods.
Common gotcha
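SimpleDirectoryReader tries to load every file in the directory: if PDFs, .docx, and .txt files are mixed together, it attempts all of them. By default, a parsing error on one file does not stop the process; the file is skipped with at most a printed warning that is easy to miss. You won't know a file failed unless you check what came back. Always validate the load by comparing the returned documents against the files you expected and by sampling content. Raw counts alone can mislead, because some formats (PDFs, for example) yield one Document per page. Recent versions also accept raise_on_error=True, which raises on a failed file instead of skipping it.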
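One way to run that validation is to compare file names on disk with the file_name values in the returned metadata. The helper below is a sketch, not part of LlamaIndex: find_unloaded_files is a hypothetical name, it only looks at the top level of the directory, and it assumes each document exposes a metadata dict with a file_name key, as the documents in the example above do.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def find_unloaded_files(input_dir, documents):
    """Return sorted names of files in input_dir that no document claims
    through metadata['file_name'] (top level only, no recursion)."""
    on_disk = {p.name for p in Path(input_dir).iterdir() if p.is_file()}
    loaded = {doc.metadata.get("file_name") for doc in documents}
    return sorted(on_disk - loaded)


if __name__ == "__main__":
    # Stand-in objects so the sketch runs without LlamaIndex installed;
    # in practice, pass the list returned by reader.load_data().
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.txt", "broken.pdf"):
            Path(tmp, name).write_text("placeholder")
        fake_docs = [SimpleNamespace(metadata={"file_name": "a.txt"})]
        print(find_unloaded_files(tmp, fake_docs))
```

Comparing names rather than counts also handles formats that produce several documents per file, since duplicates collapse in the set.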
Error recovery
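The failure modes named in this section (a missing path, a network error or timeout, and an empty result) can be handled by one small wrapper around any loading call. This is a sketch under assumptions: load_with_recovery is a hypothetical helper, and the exception types a given reader actually raises depend on the reader and its HTTP client, so adjust the except clauses to what you observe.

```python
import time
from urllib.error import URLError


def load_with_recovery(load, retries=2, delay_seconds=1.0):
    """Call load() and turn common failure modes into clear outcomes.

    load is any zero-argument callable that returns a list of documents.
    """
    for attempt in range(retries + 1):
        try:
            documents = load()
        except FileNotFoundError as exc:
            # A missing path will not fix itself, so fail immediately with context.
            raise FileNotFoundError(f"Input path not found: {exc}") from exc
        except (URLError, TimeoutError, ConnectionError) as exc:
            # Network problems are often transient: retry, then give up.
            if attempt == retries:
                raise RuntimeError(
                    f"Web load failed after {retries + 1} attempts"
                ) from exc
            time.sleep(delay_seconds)
            continue
        if not documents:
            # An empty list is a failure too, even though nothing was raised.
            raise ValueError(
                "Loader returned no documents; check the path, filters, and file types"
            )
        return documents
```

Usage would look like load_with_recovery(lambda: reader.load_data()), which keeps the retry policy separate from the choice of reader.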
FileNotFoundError
URLError or connection timeout on web load
Empty documents list
Experienced dev note
The metadata dict on each Document is your friend for production RAG: fields such as file_name are populated automatically, and you can add custom metadata before indexing (doc.metadata['source_system'] = 'crm'). This metadata is carried through to retrieval, so you can trace which source document a retrieved chunk came from. Do this early; adding metadata after indexing is harder.
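Tagging is just a dict update, so it is easy to apply uniformly right after loading. The sketch below uses a hypothetical helper, tag_documents, and stand-in objects in place of real Document instances; it assumes only that each document has a mutable metadata dict.

```python
from types import SimpleNamespace


def tag_documents(documents, **extra_metadata):
    """Merge extra key/value pairs into each document's metadata dict, in place."""
    for doc in documents:
        doc.metadata.update(extra_metadata)
    return documents


if __name__ == "__main__":
    # Stand-ins for loaded documents; in practice use the list from load_data().
    docs = [SimpleNamespace(metadata={"file_name": "notes.txt"})]
    tag_documents(docs, source_system="crm")
    print(docs[0].metadata)  # {'file_name': 'notes.txt', 'source_system': 'crm'}
```

Running this once per source, immediately after the load and before any index is built, keeps the provenance tags consistent across every chunk derived from those documents.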
Check your understanding
If you loaded 50 files but only got 30 documents back, what likely happened and how would you diagnose it?
Answer hint
A correct answer should mention that SimpleDirectoryReader can skip files it fails to parse (corrupted files, unsupported formats) without raising an error. To diagnose, compare the returned documents against the files on disk, inspect the metadata to see which files were loaded, or reduce the directory to a single test file.