What document loaders do
Why this matters
Most real-world LLM applications need to ingest external data: documents, databases, APIs. Document loaders are the standardized bridge between those sources and your LLM chain. Without them, you end up parsing files and building Document objects by hand, which is error-prone and duplicates work LangChain already does for you.
Explanation
What it is: A document loader is a LangChain class that reads files or content from a source and converts them into Document objects: lightweight Python objects with a page_content string and optional metadata dict. Examples: PyPDFLoader, TextLoader, WebBaseLoader.
How it works: You instantiate a loader with a file path or URL, call .load() to read and parse, and get back a list of Document objects. Granularity depends on the loader: TextLoader returns one Document per file, while PyPDFLoader returns one per page. Loaders do not chunk text for you. The metadata automatically captures the source, page numbers, or other context, which matters later for retrieval and citations.
When to use it: Use document loaders as the first step in any RAG (Retrieval-Augmented Generation) pipeline. Load your documents once, split them with a text splitter, embed them, and store in a vector database. Loaders handle the messy parsing work so your chain sees clean, standardized Document objects.
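To make the load contract described above concrete, here is a minimal standard-library sketch of what a text loader does. SimpleDocument and SimpleTextLoader are hypothetical stand-ins written for illustration, not LangChain classes; the real Document and TextLoader do more, but the shape of the result is the same: a list of objects with page_content and metadata.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical stand-in for a text loader: one file in, a list of documents out."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list[SimpleDocument]:
        text = Path(self.file_path).read_text(encoding=self.encoding)
        # Always a list, even for a single file, with the source recorded in metadata.
        return [SimpleDocument(page_content=text, metadata={"source": self.file_path})]


# Write a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "sample.txt")
    Path(path).write_text("LangChain is a framework for LLM apps.", encoding="utf-8")

    documents = SimpleTextLoader(path).load()
    first = documents[0]
    print(len(documents), first.metadata)
```

The point of the sketch is the return type: downstream steps (splitting, embedding, storing) can iterate over the list without caring whether it came from a text file, a PDF, or a web page.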
Analogy
A document loader is like a mail sorter at a post office: it takes messy, varied input (letters in envelopes, packages, different sizes) and converts it to a standardized format (all mail sorted into bins with labels). Your downstream system (the delivery trucks: your LLM) doesn't care what the original format was; it just processes standardized packages.
Code
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

# Load the file: .load() returns a list, even for a single file
loader = TextLoader('sample.txt')
documents = loader.load()

print(f'Number of documents: {len(documents)}')
print(f'First document type: {type(documents[0])}')
print(f'Page content length: {len(documents[0].page_content)}')
print(f'Metadata: {documents[0].metadata}')

# Split into chunks of at most 200 characters, with up to 50 characters of overlap
splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)

print(f'\nNumber of chunks after split: {len(chunks)}')
print(f'First chunk:\n{chunks[0].page_content}')
print(f'Chunk metadata: {chunks[0].metadata}')

Output
Number of documents: 1
First document type: <class 'langchain_core.documents.base.Document'>
Page content length: 512
Metadata: {'source': 'sample.txt'}
Number of chunks after split: 4
First chunk:
LangChain is a framework for building applications with large language models. It provides abstractions and tools to simplify the development of LLM-based systems. Document loaders
Chunk metadata: {'source': 'sample.txt'}

What just happened?
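The TextLoader opened 'sample.txt', read its content, and returned a single Document with the file's text in page_content and the file path in metadata. The CharacterTextSplitter then broke that one Document into 4 smaller chunks (at most 200 characters each, with up to 50 characters of overlap between neighbors) and copied the metadata onto each chunk, so every chunk still knows its source file. Keep in mind that CharacterTextSplitter only cuts at its separator, a blank line ('\n\n') by default; text with no blank lines comes back as one oversized chunk unless you pass a different separator or use RecursiveCharacterTextSplitter.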
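The interplay of chunk size, overlap, and metadata copying can be sketched with plain string slicing. This is a deliberately simplified fixed-window splitter, not how CharacterTextSplitter works internally (that class splits on a separator and merges the pieces), and the names Chunk and split_fixed are hypothetical. It only illustrates what "200 characters with 50 overlap" means and why each chunk keeps its source.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document produced by a splitter."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_fixed(text: str, metadata: dict, chunk_size: int, chunk_overlap: int) -> list[Chunk]:
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        # Copy the metadata so every chunk keeps its own record of the source.
        chunks.append(Chunk(text[start:start + chunk_size], dict(metadata)))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = ("abcdefghij" * 52)[:512]  # placeholder for a 512-character file
chunks = split_fixed(text, {"source": "sample.txt"}, chunk_size=200, chunk_overlap=50)
print(len(chunks), [len(c.page_content) for c in chunks])
```

With a 512-character text, the windows start at offsets 0, 150, 300, and 450, so the last 50 characters of each chunk reappear at the start of the next one. A separator-based splitter will place its boundaries differently, because it avoids cutting in the middle of a piece.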
Common gotcha
Developers assume document loaders automatically chunk text: they don't. .load() returns one Document per file (or per page for PDFs), not per semantic chunk. You must explicitly call a text splitter afterward if you want chunks. This is intentional (metadata is cleaner before splitting), but it surprises people coming from simple file-reading patterns. Also: loaders return a list of Documents, not a single Document, even for one file: always access index 0 or iterate.
Error recovery
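FileNotFoundError
Fix: the path passed to the loader does not exist. Check the working directory or use an absolute path.
ModuleNotFoundError: No module named 'pypdf'
Fix: PyPDFLoader depends on the pypdf package. Install it with pip install pypdf.
ImportError: cannot import name 'TextLoader' from 'langchain_community'
Fix: import from the submodule instead: from langchain_community.document_loaders import TextLoader.
AttributeError: 'Document' object has no attribute 'content'
Fix: the text lives in page_content, not content.
Experienced dev note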
Document loaders are stateless factories: they don't cache or maintain state. If you load the same file twice, the file is read and parsed twice and you get fresh Document objects. This is good for reliability but bad for performance if you're loading gigabytes of documents in a loop. Load once, keep the Document list in memory or in a vector store, and reuse it. Also: metadata is your friend. Set custom metadata (author, category, date) on documents before embedding; it is stored alongside each chunk, comes back with retrieval results, and is useful for filtering, debugging, and citations. Don't rely on the filename alone in production.
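As a rough illustration of "load once, reuse" and of attaching custom metadata, the sketch below caches file reads with functools.lru_cache and builds a document-like dict. The helper names load_text and to_record are hypothetical, and the plain dicts stand in for Document objects; with LangChain you would set the same fields on each document's metadata dict after loading.

```python
from functools import lru_cache
from pathlib import Path
import tempfile


@lru_cache(maxsize=None)
def load_text(path: str) -> str:
    """Read a file once; repeated calls with the same path hit the in-memory cache."""
    return Path(path).read_text(encoding="utf-8")


def to_record(path: str, **extra_metadata) -> dict:
    """Build a document-like dict with the source plus any custom metadata fields."""
    return {
        "page_content": load_text(path),
        "metadata": {"source": path, **extra_metadata},
    }


with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "policy.txt")
    Path(path).write_text("Refunds are processed within 14 days.", encoding="utf-8")

    first = to_record(path, author="legal-team", category="policy")
    second = to_record(path, author="legal-team", category="policy")

info = load_text.cache_info()
print(info.misses, info.hits, first["metadata"]["category"])
```

One trade-off of this kind of cache is that it does not notice when the file changes on disk; call load_text.cache_clear() (or key the cache on the file's modification time) when the source is updated.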
Check your understanding
Why would you NOT split documents immediately after loading them, even if you know your vector store expects 300-token chunks? What problem could that solve?
Answer hint
A correct answer mentions that unsplit Documents map one-to-one to their sources (one per file or page), which makes them the natural place to inspect content and attach metadata. Splitters copy metadata onto every chunk, so anything you want on the chunks is easiest to set before splitting. You might also want to inspect documents before chunking to choose a chunk size per document type, or reuse the same loaded documents with different split strategies.
.load() and .load_and_split() work as documented. No breaking changes are expected in the 1.x line.