Code Intermediate medium · 6 min

What document loaders do

What you will learn

Document loaders convert files (PDF, TXT, JSON, web pages) into LangChain Document objects that LLMs can process.

Why this matters

Most real-world LLM applications need to ingest external data: documents, databases, APIs. Document loaders are the standardized bridge between your file system and your LLM chain. Without them, you're manually parsing files and building Document objects, which is error-prone and defeats LangChain's purpose.

Skip if: Your data is already structured in a vector store or database (query those directly with retrievers instead), or you only have tiny inline text that you're hardcoding into prompts.

Explanation

What it is: A document loader is a LangChain class that reads files or content from a source and converts them into Document objects: lightweight Python objects with a page_content string and optional metadata dict. Examples: PyPDFLoader, TextLoader, WebBaseLoader.

How it works: You instantiate a loader with a file path or URL, call .load() to read and parse, and get back a list of Document objects. Most loaders return one Document per file; some return one per natural unit of the source, such as one per page for PyPDFLoader. Loaders do not chunk text for you. The metadata automatically captures the source, page numbers, or other context, which is critical for retrieval and citations later.
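To make the load step concrete, here is a minimal sketch of what a text loader does conceptually. SketchDocument and SketchTextLoader are hypothetical stand-ins written for this illustration; they are not LangChain's classes, and the real TextLoader likely handles encodings and errors more carefully.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SketchDocument:
    """Hypothetical stand-in mirroring the two fields of a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchTextLoader:
    """Hypothetical loader: reads one text file, returns a list with one document."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = str(file_path)
        self.encoding = encoding

    def load(self):
        text = Path(self.file_path).read_text(encoding=self.encoding)
        # Always a list, even for one file; the source path goes into metadata.
        return [SketchDocument(page_content=text, metadata={"source": self.file_path})]


# Demo against a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Document loaders turn files into Document objects.", encoding="utf-8")
    sketch_docs = SketchTextLoader(sample).load()

print(len(sketch_docs))             # 1
print(sketch_docs[0].page_content)  # Document loaders turn files into Document objects.
```

The shape is the point here: one call to load(), a list back, and each item carrying both text and a metadata dict that records where the text came from.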

When to use it: Use document loaders as the first step in any RAG (Retrieval-Augmented Generation) pipeline. Load your documents once, split them with a text splitter, embed them, and store in a vector database. Loaders handle the messy parsing work so your chain sees clean, standardized Document objects.

Analogy

A document loader is like a mail sorter at a post office: it takes messy, varied input (letters in envelopes, packages, different sizes) and converts it to a standardized format (all mail sorted into bins with labels). Your downstream system (the delivery trucks: your LLM) doesn't care what the original format was; it just processes standardized packages.

Code

Illustrative only - needs a local sample.txt file and the langchain-community and langchain-text-splitters packages; no API key is required

python

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

# Read the whole file into a single Document (returned inside a list)
loader = TextLoader('sample.txt')
documents = loader.load()

print(f'Number of documents: {len(documents)}')
print(f'First document type: {type(documents[0])}')
print(f'Page content length: {len(documents[0].page_content)}')
print(f'Metadata: {documents[0].metadata}')

# Split on spaces. The default separator is a blank line ('\n\n'),
# which would leave a single-paragraph file as one oversized chunk.
splitter = CharacterTextSplitter(
    separator=' ',
    chunk_size=200,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)

# Each chunk is itself a Document and inherits the parent's metadata
print(f'\nNumber of chunks after split: {len(chunks)}')
print(f'First chunk:\n{chunks[0].page_content}')
print(f'Chunk metadata: {chunks[0].metadata}')

Output

Number of documents: 1
First document type: <class 'langchain_core.documents.base.Document'>
Page content length: 512
Metadata: {'source': 'sample.txt'}

Number of chunks after split: 4
First chunk:
LangChain is a framework for building applications with large language models. It provides abstractions and tools to simplify the development of LLM-based systems. Document loaders
Chunk metadata: {'source': 'sample.txt'}

What just happened?

The TextLoader opened 'sample.txt', read its content, and returned a single Document object with the file content in page_content and the file path in metadata. Then the CharacterTextSplitter broke that one Document into 4 smaller chunks (up to 200 characters each, with up to 50 characters of overlap between neighbors), copying the metadata onto each chunk. Every chunk still knows its source file.
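The arithmetic behind chunk_size and chunk_overlap can be sketched with a plain sliding window. This is a simplified illustration and not CharacterTextSplitter's actual algorithm, which splits on a separator and merges the pieces back together; split_with_overlap is a hypothetical helper written for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# 512 characters, matching the page content length shown in the output above.
text = "".join(str(i % 10) for i in range(512))
chunks = split_with_overlap(text, chunk_size=200, chunk_overlap=50)

print([len(c) for c in chunks])           # [200, 200, 200, 62]
print(chunks[0][-50:] == chunks[1][:50])  # True
```

With a window of 200 and an overlap of 50, each window starts 150 characters after the previous one, which is why 512 characters produce four chunks rather than three.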

Common gotcha

Developers often assume document loaders automatically chunk text. They don't. .load() returns one Document per file (or per page for PDFs), not per semantic chunk. You must explicitly call a text splitter afterward if you want chunks. This is intentional (metadata is cleaner before splitting), but it surprises people coming from simple file-reading patterns. Also, loaders return a list of Documents, not a single Document, even for one file, so always access index 0 or iterate.

Error recovery

FileNotFoundError

The file path passed to the loader doesn't exist or is relative but executed from the wrong directory. Use absolute paths or verify the file exists before instantiating the loader.
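One way to apply this is to resolve and check the path before handing it to a loader. The sketch below uses only the standard library; resolve_existing_file and its base_dir parameter are hypothetical names chosen for this example.

```python
from pathlib import Path
import tempfile


def resolve_existing_file(path, base_dir=None):
    """Return an absolute Path, or raise FileNotFoundError saying where we looked."""
    candidate = Path(path)
    if not candidate.is_absolute():
        # Relative paths are resolved against base_dir, else the current directory.
        candidate = Path(base_dir or Path.cwd()) / candidate
    candidate = candidate.resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"No file at {candidate} (cwd is {Path.cwd()})")
    return candidate


# Demo in a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "notes.txt").write_text("hello", encoding="utf-8")
    found = resolve_existing_file("notes.txt", base_dir=tmp)
    found_name = found.name
    found_is_absolute = found.is_absolute()
    try:
        resolve_existing_file("missing.txt", base_dir=tmp)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(found_name, found_is_absolute, missing_raised)  # notes.txt True True
```

The returned path can then be passed on as a string, for example TextLoader(str(found)), so the loader never depends on the directory the script was started from.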

ModuleNotFoundError: No module named 'pypdf'

PyPDFLoader requires the 'pypdf' package. Install with: pip install pypdf. Same pattern for other loaders: each format may need its dependency.
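A small pre-flight check can surface missing dependencies before any loader is instantiated. This sketch uses importlib from the standard library; missing_packages is a hypothetical helper, and the placeholder module name in the demo is made up so that the example has something that is certainly not installed.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# 'json' ships with Python; the second name is a deliberately fake module.
print(missing_packages(["json", "surely_not_installed_pkg_12345"]))
# ['surely_not_installed_pkg_12345']
```

Calling missing_packages(["pypdf"]) before creating a PyPDFLoader turns a mid-pipeline crash into an early, readable message. Note that the name you import is not always the name you pip install, so check each loader's documentation for the right package.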

ImportError: cannot import name 'TextLoader' from 'langchain_community'

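TextLoader lives in the document_loaders submodule, not at the top level of the package. Import it with: from langchain_community.document_loaders import TextLoader. If the package itself is missing, install it: pip install langchain-community. As of langchain 1.2.x, most loaders live in langchain-community rather than in the base langchain package.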

AttributeError: 'Document' object has no attribute 'content'

Document objects use page_content, not content. Always access doc.page_content and doc.metadata.

Experienced dev note

Document loaders are stateless factories: they don't cache or maintain state. If you load the same file twice, you get fresh Document objects. This is good for reliability but bad for performance if you're loading gigabytes of documents in a loop. Load once, cache the Document list in memory or a vector store, reuse it. Also: metadata is your friend. Set custom metadata (author, category, date) on documents before embedding; it's searchable and shows up in retrieval results for debugging and citations. Don't rely on just filename in production.
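Attaching custom metadata can be as small as updating each document's metadata dict before embedding. The sketch below is duck-typed: tag_documents is a hypothetical helper that works on any object with a metadata dict, and the SimpleNamespace objects, file name, and field values are stand-ins invented for the demo so that it runs without LangChain installed.

```python
from types import SimpleNamespace


def tag_documents(docs, **extra):
    """Merge extra metadata keys into each document in place and return the list."""
    for doc in docs:
        doc.metadata.update(extra)
    return docs


# Stand-in for a loaded document: same two attributes a Document exposes.
docs = [SimpleNamespace(page_content="Q3 revenue summary",
                        metadata={"source": "report.txt"})]

tag_documents(docs, author="finance-team", category="quarterly-report")
print(docs[0].metadata)
# {'source': 'report.txt', 'author': 'finance-team', 'category': 'quarterly-report'}
```

Tagging before splitting is usually the cheaper order, since text splitters copy the parent's metadata onto every chunk.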

Check your understanding

Why would you NOT split documents immediately after loading them, even if you know your vector store expects 300-token chunks? What problem could that solve?

Answer hint

A correct answer mentions that metadata is tied to the original source document. If you split before deciding your strategy, you lose the clean source reference. You might also want to inspect documents before chunking to decide chunk size per-document type, or reuse the same loaded documents with different split strategies.

VERSION As of langchain 1.2.x (April 2026), document loaders live in langchain-community (langchain-community >=0.2.0). The API is stable: .load() and .load_and_split() work as documented. No breaking changes are expected in the 1.x line.

Once you have Documents loaded, you'll need to split them with a text splitter. Next, learn how to choose a chunk size and overlap strategy for your data type.
