
How to load PDF in LangChain

Quick answer
Use PyPDFLoader from langchain_community.document_loaders to load PDF files in LangChain. Instantiate it with the path to the PDF, then call load() to get a list of LangChain Document objects, one per page.

PREREQUISITES

  • Python 3.8+
  • pip install langchain langchain_community pypdf
  • A PDF file to load

Setup

Install LangChain, the community integrations package, and pypdf, the PDF parser that PyPDFLoader depends on.

bash
pip install langchain langchain_community pypdf

Step by step

Use PyPDFLoader to load a PDF file and extract its text content as LangChain documents.

python
from langchain_community.document_loaders import PyPDFLoader

# Path to your PDF file
pdf_path = "example.pdf"

# Initialize the loader
loader = PyPDFLoader(pdf_path)

# Load the documents
documents = loader.load()

# Print the first page content
print(documents[0].page_content)
output
This is the text content of the first page of example.pdf...
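
Each Document returned by load() carries the page text in page_content and a dictionary of details in metadata. As a minimal sketch, the helper below summarizes a loaded PDF page by page. The name summarize_pages is hypothetical, and the code assumes PyPDFLoader stores a zero-based "page" key in metadata; it falls back to the list position if that key is missing.

```python
def summarize_pages(documents):
    """Return (page_number, character_count) for each loaded page.

    Works on any objects exposing `page_content` and `metadata`,
    such as the Documents returned by PyPDFLoader.load().
    """
    summary = []
    for index, doc in enumerate(documents):
        # Assumption: metadata has a "page" key; fall back to list position.
        page = doc.metadata.get("page", index)
        summary.append((page, len(doc.page_content)))
    return summary
```

Calling summarize_pages(documents) on the list loaded above gives a quick view of how much text each page yielded.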

Common variations

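  • Use load_and_split() to load the PDF and split it into smaller chunks in one call; with no arguments it falls back to a default RecursiveCharacterTextSplitter.
  • Use other loaders like UnstructuredPDFLoader for PDFs with complex layouts; it needs the separate unstructured package.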
  • Combine with LangChain text splitting and embeddings for downstream tasks.
python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")
docs = loader.load_and_split()
print(f"Loaded {len(docs)} chunks from PDF")
output
Loaded 10 chunks from PDF
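
Chunk size can also be controlled by loading the pages with load() and passing them to a text splitter explicitly. The following is a minimal sketch, not the only way to do it. It assumes the langchain-text-splitters package is installed (it is not listed in the prerequisites); the function name split_pdf and the chunk_size and chunk_overlap values are illustrative choices.

```python
def split_pdf(pdf_path, chunk_size=1000, chunk_overlap=100):
    """Load a PDF and split its pages into overlapping text chunks."""
    # Imported inside the function so the dependencies are only needed when called.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # chunk_size and chunk_overlap are measured in characters by default.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    pages = PyPDFLoader(pdf_path).load()
    # split_documents keeps each page's metadata on the chunks derived from it.
    return splitter.split_documents(pages)
```

A call such as split_pdf("example.pdf", chunk_size=500) returns a list of Document chunks that can be passed on to an embedding step.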

Troubleshooting

  • If you get errors loading the PDF, verify the file path and that the PDF is not corrupted.
  • Scanned PDFs contain page images rather than text, so extraction may return empty or near-empty page_content; run OCR on the file first.
  • Ensure langchain_community is up to date for latest loader fixes.
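
The first two checks above can be sketched with the standard library alone. Both helper names are hypothetical. The header test is only a rough heuristic (it assumes the file starts with the "%PDF-" marker and does not detect deeper corruption), and a page with no extracted text is only a hint that the PDF may be scanned.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if `path` is an existing file that starts with the PDF marker."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        # PDF files normally begin with the bytes "%PDF-".
        return handle.read(5) == b"%PDF-"


def pages_without_text(documents):
    """Return list indexes of loaded pages whose text is empty or whitespace."""
    return [
        index
        for index, doc in enumerate(documents)
        if not doc.page_content.strip()
    ]
```

Running looks_like_pdf(pdf_path) before creating the loader catches a wrong path early, and pages_without_text(documents) after load() points to pages that likely need OCR.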

Key Takeaways

  • Use PyPDFLoader from langchain_community.document_loaders to load PDFs easily.
  • Call load() for one Document per page or load_and_split() for chunked text.
  • Check PDF integrity and consider OCR for scanned documents before loading.
  • Keep langchain_community updated for best compatibility and features.
Verified 2026-04