
How to load a PDF with LlamaIndex

Quick answer
Use PDFReader from llama_index.readers.file to load a PDF into Document objects. Create the reader, call load_data with the PDF path, then build a VectorStoreIndex (GPTVectorStoreIndex in older releases) with VectorStoreIndex.from_documents and query it.

PREREQUISITES

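  • Python 3.9+
  • pip install "llama-index>=0.10" (quote the requirement so the shell does not treat >= as a redirect)
  • pip install pypdf (normally pulled in with llama-index)
  • OpenAI API key (the default embedding model and LLM both call the OpenAI API)
  • pip install "openai>=1.0"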

Setup

Install llama-index and pypdf for PDF loading and indexing, then set your OpenAI API key as the OPENAI_API_KEY environment variable.

bash
pip install llama-index pypdf openai
export OPENAI_API_KEY="your-api-key"
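
Before running the examples, it can help to fail early with a clear message when the key is missing. The following is a minimal sketch that uses only the standard library; get_openai_api_key is a hypothetical helper name, not part of LlamaIndex, and the key value in the demo is a placeholder.

```python
import os


def get_openai_api_key(env=None):
    """Return the OpenAI API key from a mapping (os.environ by default), or raise."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in your shell before running the examples."
        )
    return key


# Demo with an explicit mapping, so the snippet runs without touching your real environment.
print(get_openai_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

In a real script you would call get_openai_api_key() with no arguments at startup, so a missing key surfaces before any PDF is parsed or embedded.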

Step by step

This example loads a PDF file with PDFReader, builds a vector index with VectorStoreIndex.from_documents, and queries it.

python
from pathlib import Path

from llama_index.core import VectorStoreIndex
from llama_index.readers.file import PDFReader

# The OpenAI client reads OPENAI_API_KEY from the environment.
# Set it in your shell first: export OPENAI_API_KEY="your-api-key"

# Load the PDF into a list of Document objects
reader = PDFReader()
documents = reader.load_data(file=Path("example.pdf"))

# Build the vector index (this embeds the text through the OpenAI API)
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic of the document?")
print(response.response)

output
The main topic of the document is ...

Common variations

  • Use SimpleDirectoryReader to load multiple PDFs from a folder.
  • Switch to other index types, such as SummaryIndex (formerly GPTListIndex), for different retrieval strategies.
  • Use the async methods, such as aquery on a query engine, when integrating with an async framework.

python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every PDF in a directory; the reader picks a PDF parser by file extension
docs = SimpleDirectoryReader("./pdfs", required_exts=[".pdf"]).load_data()

index = VectorStoreIndex.from_documents(docs)

# Query as usual
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")
print(response.response)

output
Summary of the documents is ...
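
For the async variation, the sketch below shows one way to run several queries concurrently with asyncio.gather. To keep it runnable offline, FakeQueryEngine is a hypothetical stand-in that only mimics the awaitable aquery method of a LlamaIndex query engine; in a real application you would pass the object returned by index.as_query_engine() instead.

```python
import asyncio


class FakeQueryEngine:
    """Hypothetical stand-in for a LlamaIndex query engine, so the sketch runs offline."""

    async def aquery(self, question):
        await asyncio.sleep(0)  # yield to the event loop, as a real network call would
        return f"stub answer to: {question}"


async def ask_all(query_engine, questions):
    """Run all queries concurrently and return the answers in question order."""
    tasks = [query_engine.aquery(q) for q in questions]
    return await asyncio.gather(*tasks)


questions = ["What is covered?", "Who is the author?"]
answers = asyncio.run(ask_all(FakeQueryEngine(), questions))
for answer in answers:
    print(answer)
```

Inside a framework that already runs an event loop, you would await ask_all(...) directly rather than calling asyncio.run.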

Troubleshooting

  • If you get a ModuleNotFoundError for pypdf or llama_index.readers.file, install the missing package with pip install pypdf llama-index-readers-file.
  • If the PDF is encrypted or corrupted, PDFReader may fail to load it.
  • Set OPENAI_API_KEY correctly in your environment to avoid authentication errors.
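
Some of these problems can be caught before any loader runs. The sketch below is a rough preflight check using only the standard library; looks_like_pdf and missing_modules are hypothetical helper names, and the header test is a heuristic (it confirms the file starts with the %PDF- marker, but it cannot detect encryption or deeper corruption). Which PDF backend module you need depends on your llama-index version, so the demo checks for both.

```python
import importlib.util
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF header bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Demo: report which packages are absent and whether example.pdf looks valid.
print("Missing modules:", missing_modules(["llama_index", "pypdf", "PyPDF2"]))
print("example.pdf looks like a PDF:", looks_like_pdf("example.pdf"))
```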

Key Takeaways

  • Use PDFReader from llama_index.readers.file to load a single PDF, or SimpleDirectoryReader to load a folder of PDFs.
  • Build a VectorStoreIndex with VectorStoreIndex.from_documents to query the loaded documents.
  • Install pypdf and set OPENAI_API_KEY in your environment to avoid errors.
Verified 2026-04 · gpt-4o