How to load PDF with LlamaIndex

Quick answer

Fetch a PDF loader from LlamaHub with download_loader, call load_data on the PDF path to get a list of documents, then build a GPTVectorStoreIndex from those documents with GPTVectorStoreIndex.from_documents and query it.

PREREQUISITES

Python 3.8+
pip install "llama-index>=0.6.0,<0.10" (the top-level llama_index imports used here moved to llama_index.core in 0.10)
pip install PyPDF2
OpenAI API key (free tier works)
pip install "openai>=1.0"

Setup

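Install llama-index and PyPDF2 for PDF loading and indexing, plus openai for the embedding and completion calls. Quote any version specifier so the shell does not treat > as a redirect. Then set your OpenAI API key as an environment variable.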

bash

pip install "llama-index>=0.6.0,<0.10" PyPDF2 "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

Step by step

This example fetches a PDF loader with download_loader, loads a single PDF file into documents, builds a vector index with GPTVectorStoreIndex, and queries it.

python

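import os
from pathlib import Path

# Top-level imports as in llama-index releases before 0.10;
# from 0.10 on, the core classes live in llama_index.core instead.
from llama_index import GPTVectorStoreIndex, download_loader

# The index calls the OpenAI API, so the key must already be set in the shell:
#   export OPENAI_API_KEY="your-api-key"
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running this script.")

# Fetch the PDF loader from LlamaHub, where it is published as PDFReader
# (PyPDFLoader is the name of LangChain's PDF loader, not LlamaIndex's)
PDFReader = download_loader("PDFReader")

# Load the PDF; load_data takes the file path and returns a list of documents
loader = PDFReader()
documents = loader.load_data(file=Path("example.pdf"))

# Build the vector index from the documents
index = GPTVectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic of the document?")
print(response.response)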

output

The main topic of the document is ...
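
Before building an index, it can help to confirm that the loader actually extracted text, since a scanned or image-only PDF may come back as empty documents. The sketch below assumes each loaded document exposes its content as a text attribute; the helper names summarize_documents and empty_documents are hypothetical, and the sample objects are stand-ins so the snippet runs without a real PDF.

```python
from types import SimpleNamespace


def summarize_documents(documents):
    """Return (position, character count) pairs for the loaded documents."""
    return [
        (i, len(getattr(doc, "text", "") or ""))
        for i, doc in enumerate(documents)
    ]


def empty_documents(documents):
    """Return positions of documents whose text is empty or only whitespace."""
    return [
        i
        for i, doc in enumerate(documents)
        if not (getattr(doc, "text", "") or "").strip()
    ]


# Stand-in objects (assumption: real documents expose a .text attribute)
sample = [SimpleNamespace(text="Page one text"), SimpleNamespace(text="   ")]

print(summarize_documents(sample))
print(empty_documents(sample))
```

In a real script, pass the documents list returned by load_data instead of sample. If every position shows up as empty, the PDF probably needs OCR before it is worth indexing.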

Common variations

Use SimpleDirectoryReader to load every PDF in a folder.
Switch to another index type, such as GPTListIndex, for a different retrieval strategy.
Use the query engine's async method, aquery, when integrating with an async framework.

python

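from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# Load every PDF in the directory. SimpleDirectoryReader picks a reader by
# file extension; file_extractor, if used, must be a dict mapping extensions
# to reader instances, not a string.
loader = SimpleDirectoryReader("./pdfs", required_exts=[".pdf"])
docs = loader.load_data()

# Build the index from the loaded documents
index = GPTVectorStoreIndex.from_documents(docs)

# Query as usual
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")
print(response.response)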

output

Summary of the documents is ...
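
For the async variation, one possible pattern is to send several questions concurrently with asyncio.gather. This is a minimal sketch, assuming the query engine exposes an aquery coroutine and that its responses convert to text with str; ask_all and EchoEngine are hypothetical names, and EchoEngine is only a stand-in so the snippet runs without an index or API key.

```python
import asyncio


async def ask_all(query_engine, questions):
    """Send all questions concurrently; answers come back in question order."""
    responses = await asyncio.gather(
        *(query_engine.aquery(question) for question in questions)
    )
    return [str(response) for response in responses]


class EchoEngine:
    """Hypothetical stand-in for a real query engine."""

    async def aquery(self, question):
        await asyncio.sleep(0)  # yield control, as a network call would
        return f"answer to: {question}"


answers = asyncio.run(
    ask_all(EchoEngine(), ["What is the main topic?", "Who is the author?"])
)
print(answers)
```

With a real index, pass index.as_query_engine() in place of EchoEngine(). Inside a framework that already runs an event loop, await ask_all(...) directly rather than calling asyncio.run.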

Troubleshooting

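If you get ModuleNotFoundError for PyPDF2, install it with pip install PyPDF2. Some loader releases import pypdf instead, so install whichever package the error message names.
If the PDF is encrypted or corrupted, the PDF loader may fail to load it or return empty text.
If you get authentication errors, check that OPENAI_API_KEY is set in the same shell or process that runs the script.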
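
The checks above can be run up front in a small pre-flight helper. This is a sketch using only the standard library; preflight is a hypothetical name, and the header and /Encrypt checks are rough heuristics rather than a full PDF validation.

```python
import os
from pathlib import Path


def preflight(pdf_path, env=None):
    """Return a list of likely problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []

    # Authentication errors usually mean the key is missing from the environment
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    path = Path(pdf_path)
    if not path.is_file():
        problems.append(f"{path} does not exist")
        return problems

    data = path.read_bytes()
    # Heuristic: a PDF normally carries the %PDF- marker near the start
    if b"%PDF-" not in data[:1024]:
        problems.append(f"{path} does not look like a PDF")
    # Heuristic: encrypted PDFs reference an /Encrypt dictionary
    if b"/Encrypt" in data:
        problems.append(f"{path} appears to be encrypted")
    return problems


for problem in preflight("example.pdf"):
    print("Problem:", problem)
```

Running this before the loader gives a clearer message than a stack trace from deep inside the PDF library.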

Key Takeaways

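Fetch a PDF loader with download_loader, or use SimpleDirectoryReader, to load PDF files into documents.
Build a GPTVectorStoreIndex from the loaded documents with from_documents, then query it through index.as_query_engine().
Install PyPDF2 and set OPENAI_API_KEY in your environment to avoid import and authentication errors.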

Verified 2026-04 · gpt-4o
