How to load PDF with LlamaIndex
Quick answer
Use LlamaIndex with its PyPDFLoader from llama_index to load PDF files. Fetch the loader, load the PDF into documents, then create a GPTVectorStoreIndex from the loaded documents for querying or indexing.
Prerequisites
- Python 3.8+
- pip install "llama-index>=0.6.0"
- pip install PyPDF2
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the required packages llama-index and PyPDF2 for PDF loading and indexing. Set your OpenAI API key as an environment variable.
pip install llama-index PyPDF2 openai
Step by step
This example loads a PDF file using PyPDFLoader, creates a vector index with GPTVectorStoreIndex, and queries it.
from pathlib import Path
# These top-level imports match llama-index releases before 0.10
from llama_index import GPTVectorStoreIndex, download_loader
# Set your OpenAI API key in the shell before running this script:
# export OPENAI_API_KEY="your-api-key"
# Fetch the PDF loader from LlamaHub, where it is published as "PDFReader"
PyPDFLoader = download_loader("PDFReader")
# Load the PDF; the loader returns a list of Document objects
loader = PyPDFLoader()
documents = loader.load_data(file=Path("example.pdf"))
# Build the vector index from the documents
index = GPTVectorStoreIndex.from_documents(documents)
# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic of the document?")
print(response.response)
Output
The main topic of the document is ...
Common variations
- Use SimpleDirectoryReader to load multiple PDFs from a folder.
- Switch to other index types like GPTListIndex for different retrieval strategies.
- Use async methods if integrating in async frameworks.
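from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader
# Load all PDFs in a directory; required_exts skips other file types
loader = SimpleDirectoryReader("./pdfs", required_exts=[".pdf"])
docs = loader.load_data()
# Build the vector index from the loaded documents
index = GPTVectorStoreIndex.from_documents(docs)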
# Query as usual
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")
print(response.response)
Output
Summary of the documents is ...
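The async variation could look roughly like the sketch below. It assumes the installed LlamaIndex version exposes an aquery coroutine on its query engines as the counterpart of query; ask_all and EchoEngine are hypothetical names introduced here, and EchoEngine is only a stand-in so the sketch runs without an API key.

```python
import asyncio


async def ask_all(query_engine, questions):
    """Send several questions concurrently to one query engine."""
    # Assumption: the engine exposes `aquery`, the async counterpart of `query`.
    responses = await asyncio.gather(
        *(query_engine.aquery(question) for question in questions)
    )
    # str(...) turns each response object into its answer text.
    return [str(response) for response in responses]


class EchoEngine:
    """Hypothetical stand-in for index.as_query_engine()."""

    async def aquery(self, question):
        return f"echo: {question}"


if __name__ == "__main__":
    questions = ["What is the main topic?", "Summarize the documents."]
    for answer in asyncio.run(ask_all(EchoEngine(), questions)):
        print(answer)
```

In a real application, pass index.as_query_engine() in place of EchoEngine(). asyncio.gather returns results in the same order as the questions.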
Troubleshooting
- If you get ModuleNotFoundError for PyPDF2, ensure it is installed with pip install PyPDF2.
- If the PDF is encrypted or corrupted, PyPDFLoader may fail to load it.
- Set OPENAI_API_KEY correctly in your environment to avoid authentication errors.
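The checks above can be sketched as a small pre-flight function that runs before any loading or indexing. This is a minimal sketch using only the standard library; preflight is a hypothetical helper name, and the header test is a rough heuristic for obviously broken files. It does not detect encryption.

```python
import importlib.util
import os
from pathlib import Path


def preflight(pdf_path, env=None):
    """Return a list of problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []

    # A missing package is what triggers ModuleNotFoundError for PyPDF2.
    if importlib.util.find_spec("PyPDF2") is None:
        problems.append("PyPDF2 is not installed (pip install PyPDF2)")

    # A missing or empty key leads to authentication errors at query time.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    path = Path(pdf_path)
    if not path.is_file():
        problems.append(f"{path} does not exist")
    else:
        # Heuristic only: PDF files carry a "%PDF-" marker near the start.
        with path.open("rb") as handle:
            if b"%PDF-" not in handle.read(1024):
                problems.append(f"{path} does not look like a PDF")

    return problems


if __name__ == "__main__":
    for problem in preflight("example.pdf"):
        print("Problem:", problem)
```

Returning a list instead of raising on the first failure lets you report every problem in one pass.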
Key Takeaways
- Use PyPDFLoader from llama_index to load PDF files easily.
- Create a GPTVectorStoreIndex from loaded documents for efficient querying.
- Install PyPDF2 and set OPENAI_API_KEY in your environment to avoid errors.