How to build a PDF question answering system
Quick answer
Build a PDF question answering system by extracting text from PDFs, embedding the text into vectors using an embedding model, storing them in a vector database, and then querying with a large language model (LLM) like
gpt-4o using retrieval-augmented generation (RAG). This enables precise answers grounded in the PDF content.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install PyPDF2 faiss-cpu
Setup
Install required Python packages for PDF text extraction, vector search, and OpenAI API interaction. Set your OpenAI API key as an environment variable.
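Setting the API key as an environment variable on macOS/Linux might look like this (the key value below is a placeholder, not a real key):

```shell
# Placeholder key -- replace with your actual OpenAI API key
export OPENAI_API_KEY="sk-your-key-here"
```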
pip install openai PyPDF2 faiss-cpu

Step by step
This example extracts text from a PDF, splits it into chunks, embeds each chunk with OpenAI's text-embedding-3-large model, stores the vectors in FAISS, and queries with gpt-4o using RAG.
import os
from PyPDF2 import PdfReader
import faiss
import numpy as np
from openai import OpenAI
# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Step 1: Extract text from PDF
pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
text = "\n".join(page.extract_text() or "" for page in reader.pages)
# Step 2: Split text into chunks (simple split by 500 chars)
chunk_size = 500
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
# Step 3: Embed chunks using OpenAI embeddings
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunk
    )
    embeddings.append(response.data[0].embedding)
# Convert embeddings to numpy array
embedding_dim = len(embeddings[0])
embeddings_np = np.array(embeddings).astype('float32')
# Step 4: Create FAISS index and add vectors
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings_np)
# Step 5: Query function using RAG
def query_pdf(question):
    # Embed the question with the same model used for the chunks
    q_embedding_resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=question
    )
    q_embedding = np.array(q_embedding_resp.data[0].embedding).astype('float32')
    q_embedding = q_embedding.reshape(1, -1)
    # Search FAISS for the top 3 relevant chunks
    D, I = index.search(q_embedding, 3)
    context = "\n---\n".join(chunks[i] for i in I[0])
    # Build prompt with context
    prompt = f"Use the following context to answer the question:\n{context}\nQuestion: {question}\nAnswer:"
    # Call LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
# Example query
print(query_pdf("What is the main topic of the document?"))

Output
The main topic of the document is ... (depends on PDF content)
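FAISS's IndexFlatL2 performs an exact (brute-force) L2 search over all stored vectors. A NumPy-only sketch of the same top-k lookup, shown here just to clarify what index.search returns (top_k_l2 is a hypothetical helper, not part of faiss):

```python
import numpy as np

def top_k_l2(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    # Squared L2 distance from the query to every stored vector,
    # then the indices of the k smallest -- analogous to the I array
    # returned by IndexFlatL2.search.
    d = ((vectors - query) ** 2).sum(axis=1)
    return np.argsort(d)[:k]

vecs = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], dtype="float32")
print(top_k_l2(vecs, np.array([0.9, 0.9], dtype="float32"), 2))  # → [1 0]
```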
Common variations
- Use langchain for advanced document loaders and vector store abstractions.
- Switch to async calls with asyncio for better throughput.
- Use other vector databases like Chroma or Pinecone for scalable search.
- Try different LLMs like claude-3-5-sonnet-20241022 for improved coding or reasoning.
Troubleshooting
- If text extraction returns empty strings, verify the PDF is not made of scanned images (use OCR tools like pytesseract if needed).
- If embedding calls fail, check your API key and usage limits.
- For poor retrieval results, add chunk overlap or adjust the chunk size.
- If FAISS index throws errors, ensure embeddings are float32 numpy arrays.
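The chunk overlap mentioned above can be added with a sliding window, so a sentence split at a chunk boundary still appears intact in the next chunk. A minimal sketch (chunk_text is a hypothetical helper replacing the simple split in step 2):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Step forward by (chunk_size - overlap) so each chunk repeats the
    # last `overlap` characters of the previous one.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=100)
```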
Key Takeaways
- Extract and chunk PDF text before embedding for efficient retrieval.
- Use vector search (FAISS) to find relevant PDF chunks for a question.
- Combine retrieved chunks with an LLM prompt for accurate answers.
- OpenAI's gpt-4o and text-embedding-3-large models enable effective RAG pipelines.
- Adjust chunk size and retrieval count to optimize accuracy and cost.