Intermediate · 4 min read

How to build a PDF question answering system

Quick answer
Build a PDF question answering system by extracting text from PDFs, embedding the text into vectors using an embedding model, storing them in a vector database, and then querying with a large language model (LLM) like gpt-4o using retrieval-augmented generation (RAG). This enables precise answers grounded in the PDF content.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key with available credit
  • pip install openai>=1.0
  • pip install PyPDF2 faiss-cpu

Setup

Install required Python packages for PDF text extraction, vector search, and OpenAI API interaction. Set your OpenAI API key as an environment variable.

bash
pip install openai PyPDF2 faiss-cpu
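
Then export your API key as an environment variable so the client can read it (the value below is a placeholder, not a real key):

```shell
export OPENAI_API_KEY="your-api-key-here"
```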

Step by step

This example extracts text from a PDF, splits it into chunks, embeds each chunk with OpenAI's text-embedding-3-large model, stores the vectors in FAISS, and queries with gpt-4o using RAG.

python
import os
from PyPDF2 import PdfReader
import faiss
import numpy as np
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Extract text from PDF
pdf_path = "sample.pdf"
reader = PdfReader(pdf_path)
# Join pages with newlines so words don't fuse across page boundaries
text = "\n".join(filter(None, (page.extract_text() for page in reader.pages)))

# Step 2: Split text into chunks (simple split by 500 chars)
chunk_size = 500
chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Step 3: Embed chunks using OpenAI embeddings
embeddings = []
for chunk in chunks:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunk
    )
    embeddings.append(response.data[0].embedding)

# Convert embeddings to numpy array
embedding_dim = len(embeddings[0])
embeddings_np = np.array(embeddings).astype('float32')

# Step 4: Create FAISS index and add vectors
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings_np)

# Step 5: Query function using RAG

def query_pdf(question):
    # Embed question
    q_embedding_resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=question
    )
    q_embedding = np.array(q_embedding_resp.data[0].embedding).astype('float32')
    q_embedding = q_embedding.reshape(1, -1)

    # Search FAISS for top 3 relevant chunks
    distances, indices = index.search(q_embedding, 3)
    # FAISS pads with -1 when fewer than k vectors exist, so skip those
    context = "\n---\n".join(chunks[i] for i in indices[0] if i != -1)

    # Build prompt with context
    prompt = f"Use the following context to answer the question:\n{context}\n\nQuestion: {question}\nAnswer:"

    # Call LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Example query
print(query_pdf("What is the main topic of the document?"))
output
The main topic of the document is ... (depends on PDF content)

Common variations

  • Use langchain for advanced document loaders and vectorstore abstractions.
  • Switch to async calls with asyncio for better throughput.
  • Use other vector databases like Chroma or Pinecone for scalable search.
  • Try different LLMs like claude-3-5-sonnet-20241022 for improved coding or reasoning.

Troubleshooting

  • If text extraction returns empty strings, verify the PDF is not scanned images (use OCR tools like pytesseract if needed).
  • If embeddings fail, check your API key and usage limits.
  • If retrieval quality is poor, add overlap between chunks, adjust the chunk size, or retrieve more chunks per query.
  • If FAISS index throws errors, ensure embeddings are float32 numpy arrays.
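
The simple slicing in Step 2 can cut a sentence exactly at a chunk boundary, which hurts retrieval. A drop-in replacement that adds overlap between consecutive chunks (the `chunk_size` and `overlap` defaults are illustrative starting points):

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into fixed-size chunks; consecutive chunks
    share `overlap` characters so boundary sentences survive."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 1200)
# 1200 chars with step 400 -> chunks of length 500, 500, 400
```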

Key Takeaways

  • Extract and chunk PDF text before embedding for efficient retrieval.
  • Use vector search (FAISS) to find relevant PDF chunks for a question.
  • Combine retrieved chunks with an LLM prompt for accurate answers.
  • OpenAI's gpt-4o and text-embedding-3-large models enable effective RAG pipelines.
  • Adjust chunk size and retrieval count to optimize accuracy and cost.
Verified 2026-04 · gpt-4o, text-embedding-3-large, claude-3-5-sonnet-20241022