How to index PDF documents for RAG
Quick answer
To index PDF documents for RAG, extract text from the PDFs using a library like PyPDF2 or pdfplumber, split the text into chunks, and generate embeddings with an OpenAI embedding model such as text-embedding-3-small. Store these embeddings in a vector database like FAISS to enable efficient similarity search during retrieval.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install PyPDF2 faiss-cpu
Setup
Install required Python packages for PDF text extraction, embeddings, and vector indexing.
pip install openai PyPDF2 faiss-cpu
Step by step
This example extracts text from a PDF, splits it into chunks, generates embeddings using OpenAI, and indexes them with FAISS for RAG.
import os

from openai import OpenAI
import PyPDF2
import faiss
import numpy as np

# Initialize the OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extract text from every page of a PDF
def extract_text_from_pdf(pdf_path):
    text = []
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            # extract_text() can return None for image-only pages
            text.append(page.extract_text() or "")
    return "\n".join(text)

# Simple text splitter by paragraphs
def split_text(text, max_chunk_size=500):
    paragraphs = text.split("\n")
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) + 1 > max_chunk_size:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para
        else:
            current_chunk += " " + para
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Generate embeddings for a list of texts
def get_embeddings(texts):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [data.embedding for data in response.data]

# Index embeddings with FAISS
def create_faiss_index(embeddings):
    dimension = len(embeddings[0])
    index = faiss.IndexFlatL2(dimension)  # L2 (Euclidean) distance
    index.add(np.array(embeddings).astype("float32"))
    return index

# Example usage
pdf_path = "sample.pdf"  # Replace with your PDF file path
text = extract_text_from_pdf(pdf_path)
chunks = split_text(text)
embeddings = get_embeddings(chunks)
index = create_faiss_index(embeddings)
print(f"Indexed {len(chunks)} chunks from {pdf_path} for RAG.")
Output
Indexed 15 chunks from sample.pdf for RAG.
Common variations
- Use pdfplumber for more accurate PDF text extraction.
- Use async calls with asyncio and the AsyncOpenAI client for large batch embedding generation.
- Store embeddings in a cloud vector database like Pinecone or Chroma instead of local FAISS.
- Use a different embedding model such as text-embedding-3-large for higher quality.
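For large documents, the embeddings endpoint limits how many inputs a single request may carry, so chunks must be sent in batches. A minimal sketch of that batching logic, written against any embed function so it runs without an API call; batch_embed and the stand-in embedder are illustrative names, not part of the OpenAI SDK:

```python
def batch_embed(texts, embed_fn, batch_size=100):
    # embed_fn is any callable mapping a list of strings to a list of
    # vectors, e.g. the get_embeddings function from the main example.
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors

# Demo with a stand-in embedder; real usage: batch_embed(chunks, get_embeddings)
fake_embed = lambda batch: [[float(len(t))] for t in batch]
result = batch_embed(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(len(result))  # 3
```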
Troubleshooting
- If PDF text extraction returns empty strings, try switching to pdfplumber, or check whether the PDF is scanned (image-based) and needs OCR.
- If embedding generation fails, verify that your OPENAI_API_KEY environment variable is set correctly.
- If the FAISS index throws dimension errors, ensure all embeddings have a consistent vector size.
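For the dimension errors above, a quick pre-flight check before calling index.add will surface any mismatched vectors; validate_embeddings is an illustrative helper, not part of FAISS:

```python
import numpy as np

def validate_embeddings(embeddings):
    # FAISS requires every vector to share the index's dimension.
    dims = {len(e) for e in embeddings}
    if len(dims) != 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    # FAISS also expects float32 input.
    return np.array(embeddings, dtype="float32")

matrix = validate_embeddings([[0.1, 0.2], [0.3, 0.4]])
print(matrix.shape)  # (2, 2)
```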
Key Takeaways
- Extract text from PDFs using reliable libraries like PyPDF2 or pdfplumber before embedding.
- Generate embeddings with OpenAI embedding models and store them in a vector database for fast similarity search.
- Use chunking to split large documents into manageable pieces for better retrieval accuracy.
- Local vector stores like FAISS are easy to set up; cloud vector DBs offer scalability.
- Always verify environment variables and embedding dimensions to avoid runtime errors.
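On the chunking point: the paragraph splitter in the main example can cut a sentence right at a chunk boundary, so its two halves end up in different chunks. A common refinement is character-level overlap, where each chunk repeats the tail of the previous one. A sketch under that assumption (split_with_overlap is an illustrative name):

```python
def split_with_overlap(text, chunk_size=500, overlap=100):
    # Each chunk repeats the last `overlap` characters of the previous
    # chunk, so text cut at a boundary still appears whole in one chunk.
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks

parts = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(parts)  # ['abcd', 'cdef', 'efgh', 'ghij']
```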