How to index PDF documents for RAG
Quick answer
To index PDF documents for RAG, extract text from the PDFs using a library like PyPDF2 or pdfplumber, split the text into chunks, and generate embeddings with an OpenAI embedding model such as text-embedding-3-small. Store these embeddings in a vector database like FAISS to enable efficient similarity search during retrieval.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install PyPDF2 faiss-cpu
Setup
Install required Python packages for PDF text extraction, embeddings, and vector indexing.
pip install openai PyPDF2 faiss-cpu
Step by step
This example extracts text from a PDF, splits it into chunks, generates embeddings using OpenAI, and indexes them with FAISS for RAG.
import os

from openai import OpenAI
import PyPDF2
import faiss
import numpy as np

# Initialize the OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extract text from every page of a PDF
def extract_text_from_pdf(pdf_path):
    text = []
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            # extract_text() can return None for image-only pages
            text.append(page.extract_text() or "")
    return "\n".join(text)

# Simple text splitter by paragraphs
def split_text(text, max_chunk_size=500):
    paragraphs = text.split("\n")
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) + 1 > max_chunk_size:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para
        else:
            current_chunk += " " + para
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Generate embeddings for a list of texts
def get_embeddings(texts):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [data.embedding for data in response.data]

# Index embeddings with FAISS
def create_faiss_index(embeddings):
    dimension = len(embeddings[0])
    index = faiss.IndexFlatL2(dimension)  # L2 (Euclidean) distance
    index.add(np.array(embeddings).astype("float32"))
    return index

# Example usage
pdf_path = "sample.pdf"  # Replace with your PDF file path
text = extract_text_from_pdf(pdf_path)
chunks = split_text(text)
embeddings = get_embeddings(chunks)
index = create_faiss_index(embeddings)
print(f"Indexed {len(chunks)} chunks from {pdf_path} for RAG.")
Output
Indexed 15 chunks from sample.pdf for RAG.
Common variations
- Use pdfplumber for more accurate PDF text extraction.
- Use async calls with asyncio and the AsyncOpenAI client for large batch embedding generation.
- Store embeddings in a cloud vector database like Pinecone or Chroma instead of local FAISS.
- Use a different embedding model such as text-embedding-3-large for higher quality.
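For large documents, the embeddings endpoint limits how many inputs a single request may carry, so chunks must be sent in batches. A minimal sketch of that batching logic, written against any embed function so it runs without an API call; batch_embed and the stand-in embedder are illustrative names, not part of the OpenAI SDK:

```python
def batch_embed(texts, embed_fn, batch_size=100):
    # embed_fn is any callable mapping a list of strings to a list of
    # vectors, e.g. the get_embeddings function from the main example.
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors

# Demo with a stand-in embedder; real usage: batch_embed(chunks, get_embeddings)
fake_embed = lambda batch: [[float(len(t))] for t in batch]
result = batch_embed(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(len(result))  # 3
```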
Troubleshooting
- If PDF text extraction returns empty strings, try switching to pdfplumber, or check whether the PDF is scanned (image-based) and needs OCR.
- If embedding generation fails, verify that your OPENAI_API_KEY environment variable is set correctly.
- If the FAISS index throws dimension errors, ensure all embeddings have a consistent vector size.
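For the dimension errors above, a quick pre-flight check before calling index.add will surface any mismatched vectors; validate_embeddings is an illustrative helper, not part of FAISS:

```python
import numpy as np

def validate_embeddings(embeddings):
    # FAISS requires every vector to share the index's dimension.
    dims = {len(e) for e in embeddings}
    if len(dims) != 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    # FAISS also expects float32 input.
    return np.array(embeddings, dtype="float32")

matrix = validate_embeddings([[0.1, 0.2], [0.3, 0.4]])
print(matrix.shape)  # (2, 2)
```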
Key Takeaways
- Extract text from PDFs using reliable libraries like PyPDF2 or pdfplumber before embedding.
- Generate embeddings with OpenAI embedding models and store them in a vector database for fast similarity search.
- Use chunking to split large documents into manageable pieces for better retrieval accuracy.
- Local vector stores like FAISS are easy to set up; cloud vector DBs offer scalability.
- Always verify environment variables and embedding dimensions to avoid runtime errors.
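On the chunking point: the paragraph splitter in the main example can cut a sentence right at a chunk boundary, so its two halves end up in different chunks. A common refinement is character-level overlap, where each chunk repeats the tail of the previous one. A sketch under that assumption (split_with_overlap is an illustrative name):

```python
def split_with_overlap(text, chunk_size=500, overlap=100):
    # Each chunk repeats the last `overlap` characters of the previous
    # chunk, so text cut at a boundary still appears whole in one chunk.
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks

parts = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(parts)  # ['abcd', 'cdef', 'efgh', 'ghij']
```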