How to batch extract from multiple documents
Quick answer
Use a Python loop to read multiple documents and send each one's content to an AI API such as OpenAI or Anthropic for extraction. Call client.chat.completions.create (or the equivalent) on each document's text, then collect the structured outputs.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install "openai>=1.0"

output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
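Before running the batch, it helps to fail fast when the key is missing rather than crash mid-run. A minimal sketch (the helper name api_key_configured is my own, not part of the openai package):

```python
import os

def api_key_configured() -> bool:
    # True when OPENAI_API_KEY is present and non-empty
    return bool(os.environ.get("OPENAI_API_KEY"))
```

Call this at the top of your script and exit with a clear message when it returns False, instead of letting the client raise an authentication error partway through a batch.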
Step by step
This example reads multiple text documents from a folder, sends each to the OpenAI API for extraction, and prints the extracted content.
import os
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Folder containing documents
folder_path = "./documents"

# Function to extract text from a single document
def extract_text_from_document(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract the main content from this document:\n\n{text}"}],
    )
    return response.choices[0].message.content

# Batch process all text files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            document_text = f.read()
        extracted = extract_text_from_document(document_text)
        print(f"--- Extracted from {filename} ---")
        print(extracted)
        print()

output
--- Extracted from doc1.txt ---
This document explains the key features of our product including performance and usability.

--- Extracted from doc2.txt ---
The report summarizes quarterly sales data and highlights growth opportunities.
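API calls in a batch loop occasionally fail with transient errors such as rate limits. One way to harden the loop is a small retry wrapper with exponential backoff; this is a sketch (call_with_retries is a hypothetical helper, not part of the SDK):

```python
import time

def call_with_retries(fn, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    # Call fn(), retrying with exponential backoff on the given exceptions
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))
```

In the loop above you could then write call_with_retries(lambda: extract_text_from_document(document_text)); with openai>=1.0 you can narrow retry_on to (openai.RateLimitError,) so only rate-limit failures are retried.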
Common variations
- Use async calls with asyncio for parallel extraction to speed up batch processing.
- Switch to a smaller model such as gpt-4o-mini for cost-effective extraction on large batches.
- Use the Anthropic SDK with client.messages.create for similar batch extraction workflows.
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client so requests can run concurrently
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def extract_async(text: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract key info:\n{text}"}]
    )
    return response.choices[0].message.content

async def main():
    folder_path = "./documents"
    filenames = []
    tasks = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # Track filenames alongside tasks so results stay paired correctly
            filenames.append(filename)
            tasks.append(extract_async(text))
    results = await asyncio.gather(*tasks)
    for filename, extracted in zip(filenames, results):
        print(f"Extracted from {filename}:\n{extracted}\n")

if __name__ == "__main__":
    asyncio.run(main())

output
Extracted from doc1.txt:
Summary of the product features and benefits.

Extracted from doc2.txt:
Quarterly sales report highlights and analysis.
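Firing every request at once with asyncio.gather can itself trigger rate limits on large folders. A bounded-concurrency variant using asyncio.Semaphore caps how many requests are in flight at a time (gather_bounded is an illustrative helper name, shown here with no API dependency):

```python
import asyncio

async def gather_bounded(coros, limit=5):
    # Run coroutines concurrently, but at most `limit` at a time
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

In the async example above, replacing asyncio.gather(*tasks) with gather_bounded(tasks, limit=5) would keep at most five extraction requests running concurrently.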
Troubleshooting
- If you get RateLimitError, reduce batch size or add delays between requests.
- For AuthenticationError, verify your OPENAI_API_KEY environment variable is set correctly.
- If extraction results are incomplete, increase max_tokens or split very large documents into smaller chunks.
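The chunking fix in the last bullet can be sketched as a plain helper that splits on paragraph boundaries (split_into_chunks is a hypothetical name; you would extract each chunk separately and join the results):

```python
from typing import List

def split_into_chunks(text: str, max_chars: int = 8000) -> List[str]:
    # Pack whole paragraphs into chunks of at most max_chars characters;
    # a single paragraph longer than max_chars becomes its own oversized chunk.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Character counts are only a rough proxy for tokens, so leave headroom below the model's actual context limit.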
Key takeaways
- Batch extraction is done by looping over documents and calling the AI API per document.
- Use async calls to speed up processing of large document sets.
- Adjust model choice and parameters for cost and performance balance.
- Always handle API rate limits and authentication errors gracefully.