How to batch extract from multiple documents
Quick answer
Use a Python loop to read multiple documents and send each one's content to an AI API such as OpenAI or Anthropic for extraction. Call client.chat.completions.create (or the equivalent) on each document's text, then collect the structured outputs.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install "openai>=1.0"

output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
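Before running the batch, it helps to fail fast when the key is missing rather than crash mid-run. A minimal sketch (the helper name api_key_configured is my own, not part of the openai package):

```python
import os

def api_key_configured() -> bool:
    # True when OPENAI_API_KEY is present and non-empty
    return bool(os.environ.get("OPENAI_API_KEY"))
```

Call this at the top of your script and exit with a clear message when it returns False, instead of letting the client raise an authentication error partway through a batch.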
Step by step
This example reads multiple text documents from a folder, sends each to the OpenAI API for extraction, and prints the extracted content.
import os
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Folder containing documents
folder_path = "./documents"

# Function to extract text from a single document
def extract_text_from_document(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract the main content from this document:\n\n{text}"}],
    )
    return response.choices[0].message.content

# Batch process all text files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            document_text = f.read()
        extracted = extract_text_from_document(document_text)
        print(f"--- Extracted from {filename} ---")
        print(extracted)
        print()

output
--- Extracted from doc1.txt ---
This document explains the key features of our product including performance and usability.

--- Extracted from doc2.txt ---
The report summarizes quarterly sales data and highlights growth opportunities.
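API calls in a batch loop occasionally fail with transient errors such as rate limits. One way to harden the loop is a small retry wrapper with exponential backoff; this is a sketch (call_with_retries is a hypothetical helper, not part of the SDK):

```python
import time

def call_with_retries(fn, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    # Call fn(), retrying with exponential backoff on the given exceptions
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))
```

In the loop above you could then write call_with_retries(lambda: extract_text_from_document(document_text)); with openai>=1.0 you can narrow retry_on to (openai.RateLimitError,) so only rate-limit failures are retried.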
Common variations
- Use async calls with asyncio for parallel extraction to speed up batch processing.
- Switch to a smaller model such as gpt-4o-mini for cost-effective extraction on large batches.
- Use the Anthropic SDK with client.messages.create for similar batch extraction workflows.
import asyncio
import os
from openai import AsyncOpenAI

# Use the async client so requests can run concurrently
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def extract_async(text: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract key info:\n{text}"}]
    )
    return response.choices[0].message.content

async def main():
    folder_path = "./documents"
    filenames = []
    tasks = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as f:
                text = f.read()
            # Track filenames alongside tasks so results stay paired correctly
            filenames.append(filename)
            tasks.append(extract_async(text))
    results = await asyncio.gather(*tasks)
    for filename, extracted in zip(filenames, results):
        print(f"Extracted from {filename}:\n{extracted}\n")

if __name__ == "__main__":
    asyncio.run(main())

output
Extracted from doc1.txt:
Summary of the product features and benefits.

Extracted from doc2.txt:
Quarterly sales report highlights and analysis.
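Firing every request at once with asyncio.gather can itself trigger rate limits on large folders. A bounded-concurrency variant using asyncio.Semaphore caps how many requests are in flight at a time (gather_bounded is an illustrative helper name, shown here with no API dependency):

```python
import asyncio

async def gather_bounded(coros, limit=5):
    # Run coroutines concurrently, but at most `limit` at a time
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

In the async example above, replacing asyncio.gather(*tasks) with gather_bounded(tasks, limit=5) would keep at most five extraction requests running concurrently.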
Troubleshooting
- If you get RateLimitError, reduce batch size or add delays between requests.
- For AuthenticationError, verify your OPENAI_API_KEY environment variable is set correctly.
- If extraction results are incomplete, increase max_tokens or split very large documents into smaller chunks.
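The chunking fix in the last bullet can be sketched as a plain helper that splits on paragraph boundaries (split_into_chunks is a hypothetical name; you would extract each chunk separately and join the results):

```python
from typing import List

def split_into_chunks(text: str, max_chars: int = 8000) -> List[str]:
    # Pack whole paragraphs into chunks of at most max_chars characters;
    # a single paragraph longer than max_chars becomes its own oversized chunk.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Character counts are only a rough proxy for tokens, so leave headroom below the model's actual context limit.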
Key takeaways
- Batch extraction is done by looping over documents and calling the AI API per document.
- Use async calls to speed up processing of large document sets.
- Adjust model choice and parameters for cost and performance balance.
- Always handle API rate limits and authentication errors gracefully.