How to extract data from Word documents

Quick answer
Use python-docx to extract raw text from a Word (.docx) document, then pass that text to an AI API such as an OpenAI chat model to extract structured data or produce a summary.

PREREQUISITES

  • Python 3.8+
  • pip install python-docx "openai>=1.0"
  • OpenAI API key, exported as the OPENAI_API_KEY environment variable

Setup

Install the required Python packages to read Word documents and call AI APIs.

  • Use python-docx to parse .docx files.
  • Use the official openai Python SDK v1+ for AI processing.
bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install python-docx "openai>=1.0"

Step by step

This example extracts text from a Word document with python-docx, then sends the text to the OpenAI Chat Completions API to extract structured data or produce a summary.

python
import os
from docx import Document
from openai import OpenAI

# Load Word document text
def extract_text_from_docx(path):
    doc = Document(path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return "\n".join(full_text)

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Extract text from sample.docx
doc_text = extract_text_from_docx("sample.docx")

# Prepare prompt for structured data extraction
prompt = f"Extract key information as JSON from the following document text:\n\n{doc_text}\n\nJSON:" 

# Call OpenAI chat completion
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)

# Print extracted JSON or summary
print(response.choices[0].message.content)
output
{
  "title": "Quarterly Report",
  "date": "2026-03-31",
  "total_revenue": "$1,200,000",
  "key_points": ["Revenue increased by 10%", "New product launch successful"]
}
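
The model's reply arrives as plain text, so it may be worth confirming that it parses as JSON before relying on it. The sketch below shows one way to do that. `parse_model_json` is a hypothetical helper name, not part of the OpenAI SDK, and the fence-stripping step assumes the model may wrap its answer in a Markdown code fence.

```python
import json

# Markdown code-fence marker, built from single backticks for readability.
FENCE = "`" * 3


def parse_model_json(reply):
    """Parse a model reply as a JSON object, tolerating a Markdown code fence.

    Hypothetical helper for illustration; not part of the OpenAI SDK.
    """
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence line.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Usage with the example above (assumes `response` from the API call):
# record = parse_model_json(response.choices[0].message.content)
```

On models that support JSON mode, passing `response_format={"type": "json_object"}` to `client.chat.completions.create` is another option and may make the fence handling unnecessary.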

Common variations

You can use asynchronous calls with asyncio and the SDK's AsyncOpenAI client, which helps most when several documents are processed concurrently. Alternatively, switch models: gpt-4o for higher accuracy or gpt-4o-mini for lower cost.

For more complex document structures, consider also reading tables through python-docx (doc.tables), since doc.paragraphs does not include table text, or using specialized document AI services.

python
import asyncio
import os

from openai import AsyncOpenAI

# Reuses extract_text_from_docx() from the example above.

async def async_extract():
    # AsyncOpenAI exposes the same methods as OpenAI, but they are awaitable
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    doc_text = extract_text_from_docx("sample.docx")
    prompt = f"Extract key info as JSON from:\n\n{doc_text}\n\nJSON:"
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    print(response.choices[0].message.content)

asyncio.run(async_extract())
output
{
  "title": "Quarterly Report",
  "date": "2026-03-31",
  "total_revenue": "$1,200,000",
  "key_points": ["Revenue increased by 10%", "New product launch successful"]
}
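
For documents where important data sits in tables, the paragraph loop alone will miss it, because `doc.paragraphs` covers only body paragraphs. A minimal sketch of table extraction follows. The function names are hypothetical, and the pipe-separated rendering is just one assumed way to feed table rows into a prompt.

```python
def table_to_rows(table):
    """Return a table as a list of rows, each a list of stripped cell strings.

    Works with any object shaped like a python-docx Table
    (table.rows -> row.cells -> cell.text).
    """
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def extract_tables_from_docx(path):
    """Return every top-level table in the document as rows of cell text."""
    # Imported here so table_to_rows stays usable without python-docx installed.
    from docx import Document

    return [table_to_rows(table) for table in Document(path).tables]


def rows_to_text(rows):
    """Render rows as pipe-separated lines, ready to append to a prompt."""
    return "\n".join(" | ".join(row) for row in rows)


# Usage (assumes sample.docx exists and contains at least one table):
# for rows in extract_tables_from_docx("sample.docx"):
#     print(rows_to_text(rows))
```

The rendered table text can be appended to the paragraph text before building the prompt.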

Troubleshooting

  • If you get FileNotFoundError, verify the Word document path is correct.
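  • If python-docx cannot open the file, ensure the document is a valid .docx file (not .doc or corrupted).
  • If the extracted text is empty, the content may sit in tables, headers, or text boxes, which doc.paragraphs does not cover.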
  • For API errors, check your OPENAI_API_KEY environment variable is set and valid.
  • If the AI output is incomplete, increase max_tokens or split large documents into smaller chunks.
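
The chunking suggestion in the last item can be sketched as follows. `chunk_text` is a hypothetical helper, and `max_chars` is an assumed character budget used as a rough stand-in for a token limit, so the default value is illustrative only.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars, breaking on line boundaries.

    max_chars is an illustrative character budget, not a token count.
    A single paragraph longer than max_chars is hard-split.
    """
    chunks, current = [], ""
    for para in text.split("\n"):
        # Hard-split paragraphs that cannot fit into one chunk on their own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Usage (assumes doc_text from extract_text_from_docx):
# for chunk in chunk_text(doc_text):
#     ...send each chunk in its own request, then merge the results...
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of chunks that are somewhat uneven in size.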

Key Takeaways

  • Use python-docx to extract raw text from Word .docx files efficiently.
  • Leverage AI chat models like gpt-4o-mini to parse and extract structured data from extracted text.
  • Async API calls help when processing many documents or chunks concurrently; they do not make a single request faster.
  • Validate document format and API keys to avoid common errors.
  • Adjust model choice and token limits based on accuracy and cost needs.
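
The validation takeaway can be sketched as a small preflight check that runs before any API call. `preflight` is a hypothetical helper. It relies on the fact that a .docx file is a ZIP archive, so `zipfile.is_zipfile` serves as a cheap format check, though it does not prove the archive is a valid Word document.

```python
import os
import zipfile


def preflight(path, env=os.environ):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration. Legacy .doc files are not ZIP
    archives, so they fail the is_zipfile check.
    """
    problems = []
    if not os.path.isfile(path):
        problems.append(f"file not found: {path}")
    elif not zipfile.is_zipfile(path):
        problems.append(f"not a .docx (ZIP-based) file: {path}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


# Usage:
# for problem in preflight("sample.docx"):
#     print("Problem:", problem)
```
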
Verified 2026-04 · gpt-4o-mini, gpt-4o