
How to use Unstructured for document parsing

Quick answer
Install the unstructured Python library with pip, then call one of its partition functions (for example, partition_pdf or partition_docx) to turn a PDF, DOCX, or other supported file into a list of structured text elements. No manual text extraction is needed.

PREREQUISITES

  • Python 3.8+
  • pip install "unstructured[pdf,docx]" (the extras add the PDF and DOCX dependencies)
  • Basic Python knowledge

Setup

Install the unstructured library with pip. The base package covers plain-text formats such as HTML; PDF and DOCX support ship as optional extras, so request them explicitly.

bash
pip install "unstructured[pdf,docx]"
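
To confirm that the install landed in the environment you are actually running, a quick check with the standard library can help. The sketch below is only an illustration: the helper name installed_version is hypothetical, and it reports whether the package is present, not whether the optional extras are.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "unstructured") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing.

    Hypothetical helper: it only inspects package metadata and does not
    verify that optional extras (PDF, DOCX) were installed.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("unstructured is not installed in this environment")
else:
    print(f"unstructured {found} is installed")
```

If this reports that the package is missing even though pip succeeded, pip and the interpreter are probably pointing at different environments.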

Step by step

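Import partition_pdf from unstructured.partition.pdf and call it with the path to your file. It returns a list of elements (titles, paragraphs, list items, and so on), each of which exposes its content through a text attribute.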

python
from unstructured.partition.pdf import partition_pdf

# Path to your PDF document
file_path = "example.pdf"

# Parse the PDF document
elements = partition_pdf(filename=file_path)

# Print extracted text elements
for element in elements:
    print(element.text)
output
This is the first paragraph of the PDF.
This is the second paragraph.
...
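
Because the result is a list of typed elements rather than one long string, it is often useful to group the text by element type. The following is a minimal sketch, assuming that each element's class name (for example Title or NarrativeText) reflects its type. The helper group_text_by_type and the stand-in classes are hypothetical and exist only so the example runs without the library; with real output you would pass the list returned by partition_pdf.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_text_by_type(elements: Iterable[object]) -> Dict[str, List[str]]:
    """Group each element's text under the name of the element's class."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for element in elements:
        # Treat a missing or None text attribute as empty text.
        text = (getattr(element, "text", "") or "").strip()
        if text:  # skip elements that carry no visible text
            grouped[type(element).__name__].append(text)
    return dict(grouped)


# Stand-in classes (assumption) so the sketch runs without unstructured.
class _Element:
    def __init__(self, text: str) -> None:
        self.text = text


class Title(_Element):
    pass


class NarrativeText(_Element):
    pass


sample = [
    Title("Annual Report"),
    NarrativeText("Revenue grew."),
    NarrativeText("   "),  # whitespace only, so it is skipped
]
print(group_text_by_type(sample))
```

Grouping this way makes it easy to pull out only the headings, or to drop element types you do not need, before further processing.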

Common variations

Other document types work the same way. Each format has its own submodule under unstructured.partition: for example, partition_docx in unstructured.partition.docx handles Word documents, and partition_html in unstructured.partition.html handles HTML.

python
from unstructured.partition.docx import partition_docx

file_path = "example.docx"
elements = partition_docx(filename=file_path)

for element in elements:
    print(element.text)
output
Document title
Introduction paragraph text
...
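
When a script has to handle several formats, one option is to choose the partition function from the file extension. The sketch below is one possible approach under stated assumptions: the PARTITIONERS table, resolve_partitioner, and partition_file are hypothetical names, and the table lists only three formats. The library is imported lazily, so only partition_file needs it installed.

```python
import importlib
from pathlib import Path
from typing import Tuple

# Extension -> (module path, function name). Hypothetical table covering
# only the formats mentioned in this article.
PARTITIONERS = {
    ".pdf": ("unstructured.partition.pdf", "partition_pdf"),
    ".docx": ("unstructured.partition.docx", "partition_docx"),
    ".html": ("unstructured.partition.html", "partition_html"),
}


def resolve_partitioner(file_path: str) -> Tuple[str, str]:
    """Look up the module and function for a file, based on its extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in PARTITIONERS:
        raise ValueError(f"No partition function registered for {suffix!r}")
    return PARTITIONERS[suffix]


def partition_file(file_path: str):
    """Parse a file with the matching partition function (needs unstructured)."""
    module_name, function_name = resolve_partitioner(file_path)
    module = importlib.import_module(module_name)  # imported only when called
    return getattr(module, function_name)(filename=file_path)


print(resolve_partitioner("example.docx"))
```

The library also offers a general-purpose partition function in unstructured.partition.auto that detects the file type for you; an explicit table like this one is mainly useful when you want to restrict which formats a script accepts.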

Troubleshooting

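  • If you see ModuleNotFoundError or ImportError, make sure unstructured and the extra for your file type (for example, unstructured[pdf]) are installed in the active environment.
  • If parsing fails on certain PDFs, check whether the file is corrupted or password-protected.
  • For large documents, split the file into smaller parts (for example, page ranges) and parse each part separately to keep memory use down.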
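
The file checks above can be partly automated before parsing. This is a rough sketch rather than a real validator: pdf_preflight is a hypothetical helper, and both the header check and the search for an /Encrypt entry are heuristics that can miss problems or raise false alarms. It also reads the whole file into memory, which may be unsuitable for very large files.

```python
from pathlib import Path
from typing import List


def pdf_preflight(file_path: str) -> List[str]:
    """Return likely problems with a PDF; an empty list means none were found.

    Heuristic only: a file that passes may still fail to parse.
    """
    path = Path(file_path)
    if not path.is_file():
        return [f"{file_path} does not exist or is not a file"]

    data = path.read_bytes()
    if not data:
        return ["file is empty"]

    problems = []
    # A PDF is expected to carry a %PDF- marker near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header; file may be corrupted or not a PDF")
    # Encrypted PDFs normally reference an /Encrypt dictionary.
    if b"/Encrypt" in data:
        problems.append("found an /Encrypt entry; file may be password-protected")
    return problems


print(pdf_preflight("example.pdf"))
```

Running a check like this first turns an opaque parsing failure into a specific, readable message.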

Key Takeaways

  • Install the unstructured library to parse various document formats easily.
  • Use specific partition functions like partition_pdf or partition_docx for different file types.
  • Parsed output is a list of structured elements with accessible text content.
  • Check environment and file integrity if parsing errors occur.
Verified 2026-04