How to use Unstructured for document parsing
Quick answer
Use the unstructured Python library to parse documents: install it with pip, then call its partition functions to extract structured text from PDFs, DOCX files, and other formats. This avoids writing manual text-extraction code.
Prerequisites
- Python 3.8+
- pip install unstructured
- Basic Python knowledge
Setup
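Install the unstructured library with pip. The base package handles formats such as plain text and HTML; PDF and DOCX parsing depend on optional extras, which can be installed with pip install "unstructured[pdf,docx]".
pip install unstructured
Step by step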
Use partition_pdf from unstructured.partition.pdf to parse a PDF document and extract its content as a list of structured elements.
from unstructured.partition.pdf import partition_pdf
# Path to your PDF document
file_path = "example.pdf"
# Parse the PDF document
elements = partition_pdf(filename=file_path)
# Print extracted text elements
for element in elements:
    print(element.text)
Output:
This is the first paragraph of the PDF.
This is the second paragraph.
...
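Each parsed element carries a category label (for example Title or NarrativeText) alongside its text, so the flat list can be organized by type. The following is a minimal sketch, not part of the library: group_text_by_category is a hypothetical helper, and the sample objects are stand-ins that only mimic the text and category attributes so the snippet runs without any PDF on disk.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_category(elements):
    """Map each element category to the non-empty texts found in that category.

    Hypothetical helper: it only relies on the `.text` and `.category`
    attributes that parsed elements expose.
    """
    grouped = defaultdict(list)
    for element in elements:
        text = (getattr(element, "text", "") or "").strip()
        if text:
            category = getattr(element, "category", "Unknown")
            grouped[category].append(text)
    return dict(grouped)


# Stand-in elements (assumed data, not real parser output).
sample = [
    SimpleNamespace(category="Title", text="Document title"),
    SimpleNamespace(category="NarrativeText", text="This is the first paragraph of the PDF."),
    SimpleNamespace(category="NarrativeText", text="This is the second paragraph."),
    SimpleNamespace(category="NarrativeText", text="   "),  # whitespace-only, skipped
]

grouped = group_text_by_category(sample)
for category, texts in grouped.items():
    print(f"{category}: {len(texts)} element(s)")
```

With real data, the list returned by partition_pdf(filename=file_path) can be passed in place of sample.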
Common variations
You can parse other document types like DOCX or HTML by importing the corresponding partition functions from unstructured.partition. For example, use partition_docx for Word documents.
from unstructured.partition.docx import partition_docx
file_path = "example.docx"
elements = partition_docx(filename=file_path)
for element in elements:
    print(element.text)
Output:
Document title
Introduction paragraph text
...
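When a script has to handle several file types, the choice of partition function can be driven by the file extension. The sketch below is one possible way to do that; PARTITIONERS, resolve_partitioner, and parse_document are hypothetical names introduced for illustration, and the table covers only three formats. The library module is imported lazily, so the lookup itself works even where unstructured is not installed.

```python
import importlib
from pathlib import Path

# Extension -> (module path, function name). Illustrative subset only.
PARTITIONERS = {
    ".pdf": ("unstructured.partition.pdf", "partition_pdf"),
    ".docx": ("unstructured.partition.docx", "partition_docx"),
    ".html": ("unstructured.partition.html", "partition_html"),
}


def resolve_partitioner(path):
    """Return the (module, function) pair for a path, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARTITIONERS:
        raise ValueError(f"No partition function configured for {suffix!r}")
    return PARTITIONERS[suffix]


def parse_document(path):
    """Import the matching partition function on demand and run it on the file."""
    module_name, func_name = resolve_partitioner(path)
    module = importlib.import_module(module_name)  # needs unstructured installed
    partition_func = getattr(module, func_name)
    return partition_func(filename=str(path))


print(resolve_partitioner("example.docx"))
```

The library also provides a partition function in unstructured.partition.auto that detects the file type itself; an explicit table like this one is mainly useful when you want to restrict which formats a script accepts.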
Troubleshooting
- If you see ModuleNotFoundError, ensure unstructured is installed in your active environment.
- If parsing fails on certain PDFs, check whether the file is corrupted or encrypted.
- For large documents, consider processing in chunks or increasing memory limits.
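The first two checks above can be automated before calling a partition function. This is a rough sketch under stated assumptions: preflight is a hypothetical helper, the header test only looks for the %PDF- marker near the start of the file, and the search for an /Encrypt entry is a crude heuristic that may miss some encrypted files or flag an unencrypted one. It also reads the whole file into memory, which may not suit very large documents.

```python
import importlib.util
from pathlib import Path


def preflight(path, package="unstructured"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Is the package importable in the active environment?
    if importlib.util.find_spec(package) is None:
        problems.append(f"{package} is not importable in this environment")

    file = Path(path)
    if not file.is_file():
        problems.append(f"{file} does not exist or is not a file")
        return problems

    if file.suffix.lower() == ".pdf":
        data = file.read_bytes()
        if b"%PDF-" not in data[:1024]:
            # Missing header marker: likely truncated, corrupted, or not a PDF.
            problems.append(f"{file} does not start with a PDF header")
        elif b"/Encrypt" in data:
            # Heuristic only: encrypted PDFs normally declare an /Encrypt entry.
            problems.append(f"{file} appears to be encrypted")

    return problems


for problem in preflight("example.pdf"):
    print("Problem:", problem)
```

An empty result does not guarantee that parsing will succeed; it only rules out the most common causes listed above.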
Key Takeaways
- Install the unstructured library to parse various document formats easily.
- Use specific partition functions like partition_pdf or partition_docx for different file types.
- Parsed output is a list of structured elements with accessible text content.
- Check environment and file integrity if parsing errors occur.