How to use Unstructured for document chunking

Quick answer
Use the unstructured Python library to parse a document with its partition functions (for example, partition from unstructured.partition.auto), which split the content into elements such as titles, paragraphs, and list items. These elements can serve as chunks directly, or as input to further chunking, for downstream AI tasks like embeddings or summarization.

PREREQUISITES

  • Python 3.8+
  • pip install unstructured
  • Basic knowledge of Python file handling

Setup

Install the unstructured library via pip to enable document parsing and chunking.

bash
pip install unstructured

Step by step

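Use partition from unstructured.partition.auto to load a document and split it into elements; the function detects the file type for you. Each element represents a chunk such as a paragraph or heading and exposes its content through the text attribute. You can then process or combine these chunks as needed.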

python
from unstructured.partition.auto import partition

# Load and chunk a document file (e.g., PDF, DOCX, TXT)
chunks = partition("example_document.pdf")

# Print each chunk's text content
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:", chunk.text)
output
Chunk 1: Introduction to the document.
Chunk 2: Details about the methodology.
Chunk 3: Results and discussion.
Chunk 4: Conclusion and future work.
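
Because each element is either a heading or a piece of body text, one common next step is to group elements into sections that start at each heading. The sketch below illustrates that idea in plain Python. It assumes each element exposes text and category attributes (unstructured elements report categories such as "Title" and "NarrativeText"); FakeElement and group_by_heading are hypothetical names used only for illustration, so the example runs without a real document. The library also ships its own chunking helpers, such as chunk_by_title in unstructured.chunking.title, which are likely the better choice in practice.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeElement:
    """Hypothetical stand-in for a partitioned element (text and category only)."""
    text: str
    category: str


def group_by_heading(elements: Iterable) -> List[str]:
    """Start a new section at each Title element and join the section's text."""
    sections: List[List[str]] = []
    for el in elements:
        text = (el.text or "").strip()
        if not text:
            continue  # skip elements with no usable text
        if el.category == "Title" or not sections:
            sections.append([text])  # a heading (or the very first element) opens a section
        else:
            sections[-1].append(text)  # body text joins the current section
    return ["\n".join(parts) for parts in sections]


elements = [
    FakeElement("Introduction", "Title"),
    FakeElement("This report covers chunking.", "NarrativeText"),
    FakeElement("Methods", "Title"),
    FakeElement("We partition, then group.", "NarrativeText"),
    FakeElement("Sections follow headings.", "NarrativeText"),
]

sections = group_by_heading(elements)
for i, section in enumerate(sections, start=1):
    print(f"Section {i}:\n{section}\n")
```

With real data, the list returned by partition could be passed to group_by_heading in place of the stand-in elements.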

Common variations

  • Use partition_pdf, partition_docx, or partition_text (from unstructured.partition.pdf, unstructured.partition.docx, and unstructured.partition.text) for specific file types.
  • Combine chunks into larger blocks or split further by custom logic.
  • Integrate with AI embeddings or summarization pipelines after chunking.
python
from unstructured.partition.pdf import partition_pdf

# Specific PDF partitioning
chunks = partition_pdf("example_document.pdf")

# Example: combine first two chunks
combined_text = chunks[0].text + "\n" + chunks[1].text
print(combined_text)
output
Introduction to the document.
Details about the methodology.
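
The "combine or split by custom logic" variation can be sketched as a greedy packer that merges consecutive texts up to a character budget and hard-splits anything longer than the budget. This is a minimal illustration, not the library's own chunking algorithm; pack_texts and max_chars are hypothetical names, and the sample strings stand in for the text of real elements.

```python
from typing import Iterable, List


def pack_texts(texts: Iterable[str], max_chars: int = 500) -> List[str]:
    """Greedily merge consecutive texts so each chunk stays within max_chars.

    A single text longer than max_chars is hard-split into max_chars slices.
    """
    chunks: List[str] = []
    current = ""
    for text in texts:
        text = text.strip()
        if not text:
            continue  # ignore empty inputs
        if len(text) > max_chars:
            # Flush what we have, then slice the oversized text.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
            continue
        candidate = f"{current}\n{text}" if current else text
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep merging
        else:
            chunks.append(current)  # budget exceeded: start a new chunk
            current = text
    if current:
        chunks.append(current)
    return chunks


texts = [
    "Introduction to the document.",
    "Details about the methodology.",
    "Results and discussion.",
    "x" * 120,  # an oversized text that must be split
]
packed = pack_texts(texts, max_chars=80)
for i, chunk in enumerate(packed, start=1):
    print(f"Chunk {i} ({len(chunk)} chars)")
```

With partitioned elements, the input could be a generator such as (chunk.text for chunk in chunks). Character counts are only a rough proxy for model token limits, so the budget usually needs tuning for the target model.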

Troubleshooting

  • If you see empty chunks, verify the document format is supported and not corrupted.
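  • Scanned PDFs contain no embedded text, so the text must come from OCR. partition_pdf accepts a strategy argument (for example, strategy="ocr_only" or strategy="hi_res"), which relies on an OCR engine such as Tesseract being installed.
  • Ensure format-specific dependencies like pdfminer.six or python-docx are installed, for example through the extras pip install "unstructured[pdf]" or pip install "unstructured[docx]".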
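
A quick way to check for the empty-chunk problem is to count how many elements carry no usable text before passing them downstream. The helper below is a small sketch under the assumption that each element exposes a text attribute; drop_empty is a hypothetical name, and SimpleNamespace objects stand in for real elements.

```python
from types import SimpleNamespace


def drop_empty(elements):
    """Return (kept, dropped_count), skipping elements whose text is missing or whitespace."""
    kept = [el for el in elements if getattr(el, "text", None) and el.text.strip()]
    return kept, len(elements) - len(kept)


# Stand-in elements; in practice this would be the list returned by partition.
elements = [
    SimpleNamespace(text="Introduction to the document."),
    SimpleNamespace(text="   "),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Conclusion and future work."),
]

kept, dropped = drop_empty(elements)
print(f"kept {len(kept)} elements, dropped {dropped}")
if not kept:
    print("No text extracted; the file may be scanned or in an unsupported format.")
```

If every element is dropped, the scanned-PDF and missing-dependency causes listed above are the first things to check.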

Key Takeaways

  • Use unstructured.partition to automatically chunk documents into logical elements.
  • Choose specific partition functions for better control over file types like PDF or DOCX.
  • Preprocess scanned documents with OCR before chunking to extract text.
  • Chunks can be combined or processed individually for AI workflows like embeddings or summarization.
Verified 2026-04