How to · beginner · 3 min read

How to chunk HTML documents

Quick answer
To chunk HTML documents, parse the HTML to extract meaningful text blocks, then split these blocks into smaller chunks based on size or semantic boundaries using Python libraries like BeautifulSoup and text splitting utilities. This approach ensures clean, context-aware chunks suitable for AI processing.

PREREQUISITES

  • Python 3.8+
  • pip install beautifulsoup4
  • pip install tiktoken or any text splitter library

Setup

Install the necessary Python packages to parse HTML and split text into chunks. BeautifulSoup handles HTML parsing, a small helper function handles character-based chunking, and tiktoken (used in the variations below) handles token-based chunking.

bash
pip install beautifulsoup4 tiktoken

Step by step

This example parses an HTML document, extracts visible text, and chunks it into pieces of a maximum token length suitable for AI models.

python
from bs4 import BeautifulSoup

# Example HTML content
html_content = '''
<html>
<head><title>Example</title></head>
<body>
<h1>Welcome to AI chunking</h1>
<p>This is a paragraph with some text to chunk.</p>
<p>Another paragraph with more text to demonstrate chunking.</p>
<div>Additional content in a div tag.</div>
</body>
</html>
'''

# Parse HTML and extract text
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text(separator=' ', strip=True)

# Simple chunking function by character count
# Adjust chunk_size based on your token limit (e.g., 1000 tokens ~ 4000 chars approx.)
chunk_size = 400

def chunk_text(text, size):
    return [text[i:i+size] for i in range(0, len(text), size)]

chunks = chunk_text(text, chunk_size)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")
output
Chunk 1:
Example Welcome to AI chunking This is a paragraph with some text to chunk. Another paragraph with more text to demonstrate chunking. Additional content in a div tag.
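
The simple slicer above can cut a word in half at every chunk boundary. A common refinement, sketched below as an optional variant (the function name and sample text are illustrative, not part of the example above), is to back each boundary up to the last space inside the window so words stay intact:

```python
def chunk_text_on_words(text, size):
    """Split text into chunks of at most `size` characters,
    breaking on whitespace so words are never cut in half."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # back up to the last space inside the window;
            # a single word longer than `size` is still split
            space = text.rfind(' ', start, end)
            if space > start:
                end = space
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    return chunks

sample = "This is a paragraph with some text to chunk. " * 10
pieces = chunk_text_on_words(sample, 100)
print(len(pieces), "chunks")
```

Each chunk stays under the size limit, and no word is ever lost or duplicated across boundaries.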

Common variations

You can use token-based chunking with libraries like tiktoken to control chunk size by tokens instead of characters, which matches how OpenAI models actually measure input. Async parsing or streaming chunking is possible for very large HTML files, and chunk sizes should be tuned to the target model's context window.

python
from bs4 import BeautifulSoup
import tiktoken

html_content = '<html><body><p>Some large HTML content here...</p></body></html>'
soup = BeautifulSoup(html_content, 'html.parser')
text = soup.get_text(separator=' ', strip=True)

# Initialize tokenizer for gpt-4o
enc = tiktoken.encoding_for_model('gpt-4o')
tokens = enc.encode(text)

max_tokens = 500
chunks = []
for i in range(0, len(tokens), max_tokens):
    chunk_tokens = tokens[i:i+max_tokens]
    chunks.append(enc.decode(chunk_tokens))

print(f"Number of chunks: {len(chunks)}")
output
Number of chunks: 1
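
Another variation, mentioned in the quick answer, is to chunk along semantic boundaries rather than raw character or token counts. The sketch below (the tag list and chunk_size are illustrative choices, not fixed requirements) collects block-level elements with BeautifulSoup and greedily packs consecutive blocks into size-limited chunks:

```python
from bs4 import BeautifulSoup

html_content = '''
<html><body>
<h1>Welcome to AI chunking</h1>
<p>This is a paragraph with some text to chunk.</p>
<p>Another paragraph with more text to demonstrate chunking.</p>
<div>Additional content in a div tag.</div>
</body></html>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Collect text per block-level element so chunks follow document structure
blocks = [el.get_text(strip=True)
          for el in soup.find_all(['h1', 'h2', 'p', 'div', 'li'])]
blocks = [b for b in blocks if b]

# Greedily pack consecutive blocks into chunks under the size limit
chunk_size = 120
chunks, current = [], ''
for block in blocks:
    if current and len(current) + len(block) + 1 > chunk_size:
        chunks.append(current)
        current = block
    else:
        current = f"{current} {block}".strip()
if current:
    chunks.append(current)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk}")
```

Note that a single block longer than chunk_size still becomes one oversized chunk here; combine this with character- or token-based splitting if that matters for your model.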

Troubleshooting

  • If chunks are too large for your AI model, reduce chunk_size or max_tokens.
  • If HTML parsing misses text, ensure you use BeautifulSoup with the correct parser (e.g., html.parser or lxml).
  • For noisy HTML, strip script, style, and other irrelevant tags before extracting text, so boilerplate doesn't end up in your chunks.
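
For the noisy-HTML case in the last bullet, a minimal cleaning sketch (extend the tag list to suit your pages) removes unwanted tags with BeautifulSoup's decompose() before extracting text:

```python
from bs4 import BeautifulSoup

html_content = '''
<html>
<head><style>body { color: red; }</style></head>
<body>
<script>console.log("tracking");</script>
<p>Only this paragraph should survive.</p>
</body>
</html>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Remove tags whose contents should never reach the chunker
for tag in soup(['script', 'style', 'noscript']):
    tag.decompose()

text = soup.get_text(separator=' ', strip=True)
print(text)  # Only this paragraph should survive.
```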

Key Takeaways

  • Use BeautifulSoup to extract clean text from HTML before chunking.
  • Chunk text by character or token count to fit AI model input limits.
  • Token-based chunking with tiktoken aligns chunks with model tokenization.
  • Adjust chunk size based on your target AI model's max tokens.
  • Clean HTML content to avoid irrelevant or noisy chunks.
Verified 2026-04 · gpt-4o