How-to · Intermediate · 4 min read

How to build a web scraping tool for an AI agent

Quick answer
Build a web scraping tool for an AI agent by combining Python libraries like requests and BeautifulSoup to extract web data, then use an AI model like gpt-4o via the OpenAI API to analyze or summarize the scraped content. Automate the workflow by feeding scraped HTML or text to the AI agent for intelligent processing.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" requests beautifulsoup4 (quote the version spec so the shell does not treat `>` as a redirect)

Setup environment

Install required Python packages and set your OpenAI API key as an environment variable for secure access.

bash
pip install openai requests beautifulsoup4
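Before running the script, export your API key so the client can pick it up from the environment. The value below is a placeholder, not a real key; on Windows, use `setx OPENAI_API_KEY "..."` instead.

```shell
# Set the key for the current shell session (POSIX shells)
export OPENAI_API_KEY="sk-your-key-here"
```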

Step by step code

This example scrapes the HTML content of a webpage, extracts text using BeautifulSoup, then sends it to the gpt-4o model to summarize the content.

python
import os
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Scrape webpage
url = "https://example.com"
response = requests.get(url, timeout=10)  # timeout prevents hanging on unresponsive servers
response.raise_for_status()

# Step 2: Parse HTML and extract text
soup = BeautifulSoup(response.text, "html.parser")
text_content = soup.get_text(separator=" ", strip=True)

# Step 3: Use AI agent to summarize
prompt = f"Summarize the following webpage content:\n\n{text_content[:2000]}"  # truncate to ~2000 characters to stay under the model's token limit

completion = client.chat.completions.create(  # renamed to avoid shadowing the requests response above
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

summary = completion.choices[0].message.content
print("Summary of webpage content:")
print(summary)
output
Summary of webpage content:
[AI-generated summary of the scraped webpage text]

Common variations

  • Use asyncio with httpx for asynchronous scraping.
  • Switch to other models, such as Anthropic's claude-3-5-sonnet-20241022 via the Anthropic SDK, if you prefer its summaries.
  • Implement streaming responses for real-time AI output.
  • Expand scraping to multiple pages with pagination logic.
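For the pagination variation, a small URL generator keeps the scraping loop simple. This sketch assumes the site exposes pages through a `page` query parameter, which is a common but not universal convention; check the target site's actual URL scheme first.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def paginated_urls(base_url, pages):
    """Yield URLs for pages 1..pages, assuming a `page` query parameter."""
    scheme, netloc, path, _, fragment = urlsplit(base_url)
    for page in range(1, pages + 1):
        query = urlencode({"page": page})
        yield urlunsplit((scheme, netloc, path, query, fragment))

urls = list(paginated_urls("https://example.com/articles", 3))
print(urls)
```

Each generated URL can then be fetched and parsed with the same requests/BeautifulSoup code shown above, ideally with a short delay between requests to stay polite.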

Troubleshooting tips

  • If you get HTTP errors, verify the URL and your internet connection.
  • For API errors, check your OpenAI API key and usage limits.
  • Limit the input text size to avoid token limits by truncating or summarizing before sending to the AI.
  • Handle parsing errors by inspecting the HTML structure or using more robust selectors.
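For the token-limit tip, a small helper makes the truncation explicit. The 4-characters-per-token ratio is a rough heuristic for English text, not an exact count; use a tokenizer such as tiktoken when you need precision.

```python
def truncate_for_model(text, max_tokens=1000, chars_per_token=4):
    """Truncate text to a rough character budget (~4 chars/token for English),
    cutting at a word boundary so the model never sees a half word."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    return text[:limit].rsplit(" ", 1)[0]

short = truncate_for_model("word " * 5000, max_tokens=100)
print(len(short))
```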

Key Takeaways

  • Use the Python libraries requests and BeautifulSoup to scrape and parse web content efficiently.
  • Integrate scraped data with AI models like gpt-4o for intelligent summarization or analysis.
  • Always manage API keys securely via environment variables and handle token limits by truncating input.
  • Consider asynchronous scraping and streaming AI responses for scalable, real-time applications.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022