How to load web pages with LangChain

Use LangChain's WebBaseLoader to load web pages by providing the URL. It fetches each page, extracts the text from its HTML with BeautifulSoup, and returns Document objects ready for processing with LangChain pipelines.

PREREQUISITES

Python 3.8+
pip install "langchain>=0.2.0" langchain-community
pip install beautifulsoup4 requests
OpenAI API key, only if you pass the loaded documents to an OpenAI model downstream

Setup

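Install LangChain, the langchain-community package (which provides WebBaseLoader), beautifulsoup4 for HTML parsing, and requests for HTTP fetching. The loader itself needs no API key; set OPENAI_API_KEY as an environment variable only if you call an OpenAI model downstream.

bash

pip install langchain langchain-community beautifulsoup4 requests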
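
Before running the examples, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are assumptions made for illustration, not part of LangChain.

```python
import importlib.util

# Top-level import names (not pip names) that the examples rely on.
# This list is an assumption based on the install command in this guide.
REQUIRED = ["langchain", "langchain_community", "bs4", "requests"]


def missing_packages(modules):
    """Return the module names from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Note that beautifulsoup4 is imported as bs4, so the import name differs from the pip name.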

Step by step

This example demonstrates loading a web page using LangChain's WebBaseLoader. It fetches the page content and returns it as a list of Document objects.

python

from langchain_community.document_loaders import WebBaseLoader

# URL of the web page to load
url = "https://www.example.com"

# Initialize the loader
loader = WebBaseLoader(url)

# Load documents from the web page
documents = loader.load()

# Print the first 500 characters of the page content
print(documents[0].page_content[:500])

output (illustrative; WebBaseLoader returns text extracted from the HTML, not the raw markup, and the exact whitespace varies)

Example Domain
...
This domain is for use in illustrative examples in documents.
...
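
Text extracted from HTML often carries runs of blank lines and repeated spaces. The sketch below shows one way to tidy page_content before further processing; normalize_whitespace is a hypothetical helper written for this guide, not a LangChain function.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse repeated spaces/tabs and drop empty lines."""
    # Squeeze runs of spaces and tabs inside each line, then trim the line
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Keep only lines that still contain text
    return "\n".join(line for line in lines if line)


sample = "\n\n\nExample Domain\n\n\n\nHello   world\n"
print(normalize_whitespace(sample))
```

With the documents loaded above, this could be applied as normalize_whitespace(documents[0].page_content).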

Common variations

To load several pages, pass a list of URLs to WebBaseLoader; it returns one Document per URL. In an async workflow, iterate over the loader's alazy_load() async generator. For concurrent fetching of raw HTML, langchain_community also provides AsyncHtmlLoader. You can also combine WebBaseLoader with LangChain's text splitting and embedding tools for downstream tasks.

python

import asyncio

from langchain_community.document_loaders import WebBaseLoader

async def load_multiple_urls():
    urls = ["https://www.example.com", "https://www.python.org"]
    # A list of URLs yields one Document per page
    loader = WebBaseLoader(urls)
    # alazy_load() is an async generator; collect its results into a list
    documents = [doc async for doc in loader.alazy_load()]
    for doc in documents:
        print(doc.page_content[:200])

asyncio.run(load_multiple_urls())

output (illustrative; extracted text, exact whitespace varies)

Example Domain
...
Welcome to Python.org
...
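
To illustrate what text splitting does to a loaded page, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplified stand-in, not LangChain's implementation; chunk_text and its parameters are hypothetical names chosen for this example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of up to chunk_size characters that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    # Each chunk starts `step` characters after the previous one
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

In practice, LangChain's own splitters such as RecursiveCharacterTextSplitter are the usual choice, since they try to split on natural boundaries and operate on Document objects directly.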

Troubleshooting

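If you get HTTP errors, check your internet connection and confirm the URL is complete, including the https:// scheme.
For sites with JavaScript-rendered content, WebBaseLoader sees only the initial HTML and may miss dynamic content; consider a headless-browser loader such as PlaywrightURLLoader or SeleniumURLLoader.
If imports or parsing fail, ensure langchain-community, beautifulsoup4, and requests are installed and up to date.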
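
The checks above can be sketched in code: validate each URL first, then load pages one at a time so a single failure does not stop the rest. The helpers is_valid_http_url and load_each are hypothetical names for this guide, and load_one is an assumed callable that you supply.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse


def is_valid_http_url(url: str) -> bool:
    """True for absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def load_each(urls: List[str], load_one: Callable[[str], list]) -> Tuple[list, Dict[str, str]]:
    """Call load_one(url) per URL; return (results, errors keyed by URL)."""
    results: list = []
    errors: Dict[str, str] = {}
    for url in urls:
        if not is_valid_http_url(url):
            errors[url] = "not an absolute http(s) URL"
            continue
        try:
            results.extend(load_one(url))
        except Exception as exc:  # record the failure and keep going
            errors[url] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

With LangChain installed, load_one could be lambda u: WebBaseLoader(u).load(), so each URL is fetched by its own loader call.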

Key Takeaways

Use WebBaseLoader to fetch and parse static web pages easily with LangChain.
For multiple URLs, pass a list to WebBaseLoader; in async workflows, iterate over its alazy_load() generator.
Dynamic JavaScript content requires specialized loaders beyond WebBaseLoader.
Always verify URLs and network connectivity to avoid HTTP errors.
Combine web page loading with LangChain's text processing for powerful AI workflows.

Verified 2026-04
