How to load web page in LangChain

Quick answer
Use LangChain's WebBaseLoader from langchain_community.document_loaders to load a web page by URL. The loader fetches the page, parses the HTML, and returns the extracted text as Document objects ready for processing with LangChain chains or embeddings.

PREREQUISITES

  • Python 3.8+
  • pip install "langchain>=0.2" langchain-community
  • pip install requests beautifulsoup4
  • OpenAI API key (only for downstream usage; loading a page does not need it)

Setup

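Install LangChain, langchain-community (which provides WebBaseLoader), requests for HTTP, and beautifulsoup4 for HTML parsing.

If you plan to pass the loaded documents to OpenAI models or embeddings, set your OpenAI API key as an environment variable. The loader itself does not use it.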

bash
pip install langchain langchain-community requests beautifulsoup4

# On Linux/macOS (replace the placeholder with your own key)
export OPENAI_API_KEY="your-api-key"

# On Windows (PowerShell); setx takes effect in new terminal sessions
setx OPENAI_API_KEY "your-api-key"
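
Before running any downstream step that calls OpenAI, it can help to confirm that the variable is actually visible to Python. The snippet below is a minimal sketch; `require_api_key` is a hypothetical helper name chosen for illustration, not part of LangChain.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise a clear error.

    Illustrative helper only; it is not a LangChain function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running downstream steps.")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("OPENAI_API_KEY is set")
    except RuntimeError as err:
        print(err)
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.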

Step by step

Use WebBaseLoader to load a web page URL into LangChain documents. This example fetches the content and prints the first 500 characters.

python
from langchain_community.document_loaders import WebBaseLoader

# URL of the web page to load
url = "https://www.example.com"

# Initialize the loader
loader = WebBaseLoader(url)

# Load documents (list of Document objects)
docs = loader.load()

# Print the first 500 characters of the page content
print(docs[0].page_content[:500])
output
Example Domain
...

The loader returns text extracted from the HTML rather than raw markup; leading blank lines and the exact wording depend on the live page.
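
To build intuition for what "fetches and parses" means, here is a small offline stand-in that turns an HTML string into plain text using only the standard library. It is a sketch of the idea, not how WebBaseLoader is implemented (the loader relies on BeautifulSoup); `TextExtractor`, `html_to_text`, and the sample HTML are made up for illustration, and skipping script and style contents is a choice made in this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text nodes of `html`, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, defined inline so no network access is needed
sample_html = (
    "<html><head><title>Example Domain</title>"
    "<style>body{margin:0}</style></head>"
    "<body><h1>Example Domain</h1><p>Illustrative text.</p></body></html>"
)
print(html_to_text(sample_html))
```

The title and heading both survive as text while the stylesheet is dropped, which is roughly the shape of content to expect in `page_content`.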

Common variations

You can load multiple URLs by passing a list to WebBaseLoader. For asynchronous loading, iterate over the loader's alazy_load() method inside an event loop. To customize parsing, pass BeautifulSoup options through the bs_kwargs parameter, subclass the loader, or use another loader such as UnstructuredURLLoader.

python
import asyncio

from langchain_community.document_loaders import WebBaseLoader


async def load_multiple_urls():
    urls = ["https://www.example.com", "https://www.python.org"]
    # WebBaseLoader accepts a single URL or a list of URLs
    loader = WebBaseLoader(urls)
    # alazy_load() yields Document objects asynchronously
    async for doc in loader.alazy_load():
        print(doc.metadata["source"])
        print(doc.page_content[:200])


asyncio.run(load_multiple_urls())
output
https://www.example.com
Example Domain
...
https://www.python.org
Welcome to Python.org
...

Troubleshooting

  • If you get HTTP errors, check your internet connection and URL validity.
  • For SSL errors, ensure your Python environment has up-to-date certificates.
  • If content is empty, the page may rely on JavaScript rendering. WebBaseLoader does not execute JavaScript; a browser-based loader such as PlaywrightURLLoader or SeleniumURLLoader can help in that case.
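
Checking URL validity before handing anything to the loader can rule out one common cause of HTTP errors. The following is a minimal sketch using the standard library; `is_valid_http_url` and the candidate list are hypothetical, and the check only covers the scheme and host, not whether the page is reachable.

```python
from urllib.parse import urlparse


def is_valid_http_url(url):
    """Return True if `url` has an http(s) scheme and a host part."""
    try:
        parts = urlparse(url.strip())
    except (ValueError, AttributeError):
        # ValueError: malformed URL; AttributeError: not a string
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical inputs: only the first has both a scheme and a host
candidates = ["https://www.example.com", "www.example.com", "ftp://example.com/file", ""]
usable = [u for u in candidates if is_valid_http_url(u)]
print(usable)
```

A URL without a scheme, such as `www.example.com`, is rejected here because the scheme is what tells an HTTP client how to fetch it.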

Key Takeaways

  • Use WebBaseLoader to easily load web pages into LangChain documents.
  • To load several URLs, pass a list to WebBaseLoader; use alazy_load() for asynchronous iteration.
  • WebBaseLoader does not execute JavaScript; use a browser-based loader if JS rendering is needed.
Verified 2026-04