How to load web pages with LangChain
Quick answer
Use LangChain's WebBaseLoader to load web pages by providing the URL. It fetches each page and parses its content into documents ready for processing with LangChain pipelines.
Prerequisites
- Python 3.8+
- pip install "langchain>=0.2.0" langchain-community
- pip install requests beautifulsoup4
- OpenAI API key, needed only if you pass the loaded documents to an OpenAI model downstream
Setup
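Install LangChain, the community integrations package that provides WebBaseLoader, and the two libraries the loader relies on: requests for HTTP fetching and beautifulsoup4 for HTML parsing. Loading pages needs no API key; set your OpenAI API key as the OPENAI_API_KEY environment variable only if you use an OpenAI model downstream.
pip install langchain langchain-community requests beautifulsoup4
Step by step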
This example demonstrates loading a web page using LangChain's WebBaseLoader. It fetches the page content and returns it as a list of Document objects.
from langchain_community.document_loaders import WebBaseLoader
# URL of the web page to load
url = "https://www.example.com"
# Initialize the loader
loader = WebBaseLoader(url)
# Load documents from the web page
documents = loader.load()
# Print the first 500 characters of the page content
print(documents[0].page_content[:500])
Output
WebBaseLoader parses the response with BeautifulSoup and stores the extracted text rather than the raw markup, so the printed output is the page's visible text, not its HTML source. For https://www.example.com this is the title and heading "Example Domain" followed by the paragraph that begins "This domain is for use in illustrative examples in documents." The exact whitespace depends on the live page.
Common variations
You can load multiple URLs by passing a list to WebBaseLoader. To fetch several pages concurrently, use AsyncHtmlLoader from langchain_community; unlike WebBaseLoader, it keeps the raw HTML of each page in page_content. You can also combine either loader with LangChain's text splitting and embedding tools for downstream tasks.
from langchain_community.document_loaders import AsyncHtmlLoader
# URLs of the web pages to fetch concurrently
urls = ["https://www.example.com", "https://www.python.org"]
# Initialize the loader with the list of URLs
loader = AsyncHtmlLoader(urls)
# load() fetches all pages concurrently and returns one Document per URL
documents = loader.load()
# Print the first 200 characters of each page's raw HTML
for doc in documents:
    print(doc.page_content[:200])
Output
<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n ...\n<!doctype html>\n<html lang="en">\n<head>\n <meta charset="utf-8">\n <title>Welcome to Python.org</title>\n ...
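Before embedding, page text is usually split into overlapping chunks. LangChain ships its own text splitters for this; the sketch below is only a plain-Python stand-in that illustrates the idea without any LangChain dependency. The function name split_text, its parameters chunk_size and overlap, and the sample page_text string are assumptions made for this illustration, not part of the LangChain API.

```python
from typing import List


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters.

    Hypothetical helper for illustration; not a LangChain function.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between the starts of two chunks
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


# Stand-in for documents[0].page_content returned by a loader (assumed sample text)
page_text = "Example Domain. This domain is for use in illustrative examples in documents."
chunks = split_text(page_text, chunk_size=40, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, which helps keep sentences that straddle a boundary retrievable from at least one chunk. In a real pipeline you would likely pass the loaded documents to one of LangChain's splitters instead of this helper.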
Troubleshooting
- If you get HTTP errors, check your internet connection and URL correctness.
- For sites with JavaScript-rendered content, WebBaseLoader may not capture dynamic content; consider using a headless browser loader.
- Ensure requests is installed and up to date.
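Many HTTP errors come from malformed URLs, such as a missing scheme. A minimal sketch of a pre-flight check, using only the standard library, is shown below. The helper names is_loadable_url and partition_urls and the candidate URLs are hypothetical and exist only for this illustration. The check is purely syntactic: it does not confirm that the host exists or that the page is reachable.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def is_loadable_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host part.

    Hypothetical helper; a syntactic check only, no network access.
    """
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def partition_urls(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Split URLs into (valid, rejected) so bad entries can be reported."""
    valid = [u for u in urls if is_loadable_url(u)]
    rejected = [u for u in urls if not is_loadable_url(u)]
    return valid, rejected


# Assumed sample input: one good URL and three that a web loader cannot fetch
candidates = ["https://www.example.com", "www.python.org", "ftp://example.com/file", ""]
valid, rejected = partition_urls(candidates)
print("valid:", valid)
print("rejected:", rejected)
```

Only the entries in valid would then be passed to the loader; the rejected list can be logged so that typos are fixed at the source.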
Key Takeaways
- Use WebBaseLoader to fetch and parse static web pages easily with LangChain.
- For multiple URLs, pass a list to WebBaseLoader; to fetch pages concurrently as raw HTML, use AsyncHtmlLoader.
- Dynamic JavaScript content requires specialized loaders beyond WebBaseLoader.
- Always verify URLs and network connectivity to avoid HTTP errors.
- Combine web page loading with LangChain's text processing for powerful AI workflows.