
How to load web pages with LlamaIndex

Quick answer
Use LlamaIndex's SimpleWebPageReader to load and parse web pages. Instantiate the reader, pass a list of URLs to load_data, and it returns a list of documents ready for indexing or querying.

PREREQUISITES

  • Python 3.8+
  • pip install "llama-index>=0.6.10" (quote the requirement so the shell does not treat > as a redirect)
  • pip install requests
  • OpenAI API key, only if you go on to use LlamaIndex with OpenAI models (loading pages does not need one)

Setup

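Install llama-index and requests to enable web page loading; html2text and beautifulsoup4 are needed for HTML-to-text conversion and for BeautifulSoupWebReader, respectively. Set your OpenAI API key as an environment variable if you plan to use OpenAI models downstream.

bash
pip install llama-index requests html2text beautifulsoup4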
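
A small guard at the top of a script can catch a missing key before any model call is made. The sketch below is one way to do that; `require_env` is a hypothetical helper name and not part of LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it in your shell before running.")
    return value


# Typical use at the top of a script (commented out so the snippet runs without a key):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early like this keeps the key out of the source file while still giving a clear error message when it is missing.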

Step by step

This example loads a single URL with SimpleWebPageReader from LlamaIndex, then prints the extracted text content.

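python
from llama_index import SimpleWebPageReader
# On llama-index 0.10+, install llama-index-readers-web and use:
# from llama_index.readers.web import SimpleWebPageReader

# Loading pages needs no API key. Set OPENAI_API_KEY in your shell
# (export OPENAI_API_KEY="...") only before indexing or querying with OpenAI models.

# Instantiate the reader; html_to_text=True converts HTML to plain text
# and requires the html2text package
reader = SimpleWebPageReader(html_to_text=True)

# URL to load
url = 'https://www.example.com'

# Load the web page content as a list of documents
documents = reader.load_data(urls=[url])

# Print the extracted text from the first document
print(documents[0].text)

output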
Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

Common variations

  • Load multiple URLs by passing a list of URLs to load_data.
  • Use BeautifulSoupWebReader for more advanced HTML parsing.
  • Combine with LlamaIndex's GPTVectorStoreIndex for semantic search over web content.

python
from llama_index import BeautifulSoupWebReader
# On llama-index 0.10+: from llama_index.readers.web import BeautifulSoupWebReader

# BeautifulSoupWebReader parses HTML with BeautifulSoup (requires beautifulsoup4)
bs_reader = BeautifulSoupWebReader()
docs = bs_reader.load_data(urls=['https://www.example.com', 'https://www.python.org'])
for doc in docs:
    print(doc.text[:200])  # print the first 200 characters of each page

output
Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

Welcome to Python.org

The official home of the Python Programming Language...
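
When loading several URLs in one call, a single unreachable page can raise and abort the whole batch. A minimal sketch of one workaround is to call the loader once per URL and record failures separately. The helper name `load_each` is hypothetical, and it only assumes the loader accepts a `urls=` keyword list, as `load_data` does in the examples above.

```python
from typing import Any, Callable, Dict, List, Tuple


def load_each(
    load_fn: Callable[..., List[Any]], urls: List[str]
) -> Tuple[List[Any], Dict[str, str]]:
    """Call load_fn(urls=[url]) once per URL, collecting documents and failures."""
    documents: List[Any] = []
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            documents.extend(load_fn(urls=[url]))
        except Exception as exc:  # broad on purpose: report any loader error per URL
            failures[url] = f"{type(exc).__name__}: {exc}"
    return documents, failures


# Intended use with a real reader (commented out; needs llama-index and network access):
# docs, failed = load_each(bs_reader.load_data, ["https://www.example.com", "https://www.python.org"])
```

The trade-off is one loader call per page instead of one call for the batch, which is usually acceptable for small URL lists.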

Troubleshooting

  • If you get a requests.exceptions.ConnectionError, check your internet connection and confirm the URL is correct and reachable.
  • For SSL errors, ensure your Python environment has up-to-date certificates.
  • If no text is extracted, verify the page is not heavily JavaScript-rendered (LlamaIndex loaders do not execute JS).
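
To check the last point, a rough heuristic is to look at the raw HTML: a page that contains script tags but almost no visible text is probably rendered client-side. The sketch below uses only the standard library; `looks_js_rendered` and the 200-character threshold are illustrative assumptions, not part of LlamaIndex, and the result is a hint rather than a definitive test.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script>/<style> and count <script> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.script_count = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script or style elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Return True if the HTML has scripts but less than min_chars of visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    return parser.script_count > 0 and len(visible) < min_chars
```

You could pass it the raw HTML of a page (for example, the body of an HTTP response) before deciding whether a static loader is a good fit.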

Key Takeaways

  • Use SimpleWebPageReader from LlamaIndex to load web pages by URL.
  • Pass a list of URLs to load_data to batch load multiple pages.
  • For richer HTML parsing, use BeautifulSoupWebReader.
  • LlamaIndex loaders do not execute JavaScript; use static HTML pages or pre-rendered content.
  • Always set your API keys securely via environment variables.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022