How-to · Beginner · 4 min read

Browser Use vision capabilities

Quick answer
Use the browser-use Python package, which drives a real browser with an LLM agent and has vision built in. Construct an Agent with a vision-capable model such as gpt-4o; with use_vision=True (the default), the agent attaches a screenshot of the current page to each model call, so tasks like describing an image or reading a chart work directly inside the browser-automation loop.

PREREQUISITES

  • Python 3.11+ (required by browser-use)
  • OpenAI API key with access to a vision-capable model (e.g. gpt-4o)
  • pip install browser-use langchain-openai
  • pip install playwright
  • playwright install chromium

Setup

Install browser-use (plus langchain-openai for the LLM wrapper) and Playwright's Chromium build for browser automation, then export your OpenAI API key as the OPENAI_API_KEY environment variable.

bash
pip install browser-use langchain-openai playwright
playwright install chromium
export OPENAI_API_KEY="sk-your-key-here"
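Before the first run, a quick stdlib check (independent of browser-use) confirms the key is actually visible to Python:

```python
import os

# The agent's LLM client reads the key from the environment,
# so verify it is set in this process before launching anything.
key = os.environ.get("OPENAI_API_KEY")
print("OPENAI_API_KEY set:", key is not None)
```

If this prints False, export the variable in the same shell that runs your script.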

Step by step

Create an Agent from browser_use with a vision-capable LLM and a task that requires looking at the page. Because use_vision defaults to True, the agent sends a screenshot to the model at every step, grounding its actions in what the page actually shows.

python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

# ChatOpenAI picks up OPENAI_API_KEY from the environment
llm = ChatOpenAI(model="gpt-4o")

agent = Agent(
    task="Open https://example.com/image.jpg and describe the image",
    llm=llm,
    use_vision=True,  # default: attach page screenshots to each LLM call
)

async def main():
    history = await agent.run()
    print("Agent response:", history.final_result())

asyncio.run(main())
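Agent runs can stall on slow pages, so it helps to bound them. This sketch uses plain asyncio.wait_for; the slow_task coroutine is a hypothetical stand-in for the agent.run() call above:

```python
import asyncio

async def run_with_timeout(coro, seconds: float):
    # Cancel the coroutine if it exceeds the time budget
    return await asyncio.wait_for(coro, timeout=seconds)

async def slow_task():
    # Stand-in for agent.run(); swap in the real coroutine
    await asyncio.sleep(0.1)
    return "done"

result = asyncio.run(run_with_timeout(slow_task(), seconds=5))
print(result)  # -> done
```

If the budget is exceeded, wait_for raises asyncio.TimeoutError, which you can catch to retry or fail gracefully.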

Common variations

You can switch to a cheaper vision-capable model such as gpt-4o-mini, or set use_vision=False to skip screenshots entirely on text-only tasks and save tokens. Since agent.run() is a coroutine, it also composes naturally with other async code.

python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    # gpt-4o-mini is cheaper and still vision-capable;
    # use_vision=False would skip screenshots for text-only tasks
    agent = Agent(
        task="Open https://example.com and describe the page layout",
        llm=ChatOpenAI(model="gpt-4o-mini"),
        use_vision=True,
    )
    history = await agent.run()
    print("Agent response:", history.final_result())

asyncio.run(main())
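Because agent.run() is a coroutine, several independent agents can also run concurrently with asyncio.gather. The fake_agent coroutines below are placeholders standing in for real Agent(...).run() calls:

```python
import asyncio

async def fake_agent(name: str) -> str:
    # Placeholder for an Agent(...).run() coroutine
    await asyncio.sleep(0.05)
    return f"{name}: finished"

async def run_all():
    # gather preserves argument order in its result list
    return await asyncio.gather(
        fake_agent("describe-image"),
        fake_agent("summarize-page"),
    )

results = asyncio.run(run_all())
print(results)
```

Each real agent should get its own browser context so concurrent runs do not share page state.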

Troubleshooting

  • If Playwright reports a missing browser executable, run playwright install chromium to download the browser build.
  • Ensure the OPENAI_API_KEY environment variable is set in the shell that launches the script.
  • Choose a vision-capable model such as gpt-4o or gpt-4o-mini; text-only models cannot process the screenshots.
  • If the agent returns vague results, make the task description more specific about what to look at on the page.
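If you suspect the browser engine itself, this stdlib-only check shows whether the Playwright CLI (installed alongside the pip package) is visible on your PATH:

```python
import shutil

# None here usually means the pip package (or your venv's bin
# directory) is missing from PATH
cli = shutil.which("playwright")
print("playwright CLI found:", cli is not None)
```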

Key Takeaways

  • browser-use drives a Playwright-controlled browser with an LLM agent; vision support is built in.
  • With use_vision=True (the default), the agent attaches a screenshot of the page to each LLM call.
  • Pair vision with a precise task description for more reliable results.
  • agent.run() is async, so it composes with asyncio for timeouts and concurrency.
  • Set OPENAI_API_KEY and install Playwright's Chromium before the first run.
Verified 2026-04 · gpt-4o, gpt-4o-mini