WebArena: A Realistic Web Environment for Building Autonomous Agents
Published at ICLR 2024
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, et al. (Carnegie Mellon University)
Project Website: https://webarena.dev
Background & Motivation
Current language agents are mostly tested in simplified web or virtual environments, which limits their generalization to real-world tasks. WebArena aims to provide:
- A high-fidelity web environment
- A reproducible testing platform
- Functional evaluation, focusing on whether tasks are actually accomplished rather than superficial action matching
Environment Design
Types of Websites
- E-commerce (OneStopShop): simulates online shopping
- Forum (Reddit-like): supports posts, comments, and upvotes
- Collaborative Development (GitLab-like): for managing code repos and merge requests
- Content Management System (CMS): for editing and publishing content
Tools and Knowledge Resources
- Map
- Calculator
- Scratchpad
- Offline Wikipedia and user manuals
Technical Implementation
- Each site is deployed in Docker for reproducibility
- Observations support multiple formats: DOM tree, accessibility tree, screenshots
- Action space includes clicking, typing, switching tabs, and URL navigation (a minimal interaction-loop sketch follows this list)
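To make the interaction model concrete, here is a minimal, illustrative observe-act loop over a gym-style web environment. The action dataclasses and the `env.reset`/`env.step`/`agent.act` interface are assumptions for this sketch, not the exact WebArena API.

```python
from dataclasses import dataclass

# Illustrative actions mirroring the action space described above
# (clicking, typing, tab switching, URL navigation); names are hypothetical.
@dataclass
class Click:
    element_id: str          # id of a node in the accessibility tree

@dataclass
class Type:
    element_id: str
    text: str

@dataclass
class GotoURL:
    url: str

def run_episode(env, agent, max_steps: int = 30):
    """Generic observe-act loop: the agent sees the current page
    (e.g. as an accessibility tree) and emits the next action."""
    obs = env.reset()                       # initial observation of the start page
    info = {}
    for _ in range(max_steps):
        action = agent.act(obs)             # LLM maps intent + observation to an action
        obs, done, info = env.step(action)  # the browser executes the action
        if done:                            # agent stops or the episode ends
            break
    return obs, info
```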
Benchmark Task Suite
- 812 natural language tasks derived from 241 templates (a toy instantiation example follows this list)
- Task types:
- Information Seeking
- Site Navigation
- Content & Configuration
- All tasks require multi-step, multi-page interactions
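The template-to-task expansion can be pictured with a toy example; the placeholder names and values below are made up and are not drawn from the benchmark itself.

```python
# Hypothetical intent template with placeholders; each assignment of
# values yields one concrete natural-language task.
template = "What is the total price of {quantity} units of {product} on OneStopShop?"

instantiations = [
    {"quantity": 2, "product": "the cheapest wireless mouse"},
    {"quantity": 5, "product": "the highest-rated desk lamp"},
]

tasks = [template.format(**values) for values in instantiations]
print(tasks)
```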
Evaluation Method
For Info-Seeking Tasks
- Exact Match
- Must Include
- Fuzzy Match (uses GPT-4 to judge semantic equivalence; a sketch of all three follows below)
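A rough sketch of what the three answer-matching modes could look like. The helper names and the judge prompt are illustrative, and the GPT-4 judge is abstracted as a caller-supplied function.

```python
def exact_match(answer: str, reference: str) -> bool:
    """The predicted answer must equal the reference (after trivial normalization)."""
    return answer.strip().lower() == reference.strip().lower()

def must_include(answer: str, required: list[str]) -> bool:
    """Every required phrase must appear somewhere in the predicted answer."""
    a = answer.lower()
    return all(phrase.lower() in a for phrase in required)

def fuzzy_match(answer: str, reference: str, llm_judge) -> bool:
    """Delegate to an LLM judge (GPT-4 in the paper) for semantic equivalence;
    `llm_judge` is an assumed callable that returns the model's reply as text."""
    prompt = (
        "Do these two answers convey the same information? Reply 'yes' or 'no'.\n"
        f"Answer A: {answer}\nAnswer B: {reference}"
    )
    return llm_judge(prompt).strip().lower().startswith("yes")
```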
For Interaction/Configuration Tasks
- Programmatically check the page state or backend DB
- Ensure actions produce the desired functional result (a toy check follows below)
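For interaction tasks the evaluator inspects the resulting state rather than the action sequence. Below is a toy functional check for a made-up task ("create a repository named demo"); the fetch function and field names are assumptions, not WebArena's actual evaluators.

```python
def repo_was_created(fetch_repos, expected_name: str) -> bool:
    """Functional check: the task counts as solved only if the desired
    repository actually exists afterwards, regardless of which clicks or
    page visits produced it. `fetch_repos` is a hypothetical callable that
    queries the site's backend (e.g. its REST API) and returns a list of
    dicts with a 'name' field."""
    return any(repo["name"] == expected_name for repo in fetch_repos())
```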
Baseline Results
Several LLM agents were evaluated using:
- Direct action prediction
- Chain-of-Thought (CoT) reasoning
- An optional "Unachievable Hint" (UA Hint) telling the agent that some tasks may be infeasible (prompt assembly is sketched after the results table)
Model | CoT | UA Hint | Success Rate (%) |
---|---|---|---|
GPT-4 | ✓ | ✓ | 14.41 |
GPT-3.5 | ✓ | ✓ | 8.75 |
Text-Bison-001 | ✓ | ✓ | 5.05 |
Human | – | – | 78.24 |
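As a rough illustration of how the two prompting knobs differ, here is a sketch of prompt assembly with and without CoT and the UA hint. The wording is invented and does not reproduce the paper's actual prompts.

```python
# Invented wording; the paper's actual prompts differ.
UA_HINT = (
    "Some tasks may be impossible to complete on this site. "
    "If so, stop and answer 'N/A'."
)

def build_prompt(intent: str, observation: str, use_cot: bool, use_ua_hint: bool) -> str:
    parts = [f"Objective: {intent}", f"Current page:\n{observation}"]
    if use_ua_hint:
        parts.append(UA_HINT)
    if use_cot:
        parts.append("Think step by step about what to do next, then output one action.")
    else:
        parts.append("Output the next action directly.")
    return "\n\n".join(parts)
```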
Error Analysis
- Models often incorrectly decide tasks are unachievable
- Lack recovery strategies after mistakes
- Poor generalization across similar task templates
Comparison with Existing Work
Benchmark | Dynamic Interaction | Realistic Web | Diverse Human Tasks | Functional Evaluation |
---|---|---|---|---|
WebArena | ✓ | ✓ | ✓ | ✓ |
Mind2Web | ✗ | ✓ | ✓ | ✗ |
MiniWoB++ | ✓ | ✗ | ✗ | ✓ |
Conclusion
WebArena offers:
- A realistic, multi-domain, multi-step web environment
- End-to-end natural language to web interaction evaluation
- Results showing that even GPT-4 struggles on realistic web tasks
Future work should focus on:
- Enhancing exploration and recovery capabilities
- Improving task planning and memory
- Strengthening generalization across templates
Project Page: https://webarena.dev