
WebArena: A Realistic Web Environment for Building Autonomous Agents#

Published at ICLR 2024
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, et al. (Carnegie Mellon University)
Project website: https://webarena.dev


🧠 Background & Motivation#

Current language agents are mostly evaluated in simplified or simulated web environments, which limits how well they generalize to real-world tasks. WebArena aims to provide:

  • A high-fidelity web environment
  • A reproducible testing platform
  • Functional evaluation, focusing on whether tasks are actually accomplished rather than superficial action matching

🌐 Environment Design#

Types of Websites#

  • E-commerce (OneStopShop): simulates online shopping
  • Forum (Reddit-like): supports posts, comments, and upvotes
  • Collaborative Development (GitLab-like): for managing code repos and merge requests
  • Content Management System (CMS): for editing and publishing content

Tools and Knowledge Resources#

  • Map
  • Calculator
  • Scratchpad
  • Offline Wikipedia and user manuals

Technical Implementation#

  • Each site is deployed in Docker for reproducibility
  • Observations support multiple formats: DOM tree, accessibility tree, screenshots
  • Action space includes clicking, typing, switching tabs, and URL navigation
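
To make the observation and action space concrete, here is a minimal sketch of one agent step against a locally deployed site. It drives Playwright directly rather than the benchmark's own environment wrapper; the localhost URL and the CSS selectors are placeholders, not values from the paper.

```python
from playwright.sync_api import sync_playwright

# Placeholder URL for a locally deployed (Dockerized) WebArena-style site.
SHOPPING_URL = "http://localhost:7770"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Action: navigate to a URL.
    page.goto(SHOPPING_URL)

    # Observation: an accessibility-tree snapshot the agent can read as text
    # (WebArena also supports raw DOM and screenshot observations).
    print(page.accessibility.snapshot())

    # Actions: type into a search box and click a button (selectors are illustrative).
    page.fill("input[name='q']", "running shoes")
    page.click("button[type='submit']")

    # Observation: a screenshot, another supported modality.
    page.screenshot(path="step.png")

    browser.close()
```

In practice an agent loops over exactly these primitives: read an observation, choose one action, and repeat until the task is done.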

📊 Benchmark Task Suite#

  • 812 natural language tasks derived from 241 templates
  • Task types:
    • Information Seeking
    • Site Navigation
    • Content & Configuration
  • All tasks require multi-step, multi-page interactions
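
To illustrate how 241 templates expand into 812 concrete tasks, here is a toy instantiation step; the template wording and placeholder values below are invented for illustration, not taken from the benchmark.

```python
from itertools import product

# A hypothetical intent template with typed placeholders.
template = "Find the {attribute} of the most recent order placed by {customer}."

attributes = ["total price", "shipping address"]
customers = ["Emma Lopez", "John Smith"]

# Each combination of placeholder values yields one concrete task instance,
# so a handful of templates can cover many distinct tasks.
tasks = [
    template.format(attribute=attribute, customer=customer)
    for attribute, customer in product(attributes, customers)
]

for task in tasks:
    print(task)
```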

✅ Evaluation Method#

For Info-Seeking Tasks#

  • Exact Match
  • Must Include
  • Fuzzy Match (uses GPT-4 to judge semantic equivalence)
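
A rough sketch of what these three checks could look like in code; the function names are illustrative and the LLM judge is stubbed out as a plain callable rather than the paper's actual evaluator.

```python
from typing import Callable

def exact_match(prediction: str, reference: str) -> bool:
    # Correct only if the normalized answer equals the reference exactly.
    return prediction.strip().lower() == reference.strip().lower()

def must_include(prediction: str, required_phrases: list[str]) -> bool:
    # Correct if every required phrase appears somewhere in the answer.
    pred = prediction.lower()
    return all(phrase.lower() in pred for phrase in required_phrases)

def fuzzy_match(prediction: str, reference: str, llm_judge: Callable[[str], str]) -> bool:
    # Delegates semantic equivalence to an LLM judge (GPT-4 in the paper).
    prompt = (
        "Do the following two answers convey the same information? "
        f"Reply yes or no.\nA: {prediction}\nB: {reference}"
    )
    return llm_judge(prompt).strip().lower().startswith("yes")
```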

For Interaction/Configuration Tasks#

  • Programmatically check the page state or backend DB
  • Ensure actions produce the desired functional result
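
A minimal sketch of such a functional check, assuming the forum site exposes a JSON endpoint for listing posts; the endpoint path and response shape are placeholders, not the site's real API.

```python
import requests

def check_post_created(base_url: str, forum: str, expected_title: str) -> bool:
    """Verify the functional outcome after the agent finishes:
    a post with the expected title exists in the given forum."""
    resp = requests.get(f"{base_url}/api/forums/{forum}/posts", timeout=10)
    resp.raise_for_status()
    titles = [post.get("title", "") for post in resp.json()]
    return expected_title in titles

# Example: after the agent claims success, inspect the backend state directly
# instead of comparing its action sequence to a reference trajectory.
# success = check_post_created("http://localhost:9999", "gadgets", "My new post")
```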

🤖 Baseline Results#

Several LLM agents were evaluated using:

  • Direct action prediction
  • Chain-of-Thought (CoT) reasoning
  • With or without an "Unachievable" hint (UA Hint) that tells the agent some tasks cannot be completed (see the prompt sketch below the results table)

| Model | CoT | UA Hint | Success Rate (%) |
| --- | --- | --- | --- |
| GPT-4 | ✅ | ✗ | 14.41 |
| GPT-3.5 | ✅ | ✅ | 8.75 |
| Text-Bison-001 | ✅ | ✅ | 5.05 |
| Human | - | - | 78.24 |
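
For a sense of how the CoT and UA-Hint settings differ, here is an illustrative prompt builder; the wording is paraphrased for illustration rather than the paper's exact prompt.

```python
def build_prompt(objective: str, observation: str, use_cot: bool, ua_hint: bool) -> str:
    parts = [
        "You are an autonomous web agent.",
        f"Objective: {objective}",
        f"Current page (accessibility tree):\n{observation}",
    ]
    if use_cot:
        # Chain-of-thought: ask the model to reason before committing to an action.
        parts.append("Think step by step about what to do next before answering.")
    if ua_hint:
        # Unachievable hint: warn the model that some tasks cannot be completed.
        parts.append('If the task is impossible to complete, answer "N/A".')
    parts.append("Then respond with exactly one action, e.g. click [id] or type [id] [text].")
    return "\n\n".join(parts)
```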

🔍 Error Analysis#

  • Models often incorrectly conclude that achievable tasks are unachievable
  • They lack strategies for recovering from earlier mistakes
  • They generalize poorly across similar task templates

🆚 Comparison with Existing Work#

| Benchmark | Dynamic Interaction | Realistic Web | Diverse Human Tasks | Functional Evaluation |
| --- | --- | --- | --- | --- |
| WebArena | ✅ | ✅ | ✅ | ✅ |
| Mind2Web | ✗ | ✅ | ✅ | ✗ |
| MiniWoB++ | ✅ | ✗ | ✗ | ✅ |

📌 Conclusion#

WebArena offers:

  • A realistic, multi-domain, multi-step web environment
  • End-to-end natural language to web interaction evaluation
  • Evidence that even GPT-4 struggles on realistic web tasks (14.41% success vs. 78.24% for humans)

Future work should focus on:

  • Enhancing exploration and recovery capabilities
  • Improving task planning and memory
  • Strengthening generalization across templates

📂 Project Page: https://webarena.dev
