WebArena: A Realistic Web Environment for Building Autonomous Agents
Published at ICLR 2024
Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, et al. (Carnegie Mellon University)
Project Website: https://webarena.dev
Background & Motivation
Current language agents are mostly tested in simplified web or virtual environments, which limits their generalization to real-world tasks. WebArena aims to provide:
- A high-fidelity web environment
- A reproducible testing platform
- Functional evaluation, focusing on whether tasks are actually accomplished rather than superficial action matching
Environment Design
Types of Websites
- E-commerce (OneStopShop): simulates online shopping
- Forum (Reddit-like): supports posts, comments, and upvotes
- Collaborative Development (GitLab-like): for managing code repos and merge requests
- Content Management System (CMS): for editing and publishing content
Tools and Knowledge Resources
- Map
- Calculator
- Scratchpad
- Offline Wikipedia and user manuals
Technical Implementation
- Each site is deployed in Docker for reproducibility
- Observations support multiple formats: DOM tree, accessibility tree, screenshots
- Action space includes clicking, typing, switching tabs, and URL navigation (a minimal interaction-loop sketch follows this list)
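To make the interaction model concrete, here is a minimal, illustrative observe-act loop over a gym-style web environment. The action dataclasses and the `env.reset`/`env.step`/`agent.act` interface are assumptions for this sketch, not the exact WebArena API.

```python
from dataclasses import dataclass

# Illustrative actions mirroring the action space described above
# (clicking, typing, tab switching, URL navigation); names are hypothetical.
@dataclass
class Click:
    element_id: str          # id of a node in the accessibility tree

@dataclass
class Type:
    element_id: str
    text: str

@dataclass
class GotoURL:
    url: str

def run_episode(env, agent, max_steps: int = 30):
    """Generic observe-act loop: the agent sees the current page
    (e.g. as an accessibility tree) and emits the next action."""
    obs = env.reset()                       # initial observation of the start page
    info = {}
    for _ in range(max_steps):
        action = agent.act(obs)             # LLM maps intent + observation to an action
        obs, done, info = env.step(action)  # the browser executes the action
        if done:                            # agent stops or the episode ends
            break
    return obs, info
```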
Benchmark Task Suite
- 812 natural language tasks derived from 241 templates (a toy instantiation example follows this list)
- Task types:
- Information Seeking
- Site Navigation
- Content & Configuration
- All tasks require multi-step, multi-page interactions
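The template-to-task expansion can be pictured with a toy example; the placeholder names and values below are made up and are not drawn from the benchmark itself.

```python
# Hypothetical intent template with placeholders; each assignment of
# values yields one concrete natural-language task.
template = "What is the total price of {quantity} units of {product} on OneStopShop?"

instantiations = [
    {"quantity": 2, "product": "the cheapest wireless mouse"},
    {"quantity": 5, "product": "the highest-rated desk lamp"},
]

tasks = [template.format(**values) for values in instantiations]
print(tasks)
```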
Evaluation Method
For Info-Seeking Tasks
- Exact Match
- Must Include
- Fuzzy Match (uses GPT-4 to judge semantic equivalence; a sketch of all three follows below)
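A rough sketch of what the three answer-matching modes could look like. The helper names and the judge prompt are illustrative, and the GPT-4 judge is abstracted as a caller-supplied function.

```python
def exact_match(answer: str, reference: str) -> bool:
    """The predicted answer must equal the reference (after trivial normalization)."""
    return answer.strip().lower() == reference.strip().lower()

def must_include(answer: str, required: list[str]) -> bool:
    """Every required phrase must appear somewhere in the predicted answer."""
    a = answer.lower()
    return all(phrase.lower() in a for phrase in required)

def fuzzy_match(answer: str, reference: str, llm_judge) -> bool:
    """Delegate to an LLM judge (GPT-4 in the paper) for semantic equivalence;
    `llm_judge` is an assumed callable that returns the model's reply as text."""
    prompt = (
        "Do these two answers convey the same information? Reply 'yes' or 'no'.\n"
        f"Answer A: {answer}\nAnswer B: {reference}"
    )
    return llm_judge(prompt).strip().lower().startswith("yes")
```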
For Interaction/Configuration Tasks
- Programmatically check the page state or backend DB
- Ensure actions produce the desired functional result (a toy check follows below)
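For interaction tasks the evaluator inspects the resulting state rather than the action sequence. Below is a toy functional check for a made-up task ("create a repository named demo"); the fetch function and field names are assumptions, not WebArena's actual evaluators.

```python
def repo_was_created(fetch_repos, expected_name: str) -> bool:
    """Functional check: the task counts as solved only if the desired
    repository actually exists afterwards, regardless of which clicks or
    page visits produced it. `fetch_repos` is a hypothetical callable that
    queries the site's backend (e.g. its REST API) and returns a list of
    dicts with a 'name' field."""
    return any(repo["name"] == expected_name for repo in fetch_repos())
```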
Baseline Results
Several LLM agents were evaluated using:
- Direct action prediction
- Chain-of-Thought (CoT) reasoning
- An optional "Unachievable Hint" (UA Hint) telling the agent that some tasks may be infeasible (prompt assembly is sketched after the results table)
Model | CoT | UA Hint | Success Rate (%) |
---|---|---|---|
GPT-4 | ✓ | ✓ | 14.41 |
GPT-3.5 | ✓ | ✓ | 8.75 |
Text-Bison-001 | ✓ | ✓ | 5.05 |
Human | – | – | 78.24 |
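As a rough illustration of how the two prompting knobs differ, here is a sketch of prompt assembly with and without CoT and the UA hint. The wording is invented and does not reproduce the paper's actual prompts.

```python
# Invented wording; the paper's actual prompts differ.
UA_HINT = (
    "Some tasks may be impossible to complete on this site. "
    "If so, stop and answer 'N/A'."
)

def build_prompt(intent: str, observation: str, use_cot: bool, use_ua_hint: bool) -> str:
    parts = [f"Objective: {intent}", f"Current page:\n{observation}"]
    if use_ua_hint:
        parts.append(UA_HINT)
    if use_cot:
        parts.append("Think step by step about what to do next, then output one action.")
    else:
        parts.append("Output the next action directly.")
    return "\n\n".join(parts)
```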
Error Analysis
- Models often incorrectly decide tasks are unachievable
- Lack recovery strategies after mistakes
- Poor generalization across similar task templates
Comparison with Existing Work
Benchmark | Dynamic Interaction | Realistic Web | Diverse Human Tasks | Functional Evaluation |
---|---|---|---|---|
WebArena | ✓ | ✓ | ✓ | ✓ |
Mind2Web | ✗ | ✓ | ✓ | ✗ |
MiniWoB++ | ✓ | ✗ | ✗ | ✓ |
Conclusion
WebArena offers:
- A realistic, multi-domain, multi-step web environment
- End-to-end natural language to web interaction evaluation
- Results showing that even GPT-4 struggles on realistic web tasks
Future work should focus on:
- Enhancing exploration and recovery capabilities
- Improving task planning and memory
- Strengthening generalization across templates
Project Page: https://webarena.dev