The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, Nicolas Chapados
- 🏛 Institutions
- ServiceNow Research, ServiceNow, Laval University, imean.ai, Microsoft, CMU, Polytechnique Montréal, Université de Montréal
- 📅 Date
- December 6, 2024
- 📑 Publisher
- TMLR
- 💻 Env
- Web
- 🔑 Keywords
TLDR
BrowserGym is a unified ecosystem for web-agent research that standardizes observation and action spaces while wrapping multiple existing benchmarks under one interface. The paper also introduces AgentLab for agent creation and analysis, and uses the ecosystem to run a large cross-benchmark comparison of six frontier LLMs.
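The idea of wrapping several benchmarks behind one interface can be illustrated with a small, self-contained sketch. This is not BrowserGym's actual API: `Observation`, `WebTask`, `ToyClickTask`, `REGISTRY`, and `evaluate` are hypothetical names invented for illustration, and the two "benchmarks" are toy stand-ins. It only shows how a shared observation and action format lets one agent loop run unchanged across differently sourced tasks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol, Tuple


@dataclass
class Observation:
    """Hypothetical standardized observation shared by every wrapped task."""
    goal: str
    page_text: str


class WebTask(Protocol):
    """Hypothetical common interface each wrapped benchmark would implement."""
    def reset(self) -> Observation: ...
    def step(self, action: str) -> Tuple[Observation, float, bool]: ...


class ToyClickTask:
    """Toy stand-in for a benchmark task: succeed by clicking one element."""

    def __init__(self, goal: str, target: str) -> None:
        self.goal = goal
        self.target = target

    def reset(self) -> Observation:
        return Observation(self.goal, f"button[{self.target}]")

    def step(self, action: str) -> Tuple[Observation, float, bool]:
        # Actions are plain strings such as "click(submit)" (assumed format).
        success = action == f"click({self.target})"
        text = "done" if success else "unchanged"
        return Observation(self.goal, text), float(success), success


# Tasks from different (toy) benchmarks registered under one namespace.
REGISTRY: Dict[str, Callable[[], WebTask]] = {
    "toy_a/submit": lambda: ToyClickTask("Submit the form", "submit"),
    "toy_b/login": lambda: ToyClickTask("Log in", "login"),
}


def evaluate(agent: Callable[[Observation], str], max_steps: int = 3) -> Dict[str, float]:
    """Run the same agent loop on every registered task; return final rewards."""
    results: Dict[str, float] = {}
    for name, make_task in REGISTRY.items():
        task = make_task()
        obs = task.reset()
        reward = 0.0
        for _ in range(max_steps):
            obs, reward, done = task.step(agent(obs))
            if done:
                break
        results[name] = reward
    return results


def echo_agent(obs: Observation) -> str:
    """Trivial agent: click whatever button name appears in the page text."""
    target = obs.page_text[len("button["):-1]
    return f"click({target})"
```

Because every task exposes the same `reset`/`step` shape, `evaluate(echo_agent)` needs no benchmark-specific code, which is the property a cross-benchmark comparison relies on.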
Related papers (24)
- WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point · February 12, 2025 · arXiv
- WebWalker: Benchmarking LLMs in Web Traversal · January 13, 2025 · arXiv
- WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? · March 11, 2024 · ICML 2024
- Grounding Open-Domain Instructions to Automate Web Support Tasks · March 30, 2021 · NAACL 2021
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research · May 25, 2026 · arXiv
- LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent · January 26, 2026 · ICLR 2026 (Poster)
- GUITester: Enabling GUI Agents for Exploratory Defect Discovery · January 8, 2026 · arXiv
- GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent · May 22, 2025 · ACL 2025
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark · April 18, 2025 · arXiv
- You Only Look at Screens: Multimodal Chain-of-Action Agents · September 20, 2023 · Findings of ACL 2024
- AutoDroid: LLM-powered Task Automation in Android · August 29, 2023 · MobiCom 2024
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models · May 30, 2023 · NeurIPS 2023
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak Navigators · April 11, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents · April 8, 2026 · arXiv
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks · April 7, 2026 · arXiv
- The Art of Building Verifiers for Computer Use Agents · April 5, 2026 · arXiv
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation · April 1, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale · March 2026 · Blog Post
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification · March 27, 2026 · arXiv
- WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing · March 26, 2026 · arXiv
- Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos · March 23, 2026 · CVPR 2026