GUI Agents Papers

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy

🏛 Institutions
McGill, Mila, Google DeepMind, Polytechnique Montréal, ServiceNow Research
📅 Date
April 11, 2025
📑 Publisher
COLM 2025
💻 Env
Web
TLDR

AgentRewardBench evaluates automatic judging of web-agent trajectories rather than the agents themselves. It collects 1,302 expert-reviewed trajectories across five benchmarks and shows that no single LLM judge dominates across settings, while commonly used rule-based evaluators often underreport true success.
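As a rough illustration of what "evaluating a judge" can mean here, the sketch below scores an automatic evaluator's success verdicts against expert labels using precision and recall. The function name `score_judge`, the `JudgeScore` container, and the toy labels are all hypothetical and are not taken from the paper or its code; the paper's exact metrics and data format may differ.

```python
from dataclasses import dataclass


@dataclass
class JudgeScore:
    precision: float  # of trajectories the judge marked successful, fraction experts agree with
    recall: float     # of trajectories experts marked successful, fraction the judge caught


def score_judge(expert: list[bool], judge: list[bool]) -> JudgeScore:
    """Compare a judge's success verdicts with expert labels (hypothetical helper)."""
    if len(expert) != len(judge):
        raise ValueError("expert and judge label lists must have the same length")

    true_pos = sum(e and j for e, j in zip(expert, judge))
    false_pos = sum((not e) and j for e, j in zip(expert, judge))
    false_neg = sum(e and (not j) for e, j in zip(expert, judge))

    # Guard against empty denominators (no predicted or no actual successes).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return JudgeScore(precision=precision, recall=recall)


# Toy labels, invented for illustration only (not data from the benchmark).
expert_labels = [True, True, True, False, False]
rule_based_verdicts = [True, False, False, False, False]

toy_score = score_judge(expert_labels, rule_based_verdicts)
```

In this toy case the evaluator never flags a failed trajectory as successful, yet it misses two of the three expert-confirmed successes. That pattern, high precision with low recall, is one way an evaluator can underreport true success.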
