GUI Agents Papers

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy

🏛 Institutions
McGill, Mila, Google DeepMind, Polytechnique Montréal, ServiceNow Research
📅 Date
April 11, 2025
📑 Publisher
COLM 2025
💻 Env
Web
TLDR

AgentRewardBench evaluates automatic judging of web-agent trajectories rather than the agents themselves. It collects 1,302 expert-reviewed trajectories across five benchmarks and shows that no single LLM judge dominates across settings, while commonly used rule-based evaluators often underreport true success.
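As a rough illustration of what it means to score an evaluator against expert labels, the sketch below compares an automatic evaluator's verdicts with expert annotations on a handful of made-up trajectories. The `Verdict` type, the `precision_recall` helper, the choice of precision and recall as metrics, and the toy data are all assumptions for illustration; none of them is taken from the benchmark's own code or results.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One trajectory: the expert label and the automatic evaluator's verdict."""
    expert_success: bool  # expert annotation, treated here as ground truth
    judge_success: bool   # verdict from an LLM judge or rule-based evaluator


def precision_recall(verdicts):
    """Return (precision, recall) of the evaluator relative to expert labels."""
    tp = sum(v.expert_success and v.judge_success for v in verdicts)
    fp = sum((not v.expert_success) and v.judge_success for v in verdicts)
    fn = sum(v.expert_success and (not v.judge_success) for v in verdicts)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: an evaluator that misses two of three true successes.
toy = [
    Verdict(expert_success=True, judge_success=True),
    Verdict(expert_success=True, judge_success=False),
    Verdict(expert_success=True, judge_success=False),
    Verdict(expert_success=False, judge_success=False),
]

precision, recall = precision_recall(toy)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the evaluator never accepts a failed trajectory, yet it rejects most genuine successes, which is the kind of underreporting the summary attributes to rule-based evaluators.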
