AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy

🏛 Institutions: McGill, Mila, Google DeepMind, Polytechnique Montréal, ServiceNow Research
📅 Date: April 11, 2025
📑 Publisher: COLM 2025
💻 Env: Web
🔑 Keywords: benchmark, dataset, trajectory evaluation, LLM judges, rule-based evaluation, AgentRewardBench

TLDR

AgentRewardBench evaluates automatic judging of web-agent trajectories rather than the agents themselves. It collects 1,302 expert-reviewed trajectories across five benchmarks and shows that no single LLM judge dominates across settings, while commonly used rule-based evaluators often underreport true success.
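
One way to picture what "evaluating an evaluator" means is to score an automatic judge's verdicts against expert labels. The sketch below is illustrative only: the `Trajectory` fields, the `precision_recall` helper, and the toy data are hypothetical, and the choice of precision and recall as metrics is an assumption rather than something this summary states. In this framing, an evaluator that underreports success shows up as one with low recall.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical field names, chosen for illustration only.
    expert_success: bool  # label assigned by an expert reviewer
    judge_success: bool   # verdict from an automatic evaluator


def precision_recall(trajectories):
    """Score an automatic evaluator against expert labels."""
    tp = sum(t.expert_success and t.judge_success for t in trajectories)
    fp = sum((not t.expert_success) and t.judge_success for t in trajectories)
    fn = sum(t.expert_success and (not t.judge_success) for t in trajectories)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (made up): the evaluator misses two of three true successes,
# so it reports a 25% success rate where experts see 75%.
toy = [
    Trajectory(expert_success=True, judge_success=True),
    Trajectory(expert_success=True, judge_success=False),
    Trajectory(expert_success=True, judge_success=False),
    Trajectory(expert_success=False, judge_success=False),
]

precision, recall = precision_recall(toy)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy input the evaluator never claims a success that experts reject, yet it recovers only a third of the true successes, which is the pattern the TLDR describes as underreporting.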
