Autonomous Evaluation and Refinement of Digital Agents

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

🏛 Institutions: UC Berkeley, University of Michigan
📅 Date: April 9, 2024
📑 Publisher: COLM 2024
💻 Env: Web, Desktop
🔑 Keywords: automatic evaluators, oracle-metric agreement, inference-time guidance, self-improvement, digital agents

TLDR

This paper studies domain-general automatic evaluators for web-navigation and device-control agents and shows that they agree with oracle evaluation metrics 74.4% to 92.9% of the time across popular digital-agent benchmarks. It then uses those evaluators to improve existing agents through fine-tuning and inference-time guidance, raising WebArena performance by 29% and yielding a relative improvement of around 75% in device-control settings.
