Autonomous Evaluation and Refinement of Digital Agents
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
- 🏛 Institutions: UC Berkeley, University of Michigan
- 📅 Date: April 9, 2024
- 📑 Publisher: COLM 2024
- 💻 Env: Web, Desktop
- 🔑 Keywords
TLDR
This paper studies domain-general automatic evaluators for web-navigation and device-control agents, showing 74.4% to 92.9% agreement with oracle evaluation metrics across popular digital-agent benchmarks. It then uses those evaluators for fine-tuning and inference-time guidance, improving the performance of existing agents on WebArena by 29% and achieving around a 75% relative improvement in device-control settings.
Related papers (24)
- OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization · October 25, 2024 · ACL 2025
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems · July 17, 2024 · arXiv
- Large Language Models Can Self-Improve At Web Agent Tasks · May 30, 2024 · arXiv
- SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments · March 10, 2026 · ICSE 2026
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · February 15, 2026 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv
- Surfer 2: The Next Generation of Cross-Platform Computer Use Agents · October 22, 2025 · arXiv
- UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action · October 20, 2025 · arXiv
- ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data · September 18, 2025 · ICLR 2026 (Oral)
- UI-Venus Technical Report: Building High-performance UI Agents with RFT · August 14, 2025 · arXiv
- OpenCUA: Open Foundations for Computer-Use Agents · August 12, 2025 · NeurIPS 2025 (Spotlight)
- Test‑Time Reinforcement Learning for GUI Grounding via Region Consistency · August 7, 2025 · AAAI 2026
- GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning · August 6, 2025 · ICLR 2026 (Poster)
- Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent · August 6, 2025 · arXiv
- NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks · August 4, 2025 · arXiv
- RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents · May 31, 2025 · NeurIPS 2025 (Poster)
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments · May 28, 2025 · ICLR 2026 (Oral)
- UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents · May 27, 2025 · NeurIPS 2025 (Poster)
- ScaleTrack: Scaling and back-tracking Automated GUI Agents · May 1, 2025 · arXiv
- GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents · April 14, 2025 · arXiv
- On the Robustness of GUI Grounding Models Against Image Attacks · April 7, 2025 · arXiv
- sudo rm -rf agentic_security · March 26, 2025 · ACL 2025 Industry Track
- In-Context Defense in Computer Agents: An Empirical Study · March 12, 2025 · arXiv