GUI Agents Papers

You Don’t Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu

🏛 Institutions
DeepWisdom, Fudan, HKUST(GZ), UC San Diego, KAUST, Westlake University, Stanford, Yale University, NUS
📅 Date
August 17, 2025
📑 Publisher
SEA @ NeurIPS 2025 (Poster)
💻 Env
General GUI
🔑 Keywords
TLDR

RealDevWorld is an evaluation framework for repository-scale software generation that judges whether produced applications actually work when interacted with through their GUIs. It pairs a 194-task benchmark, RealDevBench, with AppEvalPilot, an agent-as-a-judge system for functional, visual, and runtime evaluation, and reports strong alignment with expert human assessments.
