You Don’t Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu
- 🏛 Institutions
- DeepWisdom, Fudan, HKUST(GZ), UC San Diego, KAUST, Westlake University, Stanford, Yale University, NUS
- 📅 Date
- August 17, 2025
- 📑 Publisher
- SEA @ NeurIPS 2025 (Poster)
- 💻 Env
- General GUI
- 🔑 Keywords
TLDR
RealDevWorld is an evaluation framework for repository-scale software generation that judges whether produced applications actually work when interacted with through their GUIs. It pairs a 194-task benchmark, RealDevBench, with AppEvalPilot, an agent-as-a-judge system for functional, visual, and runtime evaluation, and reports strong alignment with expert human assessments.