SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
Syed Yusuf Ahmed, Shiwei Feng, Chanwoo Bae, Calix Barrus, Xiangyu Zhang
- 🏛 Institutions: Purdue University, University of Texas at San Antonio
- 📅 Date: March 10, 2026
- 📑 Publisher: ICSE 2026
- 💻 Env: Desktop Web
- 🔑 Keywords
TLDR
SpecOps is a fully automated testing framework that uses four specialist agents to generate test cases, set up environments, execute tasks, and validate outcomes for real-world software agents. Across five deployed agents spanning CLI tools, web apps, and browser extensions, it finds 164 true bugs with an F1 score of 0.89, while keeping each test under eight minutes and under $0.73.
Related papers (24)
- Towards Automated Crowdsourced Testing via Personified-LLM · March 25, 2026 · FSE 2026
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents · February 15, 2026 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv
- Surfer 2: The Next Generation of Cross-Platform Computer Use Agents · October 22, 2025 · arXiv
- UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action · October 20, 2025 · arXiv
- ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data · September 18, 2025 · ICLR 2026 (Oral)
- UI-Venus Technical Report: Building High-performance UI Agents with RFT · August 14, 2025 · arXiv
- OpenCUA: Open Foundations for Computer-Use Agents · August 12, 2025 · NeurIPS 2025 (Spotlight)
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency · August 7, 2025 · AAAI 2026
- GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning · August 6, 2025 · ICLR 2026 (Poster)
- Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent · August 6, 2025 · arXiv
- NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks · August 4, 2025 · arXiv
- RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents · May 31, 2025 · NeurIPS 2025 (Poster)
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments · May 28, 2025 · ICLR 2026 (Oral)
- ScaleTrack: Scaling and back-tracking Automated GUI Agents · May 1, 2025 · arXiv
- GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents · April 14, 2025 · arXiv
- On the Robustness of GUI Grounding Models Against Image Attacks · April 7, 2025 · arXiv
- sudo rm -rf agentic_security · March 26, 2025 · ACL 2025 Industry Track
- In-Context Defense in Computer Agents: An Empirical Study · March 12, 2025 · arXiv
- SpiritSight Agent: Advanced GUI Agent with One Look · March 5, 2025 · CVPR 2025 (Poster)
- WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point · February 12, 2025 · arXiv
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents · January 21, 2025 · arXiv
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining · December 13, 2024 · arXiv