OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
Zhenyu Wu , Jingjing Xie , Zehao Li , Bowen Yang , Qiushi Sun , Zhaoyang Liu , Zhoumianze Liu , Yu Qiao , Xiangyu Yue , Zun Wang , Zichen Ding
- 🏛 Institutions
- SJTU , Shanghai AI Laboratory , CUHK MMLab , HKU , HKUST
- 📅 Date
- December 18, 2025
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop Mobile Web
- 🔑 Keywords
TLDR
OS-Oracle targets step-level action criticism for computer-use agents with a 310k-sample cross-platform training pipeline, a two-stage SFT plus CP-GRPO recipe, and the OS-Critic Bench benchmark. The resulting 7B critic reaches state of the art among open-source VLM critics and improves downstream GUI agents when used as a pre-critic.
Related papers (24)
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding TasksDecember 18, 2025 · arXiv
- On the Robustness of GUI Grounding Models Against Image AttacksApril 7, 2025 · arXiv
- GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented UnderstandingJune 16, 2024 · ICLR 2025 (Poster)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI AgentsJanuary 17, 2024 · ACL 2024
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation AgentsMarch 9, 2026 · arXiv
- NaturalGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory DatasetAugust 2, 2025 · arXiv
- RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use AgentsMay 31, 2025 · NeurIPS 2025 (Poster)
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS EnvironmentsMay 28, 2025 · ICLR 2026 (Oral)
- WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting PointFebruary 12, 2025 · arXiv
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model AgentsJuly 1, 2024 · Findings of ACL 2025
- GUI Action Narrator: Where and When Did That Action Take Place?June 19, 2024 · arXiv
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer EnvironmentsApril 11, 2024 · NeurIPS 2024 Datasets and Benchmarks Track
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and WebFebruary 29, 2024 · ECCV 2024 (Poster)
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional FieldsJune 9, 2026 · arXiv
- Benchmarking Living-Screen-Native GUI Agents on Short-Video PlatformsJune 3, 2026 · arXiv
- AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source ApplicationsMay 26, 2026 · arXiv
- MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent ResearchMay 25, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent BenchmarkingMay 24, 2026 · arXiv
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application EnvironmentsApril 30, 2026 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon TasksApril 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent BenchmarkApril 13, 2026 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsApril 12, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak NavigatorsApril 11, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration TasksApril 10, 2026 · arXiv