OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks
Jing Wu , Daphne Barretto , Yiye Chen , Nicholas Gydé , Yanan Jian , Yuhang He , Vibhav Vineet
- 🏛 Institutions
- Oxford , Microsoft , Georgia Tech
- 📅 Date
- January 28, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop
- 🔑 Keywords
TLDR
OS-Marathon benchmarks computer-use agents on 242 long-horizon repetitive desktop workflows such as expense processing and grade entry. The paper also introduces a few-shot condensed-demonstration method for teaching the recurring sub-workflow logic behind these tasks.
Related papers (24)
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional FieldsJune 9, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration TasksApril 10, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent BenchmarkingMay 24, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the WildApril 13, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks?April 9, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI AgentsMarch 19, 2026 · arXiv
- MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World EnvironmentJanuary 28, 2026 · arXiv
- LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI AgentJanuary 26, 2026 · ICLR 2026 (Poster)
- MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented EnvironmentsDecember 22, 2025 · arXiv
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?October 21, 2024 · EMNLP 2024 (Poster)
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application EnvironmentsApril 30, 2026 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsApril 12, 2026 · arXiv
- GPA: Learning GUI Process Automation from DemonstrationsApril 2, 2026 · arXiv
- HippoCamp: Benchmarking Contextual Agents on Personal ComputersApril 1, 2026 · arXiv
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation AgentsMarch 9, 2026 · arXiv
- OSExpert: Computer-Use Agents Learning Professional Skills via ExplorationMarch 9, 2026 · arXiv
- When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use AgentsFebruary 9, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use AgentsFebruary 9, 2026 · arXiv
- EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI AgentsJanuary 25, 2026 · arXiv
- MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning CorrectionJanuary 19, 2026 · arXiv
- ShowUI-π: Flow-based Generative Models as GUI Dexterous HandsDecember 31, 2025 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding TasksDecember 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic ModelsDecember 18, 2025 · arXiv