OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, Vibhav Vineet

🏛 Institutions: Oxford, Microsoft, Georgia Tech
📅 Date: January 28, 2026
📑 Publisher: arXiv
💻 Env: Desktop
🔑 Keywords: benchmark, long-horizon tasks, repetitive workflows, condensed demonstrations, OS-Marathon

TLDR

OS-Marathon benchmarks computer-use agents on 242 long-horizon repetitive desktop workflows such as expense processing and grade entry. The paper also introduces a few-shot condensed-demonstration method for teaching the recurring sub-workflow logic behind these tasks.


Related papers (24)

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

June 9, 2026 · arXiv
HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

April 10, 2026 · arXiv
Gym-Anything: Turn any Software into an Agent Environment

April 7, 2026 · arXiv
SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

May 24, 2026 · arXiv
CocoaBench: Evaluating Unified Digital Agents in the Wild

April 13, 2026 · arXiv
ClawBench: Can AI Agents Complete Everyday Online Tasks?

April 9, 2026 · arXiv
AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

March 19, 2026 · arXiv
MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

January 28, 2026 · arXiv
LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent

January 26, 2026 · ICLR 2026 (Poster)
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

December 22, 2025 · arXiv
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

October 21, 2024 · EMNLP 2024 (Poster)
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

April 30, 2026 · arXiv
The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

April 12, 2026 · arXiv
GPA: Learning GUI Process Automation from Demonstrations

April 2, 2026 · arXiv
HippoCamp: Benchmarking Contextual Agents on Personal Computers

April 1, 2026 · arXiv
PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

March 9, 2026 · arXiv
OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

March 9, 2026 · arXiv
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

February 9, 2026 · arXiv
When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

February 9, 2026 · arXiv
EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents

January 25, 2026 · arXiv
MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction

January 19, 2026 · arXiv
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

December 31, 2025 · arXiv
VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

December 18, 2025 · arXiv
OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

December 18, 2025 · arXiv