Scaling Agents for Computer Use

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang

🏛 Institutions: Simular Research
📅 Date: October 2, 2025
📑 Publisher: arXiv
💻 Env: Desktop
🔑 Keywords: test-time scaling, multiple rollouts, behavior narratives, behavior judge, OSWorld, BJudge

TLDR

This paper argues that computer-use agents scale more effectively across multiple rollouts than within a single rollout, and introduces Behavior Judge (BJudge) to compare candidate trajectories via compact behavior narratives. BJudge reaches 72.6% on OSWorld, slightly surpassing reported human performance, and also generalizes to WindowsAgentArena and AndroidWorld.
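The selection step summarized above can be illustrated with a minimal sketch. It assumes each rollout is reduced to a short text narrative and that a judge picks one candidate by index. The names `Rollout`, `narrate`, `select_best`, and `toy_judge` are hypothetical and are not taken from the paper; the keyword-overlap judge is only a stand-in for a model-based judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    # Hypothetical representation: one short text description per agent step.
    actions: List[str]


def narrate(rollout: Rollout) -> str:
    """Compress a trajectory into a compact narrative (placeholder: join the steps)."""
    return " -> ".join(rollout.actions)


def select_best(
    rollouts: Sequence[Rollout],
    judge: Callable[[str, List[str]], int],
    task: str,
) -> Rollout:
    """Pick one rollout out of several by asking a judge to compare their narratives."""
    if not rollouts:
        raise ValueError("at least one rollout is required")
    narratives = [narrate(r) for r in rollouts]
    index = judge(task, narratives)
    if not 0 <= index < len(rollouts):
        raise IndexError("judge returned an out-of-range index")
    return rollouts[index]


def toy_judge(task: str, narratives: List[str]) -> int:
    """Stand-in judge (assumption): prefer the narrative that mentions the most task words."""
    words = set(task.lower().split())
    scores = [sum(w in n.lower() for w in words) for n in narratives]
    # Ties resolve to the earliest candidate.
    return scores.index(max(scores))


candidates = [
    Rollout(["open browser", "close window"]),
    Rollout(["open settings", "enable dark mode"]),
]
best = select_best(candidates, toy_judge, task="enable dark mode")
```

In this toy run, `best` is the second candidate, because its narrative mentions all three task words. A real system would replace `toy_judge` with a learned or prompted comparator and `narrate` with a summary built from the agent's observations.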
