DPO Learning with LLMs-Judge Signal for Computer Use Agents

Man Luo, David Cobbley, Xin Su, Shachar Rosenman, Vasudev Lal, Shao-Yen Tseng, Phillip Howard

🏛 Institutions: Intel, Thoughtworks
📅 Date: June 3, 2025
📑 Publisher: arXiv
💻 Env: Desktop
🔑 Keywords: model, reinforcement learning, DPO, LLM-as-Judge, local inference, synthetic trajectories

TLDR

This paper targets privacy and compute constraints in computer-use agents by training a lightweight VLM that runs entirely on local machines. It uses an LLM-as-Judge pipeline to score synthetic GUI trajectories and construct DPO preference pairs, then shows that the resulting local agent outperforms baselines on OSWorld.
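
As a rough illustration of how judge scores could be turned into DPO preference pairs, the sketch below groups scored trajectories by task and pairs a higher-scored trajectory with a lower-scored one. It is not the paper's pipeline: the `ScoredTrajectory` type, the `build_preference_pairs` helper, the score scale, and the `min_gap` threshold are all hypothetical, and the prompt/chosen/rejected layout is simply the one commonly used for preference datasets.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ScoredTrajectory:
    """Hypothetical record for one synthetic GUI trajectory."""
    task: str           # task instruction given to the agent
    actions: str        # serialized GUI action sequence
    judge_score: float  # score from the LLM judge (scale is an assumption)


def build_preference_pairs(trajectories, min_gap=1.0):
    """Pair same-task trajectories whose judge scores differ by at least min_gap.

    The higher-scored trajectory becomes "chosen", the lower one "rejected".
    The pairing rule and threshold are illustrative assumptions.
    """
    by_task = {}
    for traj in trajectories:
        by_task.setdefault(traj.task, []).append(traj)

    pairs = []
    for task, group in by_task.items():
        for a, b in combinations(group, 2):
            if abs(a.judge_score - b.judge_score) < min_gap:
                continue  # near-ties carry little preference signal; skip them
            chosen, rejected = (a, b) if a.judge_score > b.judge_score else (b, a)
            pairs.append(
                {"prompt": task, "chosen": chosen.actions, "rejected": rejected.actions}
            )
    return pairs


if __name__ == "__main__":
    demo = [
        ScoredTrajectory("open settings", "click(10, 20)", 4.0),
        ScoredTrajectory("open settings", "type('x')", 1.0),
    ]
    for pair in build_preference_pairs(demo):
        print(pair)
```

Skipping near-ties is one possible design choice: pairs with a small score gap are more likely to reflect judge noise than a real quality difference.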
