From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu

🏛 Institutions: NJU, PKU, MSR Asia
📅 Date: January 9, 2026
📑 Publisher: arXiv
💻 Env: General GUI
🔑 Keywords: reinforcement learning, RLVR, off-policy assimilation, BEPA, OSWorld

TLDR

BEPA improves end-to-end GUI-agent training with verifiable rewards by turning scarce off-policy expert traces into policy-aligned guidance through self-rolled reachable trajectories and a dynamically updated per-task cache. On OSWorld-Verified, it raises UI-TARS-1.5-7B from 22.87% to 32.13% (+9.26 points), with additional gains on MMBench-GUI and Online-Mind2Web.
