Evolving in Tasks: Empowering the Multi-modality Large Language Model as the Computer Use Agent
Yuhao Cheng, Liang Tang, Shuxian Li, Yukang Huo, Tiaonan Duan, Kaer Huang, Yanzhe Jing, Yiqiang Yan
- 🏛 Institutions
- Lenovo, China Agricultural University
- 📅 Date
- August 6, 2025
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop Web
- 🔑 Keywords
TLDR
This paper proposes the Self-Evolution Agent (SEA) for computer use, combining automatic verifiable trajectory generation, efficient step-wise reinforcement learning, and a model-enhancement path that merges grounding and planning ability. It evaluates the resulting agent on grounding benchmarks and OSWorld and frames the method as a way to improve computer-use performance without relying purely on manually curated data.
Related papers
- Attacking Vision-Language Computer Agents via Pop-upsNovember 4, 2024 · ACL 2025
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer EnvironmentsApril 11, 2024 · NeurIPS 2024 Datasets and Benchmarks Track
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use AgentsApril 6, 2026 · arXiv
- GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play AnnotationMarch 27, 2026 · arXiv
- EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic ExperienceJanuary 22, 2026 · arXiv
- CaMeLs Can Use Computers Too: System-level Security for Computer Use AgentsJanuary 14, 2026 · arXiv