ScreenAgent: A Vision Language Model-driven Computer Control Agent

Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang

🏛 Institutions: Jilin University
📅 Date: February 13, 2024
📑 Publisher: IJCAI 2024
💻 Env: Desktop
🔑 Keywords: dataset, planning-acting-reflecting, computer control, UI positioning, ScreenAgent

TLDR

ScreenAgent builds a real computer-control environment in which a vision-language model (VLM) agent observes the screen through screenshots and operates the GUI with mouse and keyboard actions. The agent is driven by a control pipeline with planning, acting, and reflecting phases. The paper also releases the ScreenAgent Dataset and reports computer-control performance comparable to GPT-4V, with more precise UI positioning.
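
The planning-acting-reflecting pipeline mentioned above can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `run_episode`, `Action`, and `Verdict`, the three reflection outcomes, and the callback signatures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """Hypothetical outcomes of the reflecting phase (an assumption)."""
    DONE = "done"      # subtask achieved, move to the next one
    RETRY = "retry"    # try the same subtask again
    REPLAN = "replan"  # discard the remaining plan and plan again


@dataclass
class Action:
    """Hypothetical mouse or keyboard action, e.g. kind="click"."""
    kind: str
    payload: dict = field(default_factory=dict)


def run_episode(task, observe, plan, act, reflect, execute, max_steps=20):
    """Run one plan-act-reflect episode.

    observe() returns a screenshot; plan, act and reflect stand in for
    calls to a vision-language model; execute sends an action to the
    environment. Returns True if every planned subtask was judged done
    within max_steps loop iterations.
    """
    # Planning: decompose the task into subtasks from the current screen.
    subtasks = list(plan(task, observe()))
    steps = 0
    while subtasks and steps < max_steps:
        current = subtasks[0]
        before = observe()
        # Acting: turn the current subtask into low-level actions.
        for action in act(current, before):
            execute(action)
        after = observe()
        # Reflecting: judge the result from the screens before and after.
        verdict = reflect(current, before, after)
        if verdict is Verdict.DONE:
            subtasks.pop(0)
        elif verdict is Verdict.REPLAN:
            subtasks = list(plan(task, after))
        # Verdict.RETRY leaves the subtask at the head of the list.
        steps += 1
    return not subtasks
```

In this sketch the model calls are injected as plain functions, so the loop can be exercised with stubs before any real model or desktop environment is attached.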
