You Only Look at Screens: Multimodal Chain-of-Action Agents
Zhuosheng Zhang, Aston Zhang
- 🏛 Institutions: SJTU, Meta
- 📅 Date: September 20, 2023
- 📑 Publisher: Findings of ACL 2024
- 💻 Env: Mobile
- 🔑 Keywords
TLDR
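Auto-GUI is a multimodal mobile GUI agent that acts directly on screenshots, avoiding environment parsing and application-specific APIs. The paper introduces a chain-of-action technique, which conditions each action prediction on previous action histories and future action plans, and evaluates the method on AITW (Android in the Wild), a device-control benchmark with 30K unique instructions.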
Related papers
- GUITester: Enabling GUI Agents for Exploratory Defect Discovery · January 8, 2026 · arXiv
- GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent · May 22, 2025 · ACL 2025
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark · April 18, 2025 · arXiv
- ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents · October 9, 2024 · SIGDIAL 2025
- AutoDroid: LLM-powered Task Automation in Android · August 29, 2023 · MobiCom 2024
- Android in the Wild: A Large-Scale Dataset for Android Device Control · July 19, 2023 · NeurIPS 2023 Datasets and Benchmarks Track