<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>GUI Agents Papers</title><description>A curated, searchable list of research papers on GUI agents — models, frameworks, benchmarks, datasets, and more.</description><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/</link><language>en</language><item><title>A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2605-00551/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2605-00551/</guid><description>A11y-Compressor reformats linearized accessibility-tree observations into compact, structured representations for GUI agents, addressing the cost and noise of raw a11y trees. The compression cuts input tokens to 22% of the original while improving OSWorld task success rate by an average of 5.1 percentage points.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>accessibility tree</category><category>observation compression</category><category>efficiency</category><category>A11y-Compressor</category></item><item><title>WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-27776/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-27776/</guid><description>WindowsWorld addresses the gap left by existing GUI benchmarks, which focus on isolated single-application tasks, by presenting a process-centric suite of 181 cross-application desktop tasks (avg 5.0 sub-goals across 17 applications, 78% multi-application). Evaluated computer-use agents fall below 21% success on multi-application tasks, substantially trailing single-application performance and exposing weak workflow-level coordination.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>process-centric</category><category>cross-application</category><category>Windows</category><category>WindowsWorld</category></item><item><title>Training Computer Use Agents to Assess the Usability of Graphical User Interfaces</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-26020/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-26020/</guid><description>uxCUA frames automated GUI usability assessment as a learned agent task by (i) prioritizing important interaction flows, (ii) executing them with human-like interactions, and (iii) predicting a numerical usability score.
The trained agent outperforms larger general-purpose LLMs at scoring usability and generating actionable critiques on both experimental and production interfaces.</description><pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>model</category><category>usability</category><category>interaction flow prioritization</category><category>uxCUA</category></item><item><title>AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24441/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24441/</guid><description>AutoGUI-v2 unifies region-level semantics, element grounding, and state prediction into 2,753 tasks spanning six operating systems, addressing the bifurcation between black-box task-completion and shallow grounding benchmarks. Open-source models excel at functional grounding while commercial models do better at functionality description, but all struggle with complex interaction logic in uncommon actions.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>GUI functionality understanding</category><category>region semantics</category><category>state prediction</category><category>AutoGUI-v2</category></item><item><title>Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24964/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24964/</guid><description>Odysseys targets the saturation of short single-site web-agent benchmarks by curating 200 realistic long-horizon multi-site workflows graded with 1,225 rubric items. The benchmark exposes large gaps between frontier computer-use agents and human performance on extended cross-site reasoning and persistent task state.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>long-horizon</category><category>multi-site</category><category>rubric evaluation</category><category>Odysseys</category></item><item><title>VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-21375/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-21375/</guid><description>VLAA-GUI tackles two recurring failure modes of autonomous GUI agents — premature task termination and unproductive action loops — with a modular framework that decides when to Stop, Recover, and Search. 
The system reaches 77.5% on OSWorld and 61.0% on WindowsAgentArena, achieving top performance on both benchmarks with multiple LLM backbones.</description><pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>framework</category><category>early stopping</category><category>recovery</category><category>VLAA-GUI</category></item><item><title>GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14262/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14262/</guid><description>GUI-Perturbed introduces a controlled perturbation framework that independently varies visual scenes and instructions to probe grounding robustness beyond static benchmarks. Models reporting &gt;85% on standard benchmarks lose 27-56 points when spatial reasoning is required, exposing systematic brittleness rather than genuine grounding ability.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>GUI grounding</category><category>domain randomization</category><category>robustness</category><category>GUI-Perturbed</category></item><item><title>UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14113/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14113/</guid><description>UI-Zoomer treats the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate activates zoom-in only when needed, while an uncertainty-driven module picks per-instance crop sizes via variance decomposition. The training-free method improves GUI grounding by 4.2-13.4% on three benchmarks and is compatible with multiple model architectures.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>training-free</category><category>uncertainty quantification</category><category>adaptive zoom</category><category>UI-Zoomer</category></item><item><title>See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-13019/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-13019/</guid><description>See-Point-Refine reframes GUI grounding as an iterative loop where the agent points, observes visual feedback, and refines, targeting editing-level grounding in dense coding interfaces that require sub-pixel accuracy.
The multi-turn formulation improves grounding precision over single-shot baselines by recovering from initial errors using closed-loop visual evidence.</description><pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>multi-turn</category><category>visual feedback</category><category>editing-level grounding</category><category>See-Point-Refine</category></item><item><title>ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11784/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11784/</guid><description>ClawGUI provides an open-source full-stack GUI agent framework with three components: ClawGUI-RL (online RL training infrastructure for parallel virtual environments and real devices using GiGPO + Process Reward Model), ClawGUI-Eval (standardized evaluation across 6 benchmarks with a 95.8% reproduction rate), and ClawGUI-Agent (multi-OS deployment via 12+ chat platforms). The trained ClawGUI-2B outperforms MAI-UI-2B by 6 points on MobileWorld.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>model</category><category>framework</category><category>reinforcement learning</category><category>reward model</category><category>GiGPO</category><category>ClawGUI</category></item><item><title>CocoaBench: Evaluating Unified Digital Agents in the Wild</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11201/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11201/</guid><description>CocoaBench evaluates unified digital agents on long-horizon tasks requiring flexible composition of vision, search, and coding. Tasks are specified by an instruction and an automatic evaluation function, enabling reliable, scalable evaluation across agent infrastructures. The best system evaluated reaches only 45.1%, exposing gaps in reasoning, tool use, and visual grounding.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>unified digital agents</category><category>long-horizon tasks</category><category>visual grounding</category><category>CocoaBench</category><category>CocoaAgent</category></item><item><title>WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10988/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10988/</guid><description>WebForge automates browser agent benchmark construction via a four-agent Plan-Generate-Refine-Validate pipeline that produces interactive, self-contained web environments without human annotation.
It releases WebForge-Bench (934 tasks across 7 domains and 3 difficulty levels) with seven-dimensional difficulty control that enables systematic capability profiling beyond aggregate scores.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>dataset</category><category>automated environment generation</category><category>WebForge</category><category>WebForge-Bench</category></item><item><title>The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10577/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10577/</guid><description>OS-BLIND benchmarks computer-use agents under unintended attack scenarios where benign instructions trigger harmful outcomes through environmental context. Most agents exceed 90% attack success rate, and even safety-aligned Claude 4.5 Sonnet reaches 73%. Existing safety defenses activate only initially and fail to re-engage during execution, especially when subtask decomposition obscures harmful intent.</description><pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>safety</category><category>security</category><category>unintended attacks</category><category>OS-BLIND</category></item><item><title>The Amazing Agent Race: Strong Tool Users, Weak Navigators</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10261/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10261/</guid><description>The Amazing Agent Race introduces 1,400 DAG-puzzle legs that require fork-merge tool chains over Wikipedia, distinguishing navigation from tool-use ability. The best agent reaches only 37.2%, with navigation errors dominating (27-52% of trials) while tool-use errors stay below 17%, revealing a navigation blind spot invisible to linear benchmarks.</description><pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>DAG puzzles</category><category>navigation</category><category>tool use</category><category>Wikipedia</category><category>AAR</category></item><item><title>CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09155/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09155/</guid><description>CORA reformulates safety as selective action execution: a Guardian model estimates action-conditional risk, Conformal Risk Control calibrates an execute/abstain boundary under a user-specified risk budget, and a Diagnostician proposes interventions for rejected actions. A Goal-Lock mechanism resists visual injection. 
The Phone-Harm benchmark with step-level harm labels is also released.</description><pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>safety</category><category>benchmark</category><category>conformal risk control</category><category>Phone-Harm</category><category>CORA</category></item><item><title>EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09815/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09815/</guid><description>EE-MCP frames computer-use agent design as hybrid policy learning that balances GUI interaction and MCP API calls, with an automated pipeline for environment generation, trajectory collection, and gap-driven task synthesis. An experience bank of LLM-learned rules enables inference-time improvement: distillation wins on MCP-dominant tasks (+17.8pp) while the experience bank excels on GUI-intensive tasks (+10.0pp).</description><pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>framework</category><category>MCP</category><category>self-evolving</category><category>experience learning</category><category>hybrid policy</category><category>EE-MCP</category></item><item><title>HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09937/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09937/</guid><description>HealthAdminBench evaluates computer-use agents on healthcare administration via 4 realistic GUI environments (EHR, two payer portals, fax) and 135 expert-defined tasks decomposed into 1,698 subtasks. The best agent (Claude Opus 4.6 CUA) reaches only 36.3% end-to-end despite 82.8% subtask success, exposing a large gap to real-world reliability.</description><pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>healthcare</category><category>HealthAdminBench</category><category>EHR</category><category>long-horizon tasks</category></item><item><title>Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07831/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07831/</guid><description>This paper proposes Semantic-level UI Element Injection, a red-teaming method that overlays safety-aligned UI elements onto screenshots to misdirect GUI agents&apos; visual grounding.
Using a modular Editor-Overlapper-Victim pipeline with iterative search, optimized attacks improve attack success rate by up to 4.4x over random injection and transfer across models.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>safety</category><category>red teaming</category><category>visual grounding</category><category>UI injection</category></item><item><title>ClawBench: Can AI Agents Complete Everyday Online Tasks?</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08523/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08523/</guid><description>ClawBench evaluates AI agents on 153 everyday online tasks across 144 live production websites spanning purchases, bookings, and job applications. A lightweight interception layer blocks final submissions for safe evaluation. The best model (Claude Sonnet 4.6) achieves only 33.3%, exposing a large gap in real-world web automation.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>realistic website</category><category>long-horizon tasks</category><category>ClawBench</category></item><item><title>KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08455/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08455/</guid><description>KnowU-Bench is an online benchmark for personalized mobile agents on Android emulation with 42 general, 86 personalized, and 64 proactive tasks. It hides user profiles from the agent and forces genuine preference inference through multi-turn dialogues. Even frontier models fall below 50% under vague instructions requiring preference inference.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>personalization</category><category>proactive agents</category><category>KnowU-Bench</category></item><item><title>MolmoWeb: Open Visual Web Agent and Open Data for the Open Web</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08516/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08516/</guid><description>MolmoWeb is a family of fully open multimodal web agents (4B and 8B) trained on MolmoWebMix (100K+ synthetic trajectories and 30K+ human demonstrations). Operating as screenshot-only visual-language action policies without HTML or accessibility tree access, the models achieve SOTA on WebVoyager, Online-Mind2Web, and DeepShop, outperforming larger closed models like GPT-4o.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>model</category><category>dataset</category><category>MolmoWeb</category><category>open-source</category></item><item><title>Preference Redirection via Attention Concentration: An Attack on Computer Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08005/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08005/</guid><description>PRAC is a novel attack on Computer Use Agents that redirects model attention toward a stealthy adversarial patch to alter internal preferences rather than directly manipulating outputs.
The attack influences product selection on online shopping platforms and generalizes across fine-tuned variants of the same backbone, highlighting risks for CUAs built on open-weight models.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>security</category><category>safety</category><category>attack</category><category>adversarial patch</category><category>attention manipulation</category><category>PRAC</category></item><item><title>Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07929/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07929/</guid><description>This paper presents a trace-level evaluation framework comparing human and GUI-agent behavior across task outcome, query formulation, and navigation in a production audio-streaming search application. With 39 participants and a state-of-the-art GUI agent on 10 multi-hop search tasks, the agent matches task success but follows search-centric, low-branching strategies versus humans&apos; content-centric exploration.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>evaluation</category><category>trace-level analysis</category><category>user simulation</category><category>search systems</category><category>behavioral alignment</category></item><item><title>Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07277/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07277/</guid><description>Android Coach shifts online RL training from Single State Single Action to Single State Multiple Actions by learning a critic that estimates action values and integrating a process reward model with group-wise advantage estimation. It improves UI-TARS-1.5-7B by 7.5% on AndroidLab and 8.3% on AndroidWorld with 1.4x higher training efficiency than PPO and GRPO.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>reinforcement learning</category><category>reward model</category><category>training efficiency</category><category>process reward model</category><category>Android Coach</category></item><item><title>GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07429/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07429/</guid><description>GameWorld is a standardized browser-based benchmark for multimodal game agents with 34 games and 170 tasks. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>game agent</category><category>computer-use control</category><category>semantic control</category><category>GameWorld</category></item><item><title>What&apos;s Missing in Screen-to-Action? 
Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06995/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06995/</guid><description>UILoop treats GUI reasoning as a cyclic Screen → UI elements → Action process, enabling MLLMs to explicitly learn the localization, semantic functions, and usage of key UI elements. It introduces UI Comprehension-Bench (26K samples) and achieves state-of-the-art GUI reasoning performance with improved interpretability.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>benchmark</category><category>UILoop</category><category>UI comprehension</category></item><item><title>Don&apos;t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05477/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05477/</guid><description>VeriGUI treats action-effect verification as a first-class RL objective to handle non-deterministic GUI environments with network delays, rendering lags, and system failures. A Thinking-Verification-Action-Expectation framework identifies failures; two-phase training with Robust SFT and GRPO using asymmetric verification rewards reduces failure loops. A new Robustness Benchmark built on AndroidControl evaluates failure recognition and correction.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>reinforcement learning</category><category>GRPO</category><category>action verification</category><category>self-correction</category><category>VeriGUI</category></item><item><title>Gym-Anything: Turn any Software into an Agent Environment</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06126/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06126/</guid><description>Gym-Anything converts any software into an interactive computer-use environment via multi-agent setup and audit. It produces CUA-World with 10K+ long-horizon tasks spanning medical science, astronomy, and enterprise systems, plus CUA-World-Long with tasks requiring 500+ steps, far exceeding existing benchmarks.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>dataset</category><category>Gym-Anything</category><category>CUA-World</category><category>long-horizon tasks</category></item><item><title>MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06134/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06134/</guid><description>MAESTRO extends GUI agents from task execution to decision support by maintaining a shared preference memory.
It provides Preference-Grounded GUI Adaptation (augment, sort, filter, highlight) and Preference-Guided Workflow Navigation that detects preference conflicts and proposes backtracking.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>personalization</category><category>MAESTRO</category><category>preference</category><category>GUI adaptation</category></item><item><title>WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06367/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06367/</guid><description>WebSP-Eval is the first framework evaluating web agents on user-facing website security and privacy tasks such as cookie preferences, privacy settings, and session revocation. Across 200 task instances on 28 websites, agents fail on more than 45% of tasks with stateful UI elements like toggles and checkboxes.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>security</category><category>privacy</category><category>WebSP-Eval</category></item><item><title>GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-04399/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-04399/</guid><description>GUIDE decomposes GUI agent trajectory evaluation into three sequential stages — trajectory segmentation, subtask diagnosis, and structured error analysis — mirroring the compositional structure of GUI tasks. Evaluated on 932 industrial e-commerce trajectories, AGENTREWARDBENCH, and AndroidBench, it improves accuracy by up to 5.35 points over baselines while producing diagnostic insights.</description><pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>evaluation</category><category>trajectory diagnosis</category><category>hierarchical diagnosis</category><category>error analysis</category><category>GUIDE</category></item><item><title>IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05157/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05157/</guid><description>IntentScore is a plan-aware reward model for computer-use agents trained from 398K offline GUI interaction steps across three OSes, using contrastive alignment and margin ranking objectives.
It achieves 97.5% pairwise discrimination and, when used as a re-ranker for Agent S3 on OSWorld, improves task success rate by 6.9 points.</description><pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>reward model</category><category>model</category><category>plan-aware reward</category><category>contrastive alignment</category><category>margin ranking</category><category>OSWorld</category></item><item><title>The Art of Building Verifiers for Computer Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06240/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06240/</guid><description>This paper presents lessons from building a Universal Verifier for web agent trajectories, based on four principles: meaningful rubrics, separated process/outcome rewards, a controllable vs. uncontrollable failure distinction, and divide-and-conquer context management. The verifier reduces false positive rates to near zero, compared to WebVoyager (45%+) and WebJudge (22%+).</description><pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>reward model</category><category>verification</category><category>CUAVerifierBench</category></item><item><title>The Tool Illusion: Rethinking Tool Use in Web Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-03465/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-03465/</guid><description>This paper presents an extensive controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks to determine whether tools provide consistent gains for web agents. Its findings revise some prior conclusions and complement others with broader evidence.</description><pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>empirical study</category><category>tool use</category><category>WebArena</category></item><item><title>GPA: Learning GUI Process Automation from Demonstrations</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01676/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01676/</guid><description>GPA is a vision-based GUI process automation system that enables fast and stable process replay from a single demonstration. Using Sequential Monte Carlo-based localization and readiness calibration, it achieves higher success rates with 10x faster execution than Gemini 3 Pro on long-horizon GUI tasks, running entirely locally without cloud LLMs.</description><pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>training-free</category><category>process automation</category><category>long-horizon tasks</category><category>GPA</category><category>robotic process automation</category></item><item><title>HippoCamp: Benchmarking Contextual Agents on Personal Computers</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01221/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01221/</guid><description>HippoCamp evaluates contextual agents on personal computers by modeling individual user profiles and requiring search over massive personal file collections to support context-aware reasoning, going beyond app-isolated benchmarks.
The benchmark emphasizes multimodal file management as a first-class capability for personal-computing assistants.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>contextual agent</category><category>personal files</category><category>multimodal file management</category><category>HippoCamp</category></item><item><title>Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00842/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00842/</guid><description>Pare models digital apps as finite state machines with stateful navigation to enable realistic active user simulation for proactive agents. Pare-Bench provides 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps to test context observation, goal inference, and intervention timing.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>proactive agents</category><category>Pare</category></item><item><title>When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00892/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00892/</guid><description>The first systematic study of interruptible agents in long-horizon web navigation. It formalizes three interruption types (addition, revision, retraction) and introduces InterruptBench derived from WebArena-Lite, showing that handling mid-task user interruptions remains challenging for current LLMs.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>interruptibility</category><category>InterruptBench</category><category>WebArena</category></item><item><title>PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-29318/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-29318/</guid><description>PSPA-Bench evaluates personalization in smartphone GUI agents with 12,855+ personalized instructions across 10 daily-use scenarios and 22 mobile apps. 
Even the strongest of 11 benchmarked agents performs poorly under personalized settings, highlighting gaps in reasoning, perception, and long-term memory.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>dataset</category><category>personalization</category><category>PSPA-Bench</category></item><item><title>Terminal Agents Suffice for Enterprise Automation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00073/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00073/</guid><description>This paper shows that a coding agent equipped only with a terminal and filesystem can match or outperform GUI-driven and MCP tool-augmented agents for enterprise automation tasks across ServiceNow, GitLab, and ERPNext, arguing that simple programmatic API interfaces combined with strong foundation models suffice.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>enterprise automation</category><category>terminal agents</category><category>empirical study</category></item><item><title>WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/webarena-infinity-generating-browser-environments-with-verifiable-tasks-at-scale/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/webarena-infinity-generating-browser-environments-with-verifiable-tasks-at-scale/</guid><description>WebArena-Infinity automates the generation of high-authenticity web environments with verifiable tasks from static artifacts like user manuals, using a multi-agent pipeline of coding and browser-use agents. It produces 10 environments with 1,260 tasks and 2,070 trajectories. Agents achieve notably lower success rates than on manually built benchmarks, suggesting the generated tasks capture meaningful complexity.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>dataset</category><category>environment synthesis</category><category>verifiable rewards</category><category>reinforcement learning</category><category>WebArena</category></item><item><title>GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26266/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26266/</guid><description>GUIDE is a training-free add-on for desktop GUI agents that retrieves relevant tutorial videos, turns them into planning and grounding annotations, and injects that expertise into existing agents without changing model parameters. 
On OSWorld, it improves multiple agent families while also reducing execution steps.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>video retrieval</category><category>automatic annotation</category><category>domain bias</category><category>training-free</category><category>OSWorld</category><category>GUIDE</category></item><item><title>Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26041/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26041/</guid><description>This paper studies token pruning for historical GUI screenshots and finds that background regions still carry useful state-transition cues, random pruning preserves spatial structure surprisingly well, and allocating larger token budgets to recent screenshots keeps performance nearly unchanged while reducing cost.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>token pruning</category><category>historical screenshots</category><category>random pruning</category><category>recency effect</category><category>empirical study</category></item><item><title>Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26211/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26211/</guid><description>This paper adapts the discrete diffusion model LLaDA-V to GUI grounding and proposes a hybrid masking schedule for bounding-box prediction. Across web, desktop, and mobile benchmarks, the diffusion model outperforms its linear-masked variant and remains competitive with autoregressive VLMs.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>diffusion models</category><category>hybrid masking</category><category>bounding-box prediction</category><category>LLaDA-V</category><category>cross-platform</category></item><item><title>Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26648/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26648/</guid><description>Vision2Web is a hierarchical benchmark for visual website development that spans static UI-to-code, interactive frontend reproduction, and full-stack website construction. 
It evaluates coding agents with workflow-based verification using a GUI agent verifier and a VLM judge, and shows that current models still struggle badly on full-stack tasks.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>website development</category><category>agent verification</category><category>UI-to-code</category><category>full-stack development</category><category>Vision2Web</category></item><item><title>GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25864/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25864/</guid><description>GUIDE studies collaborative GUI assistance rather than pure task automation, using 67.5 hours of think-aloud recordings from 120 novice users across 10 software applications. It benchmarks behavior-state detection, intent prediction, and help prediction, and shows that current multimodal models still struggle to infer what users are doing and when intervention would be useful.</description><pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>collaborative assistance</category><category>behavior state detection</category><category>intent prediction</category><category>help prediction</category><category>think-aloud data</category></item><item><title>WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25226/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25226/</guid><description>WebTestBench studies end-to-end automated web testing rather than ordinary task completion, decomposing the problem into checklist generation and defect detection across diverse web applications. Its WebTester baseline shows that current systems still struggle with test completeness, latent logical defects, and long-horizon reliability.</description><pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>automated testing</category><category>defect detection</category><category>checklist generation</category><category>latent logical defects</category><category>WebTestBench</category></item><item><title>CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24440/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24440/</guid><description>CUA-Suite is a large-scale desktop-agent data ecosystem centered on continuous expert video rather than sparse screenshots. 
It combines VideoCUA, UI-Vision, and GroundCUA to provide 55 hours of demonstrations, dense grounding annotations, and evaluation data across 87 professional desktop applications where current foundation action models still fail frequently.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>dataset</category><category>video demonstrations</category><category>desktop workflows</category><category>grounding dataset</category><category>VideoCUA</category><category>GroundCUA</category></item><item><title>Towards Automated Crowdsourced Testing via Personified-LLM</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24160/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24160/</guid><description>PersonaTester automates crowdsourced GUI testing by injecting empirically derived tester personas into LLM agents. On 15 mobile apps, it reproduces more diverse testing behaviors than non-persona baselines and triggers more crashes and functional bugs.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>GUI testing</category><category>crowdsourced testing</category><category>persona-guided testing</category><category>bug finding</category><category>PersonaTester</category></item><item><title>UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24533/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24533/</guid><description>UI-Voyager is a self-evolving mobile GUI agent that learns from failed trajectories instead of manual annotations. Its two-stage training combines rejection fine-tuning with group-relative self-distillation to turn successful rollouts into dense corrective supervision, yielding 81.0% Pass@1 on AndroidWorld with a 4B model.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>reinforcement learning</category><category>self-evolving</category><category>failed trajectory learning</category><category>RFT</category><category>GRSD</category><category>AndroidWorld</category></item><item><title>AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23007/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23007/</guid><description>AgentRAE is a backdoor attack against screenshot-based mobile GUI agents that uses benign-looking notification icons as triggers for remote action execution. 
Its contrastive-pretraining plus poisoning pipeline preserves clean performance, exceeds 90% attack success over ten mobile operations, and evades eight representative defenses.</description><pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>backdoor attack</category><category>notification trigger</category><category>security</category><category>contrastive learning</category><category>Remote Action Execution</category><category>AgentRAE</category></item><item><title>CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23559/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23559/</guid><description>ReCAP is a native GUI agent specialized for interactive CAPTCHA solving. It builds a seven-type CAPTCHA environment, generates large-scale reasoning-action trajectories plus self-correction data from failed attempts, and improves success from about 30% to 80% without sacrificing general GUI-agent performance.</description><pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>CAPTCHA</category><category>reasoning-action data generation</category><category>self-correction</category><category>native GUI agent</category><category>ReCAP</category></item><item><title>Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-22529/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-22529/</guid><description>Ego2Web is a benchmark that couples egocentric first-person videos with web tasks requiring real-world visual understanding before online interaction. 
It also introduces Ego2WebJudge, an LLM-as-a-judge evaluator with about 84% agreement with humans, and shows large headroom for current agents.</description><pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>egocentric video</category><category>LLM-as-a-judge</category><category>web planning</category><category>Ego2Web</category><category>Ego2WebJudge</category></item><item><title>ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-20340/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-20340/</guid><description>ContractSkill turns draft web-agent skills into executable artifacts with explicit contracts, enabling deterministic verification, local fault localization, and patch-based repair instead of full skill rewrites, improving skill reliability on VisualWebArena and MiniWoB while preserving transfer across models.</description><pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>agent skill</category><category>contract</category><category>program repair</category><category>verification</category><category>cross-model transfer</category></item><item><title>AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-18429/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-18429/</guid><description>AndroTMem studies interaction memory in long-horizon Android GUI agents through a 1,069-task benchmark designed to require carrying forward critical intermediate state. It introduces Anchored State Memory, which stores causally linked state anchors and improves completion rates by 5%-30.16% over replay and summary baselines across 12 agents.</description><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>agent memory</category><category>long-horizon tasks</category><category>Anchored State Memory</category><category>AndroTMem-Bench</category><category>AndroTMem</category></item><item><title>OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-19191/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-19191/</guid><description>OS-Themis is a scalable critic framework for GUI reward modeling that breaks trajectories into verifiable milestones and audits the evidence chain before issuing a verdict. 
It improves training and filtering loops on AndroidWorld and introduces OmniGUIRewardBench as a cross-platform benchmark for GUI outcome rewards.</description><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>reward model</category><category>reinforcement learning</category><category>critic framework</category><category>OmniGUIRewardBench</category><category>milestone decomposition</category><category>OS-Themis</category></item><item><title>AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17441/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17441/</guid><description>AdaZoom-GUI targets two concrete GUI-grounding bottlenecks: ambiguous natural-language instructions and tiny UI elements in high-resolution screenshots. It combines instruction rewriting with a conditional second-stage zoom-in pass and reports state-of-the-art grounding performance among comparable model sizes.</description><pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>instruction refinement</category><category>adaptive zoom</category><category>GRPO</category><category>bounding-box prediction</category><category>AdaZoom-GUI</category></item><item><title>FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17826/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17826/</guid><description>FailureMem is a multimodal automated program repair framework for settings where repair requires joint reasoning over code, issue text, and GUI screenshots. It combines hybrid workflow-agent control, region-level visual grounding, and a Failure Memory Bank that converts failed repair attempts into reusable guidance, improving the resolved rate over GUIRepair on SWE-bench Multimodal.</description><pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>multimodal program repair</category><category>software engineering</category><category>program repair</category><category>failure memory</category><category>SWE-bench Multimodal</category><category>FailureMem</category></item><item><title>WebPII: Benchmarking Visual PII Detection for Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17357/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17357/</guid><description>WebPII is a benchmark for detecting personally identifiable information in web screenshots, centered on e-commerce interfaces where agents may expose sensitive content during browsing and form completion.
It extends the PII taxonomy to transaction-level identifiers and partially filled forms, and pairs the benchmark with WebRedact to show practical privacy-preserving deployment.</description><pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>privacy</category><category>PII detection</category><category>visual redaction</category><category>e-commerce UI</category><category>WebPII</category></item><item><title>GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-15039/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-15039/</guid><description>GUI-CEval is the first comprehensive Chinese benchmark for mobile GUI agents, spanning 201 apps across four device types with a hierarchical two-level evaluation structure (atomic abilities and application-level tasks) along five dimensions (perception, planning, reflection, execution, evaluation), revealing that most MLLMs still struggle with reflective decision-making and post-action self-evaluation.</description><pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>Chinese</category><category>hierarchical evaluation</category><category>physical-device evaluation</category><category>GUI-CEval</category></item></channel></rss>