<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>GUI Agents Papers</title><description>A curated, searchable list of research papers on GUI agents — models, frameworks, benchmarks, datasets, and more.</description><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/</link><language>en</language><item><title>A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2605-00551/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2605-00551/</guid><description>A11y-Compressor reformats linearized accessibility-tree observations into compact, structured representations for GUI agents, addressing the cost and noise of raw a11y trees. The compression cuts input tokens to 22% of the original while improving OSWorld task success rate by an average of 5.1 percentage points.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>accessibility tree</category><category>observation compression</category><category>efficiency</category><category>A11y-Compressor</category></item><item><title>WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-27776/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-27776/</guid><description>WindowsWorld addresses the gap left by existing GUI benchmarks, which focus on isolated single-application tasks, by presenting a process-centric suite of 181 cross-application desktop tasks (avg 5.0 sub-goals across 17 applications, 78% multi-application). Evaluated computer-use agents fall below 21% success on multi-application tasks, substantially trailing single-application performance and exposing weak workflow-level coordination.</description><pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>process-centric</category><category>cross-application</category><category>Windows</category><category>WindowsWorld</category></item><item><title>Training Computer Use Agents to Assess the Usability of Graphical User Interfaces</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-26020/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-26020/</guid><description>uxCUA frames automated GUI usability assessment as a learned agent task by (i) prioritizing important interaction flows, (ii) executing them with human-like interactions, and (iii) predicting a numerical usability score.
The trained agent outperforms larger general-purpose LLMs at scoring usability and generating actionable critiques on both experimental and production interfaces.</description><pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>model</category><category>usability</category><category>interaction flow prioritization</category><category>uxCUA</category></item><item><title>AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24441/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24441/</guid><description>AutoGUI-v2 unifies region-level semantics, element grounding, and state prediction into 2,753 tasks spanning six operating systems, addressing the bifurcation between black-box task-completion and shallow grounding benchmarks. Open-source models excel at functional grounding while commercial models do better at functionality description, but all struggle with complex interaction logic in uncommon actions.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>GUI functionality understanding</category><category>region semantics</category><category>state prediction</category><category>AutoGUI-v2</category></item><item><title>Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24964/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-24964/</guid><description>Odysseys targets the saturation of short single-site web-agent benchmarks by curating 200 realistic long-horizon multi-site workflows graded with 1,225 rubric items. The benchmark exposes large gaps between frontier computer-use agents and human performance on extended cross-site reasoning and persistent task state.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>long-horizon</category><category>multi-site</category><category>rubric evaluation</category><category>Odysseys</category></item><item><title>VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-21375/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-21375/</guid><description>VLAA-GUI tackles two recurring failure modes of autonomous GUI agents — premature task termination and unproductive action loops — with a modular framework that decides when to Stop, Recover, and Search. 
The system reaches 77.5% on OSWorld and 61.0% on WindowsAgentArena, achieving top performance on both benchmarks with multiple LLM backbones.</description><pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>framework</category><category>early stopping</category><category>recovery</category><category>VLAA-GUI</category></item><item><title>GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14262/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14262/</guid><description>GUI-Perturbed introduces a controlled perturbation framework that independently varies visual scenes and instructions to probe grounding robustness beyond static benchmarks. Models reporting &gt;85% on standard benchmarks lose 27-56 points when spatial reasoning is required, exposing systematic brittleness rather than genuine grounding ability.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>GUI grounding</category><category>domain randomization</category><category>robustness</category><category>GUI-Perturbed</category></item><item><title>UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14113/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-14113/</guid><description>UI-Zoomer treats the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate activates zoom-in only when needed, while an uncertainty-driven module picks per-instance crop sizes via variance decomposition. The training-free method improves GUI grounding by 4.2-13.4% on three benchmarks and is compatible with multiple model architectures.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>training-free</category><category>uncertainty quantification</category><category>adaptive zoom</category><category>UI-Zoomer</category></item><item><title>See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-13019/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-13019/</guid><description>See-Point-Refine reframes GUI grounding as an iterative loop where the agent points, observes visual feedback, and refines, targeting editing-level grounding in dense coding interfaces that require sub-pixel accuracy.
The multi-turn formulation improves grounding precision over single-shot baselines by recovering from initial errors using closed-loop visual evidence.</description><pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>multi-turn</category><category>visual feedback</category><category>editing-level grounding</category><category>See-Point-Refine</category></item><item><title>ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11784/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11784/</guid><description>ClawGUI provides an open-source full-stack GUI agent framework with three components: ClawGUI-RL (online RL training infrastructure for parallel virtual environments and real devices using GiGPO + Process Reward Model), ClawGUI-Eval (standardized evaluation across 6 benchmarks with a 95.8% reproduction rate), and ClawGUI-Agent (multi-OS deployment via 12+ chat platforms). The trained ClawGUI-2B outperforms MAI-UI-2B by 6 points on MobileWorld.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>model</category><category>framework</category><category>reinforcement learning</category><category>reward model</category><category>GiGPO</category><category>ClawGUI</category></item><item><title>CocoaBench: Evaluating Unified Digital Agents in the Wild</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11201/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-11201/</guid><description>CocoaBench evaluates unified digital agents on long-horizon tasks requiring flexible composition of vision, search, and coding. Tasks are specified by an instruction and an automatic evaluation function, enabling reliable, scalable evaluation across agent infrastructures. The best system evaluated reaches only 45.1%, exposing gaps in reasoning, tool use, and visual grounding.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>unified digital agents</category><category>long-horizon tasks</category><category>visual grounding</category><category>CocoaBench</category><category>CocoaAgent</category></item><item><title>WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10988/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10988/</guid><description>WebForge automates browser agent benchmark construction via a four-agent Plan-Generate-Refine-Validate pipeline that produces interactive, self-contained web environments without human annotation.
It releases WebForge-Bench (934 tasks across 7 domains and 3 difficulty levels) with seven-dimensional difficulty control that enables systematic capability profiling beyond aggregate scores.</description><pubDate>Mon, 13 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>dataset</category><category>automated environment generation</category><category>WebForge</category><category>WebForge-Bench</category></item><item><title>The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10577/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10577/</guid><description>OS-BLIND benchmarks computer-use agents under unintended attack scenarios where benign instructions trigger harmful outcomes through environmental context. Most agents exceed 90% attack success rate, and even safety-aligned Claude 4.5 Sonnet reaches 73%. Existing safety defenses activate only initially and fail to re-engage during execution, especially when subtask decomposition obscures harmful intent.</description><pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>safety</category><category>security</category><category>unintended attacks</category><category>OS-BLIND</category></item><item><title>The Amazing Agent Race: Strong Tool Users, Weak Navigators</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10261/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-10261/</guid><description>The Amazing Agent Race introduces 1,400 DAG-puzzle legs that require fork-merge tool chains over Wikipedia, distinguishing navigation from tool-use ability. The best agent reaches only 37.2%, with navigation errors dominating (27-52% of trials) while tool-use errors stay below 17%, revealing a navigation blind spot invisible to linear benchmarks.</description><pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>DAG puzzles</category><category>navigation</category><category>tool use</category><category>Wikipedia</category><category>AAR</category></item><item><title>CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09155/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09155/</guid><description>CORA reformulates safety as selective action execution: a Guardian model estimates action-conditional risk, Conformal Risk Control calibrates an execute/abstain boundary under a user-specified risk budget, and a Diagnostician proposes interventions for rejected actions. A Goal-Lock mechanism resists visual injection. 
The Phone-Harm benchmark with step-level harm labels is also released.</description><pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>safety</category><category>benchmark</category><category>conformal risk control</category><category>Phone-Harm</category><category>CORA</category></item><item><title>EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09815/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09815/</guid><description>EE-MCP frames computer-use agent design as hybrid policy learning that balances GUI interaction and MCP API calls, with an automated pipeline for environment generation, trajectory collection, and gap-driven task synthesis. An experience bank of LLM-learned rules enables inference-time improvement: distillation wins on MCP-dominant tasks (+17.8pp) while the experience bank excels on GUI-intensive tasks (+10.0pp).</description><pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>framework</category><category>MCP</category><category>self-evolving</category><category>experience learning</category><category>hybrid policy</category><category>EE-MCP</category></item><item><title>HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09937/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-09937/</guid><description>HealthAdminBench evaluates computer-use agents on healthcare administration via 4 realistic GUI environments (EHR, two payer portals, fax) and 135 expert-defined tasks decomposed into 1,698 subtasks. The best agent (Claude Opus 4.6 CUA) reaches only 36.3% end-to-end despite 82.8% subtask success, exposing a large gap to real-world reliability.</description><pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>healthcare</category><category>HealthAdminBench</category><category>EHR</category><category>long-horizon tasks</category></item><item><title>Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07831/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07831/</guid><description>This paper proposes Semantic-level UI Element Injection, a red-teaming method that overlays safety-aligned UI elements onto screenshots to misdirect GUI agents&apos; visual grounding.
Using a modular Editor-Overlapper-Victim pipeline with iterative search, optimized attacks improve attack success rate by up to 4.4x over random injection and transfer across models.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>safety</category><category>red teaming</category><category>visual grounding</category><category>UI injection</category></item><item><title>ClawBench: Can AI Agents Complete Everyday Online Tasks?</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08523/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08523/</guid><description>ClawBench evaluates AI agents on 153 everyday online tasks across 144 live production websites spanning purchases, bookings, and job applications. A lightweight interception layer blocks final submissions for safe evaluation. The best model (Claude Sonnet 4.6) achieves only 33.3%, exposing a large gap in real-world web automation.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>realistic website</category><category>long-horizon tasks</category><category>ClawBench</category></item><item><title>KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08455/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08455/</guid><description>KnowU-Bench is an online benchmark for personalized mobile agents on Android emulation with 42 general, 86 personalized, and 64 proactive tasks. It hides user profiles from the agent and forces genuine preference inference through multi-turn dialogues. Even frontier models fall below 50% under vague instructions requiring preference inference.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>personalization</category><category>proactive agents</category><category>KnowU-Bench</category></item><item><title>MolmoWeb: Open Visual Web Agent and Open Data for the Open Web</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08516/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08516/</guid><description>MolmoWeb is a family of fully open multimodal web agents (4B and 8B) trained on MolmoWebMix (100K+ synthetic trajectories and 30K+ human demonstrations). Operating as screenshot-only visual-language action policies without HTML or accessibility tree access, the models achieve SOTA on WebVoyager, Online-Mind2Web, and DeepShop, outperforming larger closed models like GPT-4o.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>model</category><category>dataset</category><category>MolmoWeb</category><category>open-source</category></item><item><title>Preference Redirection via Attention Concentration: An Attack on Computer Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08005/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-08005/</guid><description>PRAC is a novel attack on Computer Use Agents that redirects model attention toward a stealthy adversarial patch to alter internal preferences rather than directly manipulating outputs.
The attack influences product selection on online shopping platforms and generalizes across fine-tuned variants of the same backbone, highlighting risks for CUAs built on open-weight models.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>security</category><category>safety</category><category>attack</category><category>adversarial patch</category><category>attention manipulation</category><category>PRAC</category></item><item><title>Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07929/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07929/</guid><description>This paper presents a trace-level evaluation framework comparing human and GUI-agent behavior across task outcome, query formulation, and navigation in a production audio-streaming search application. With 39 participants and a state-of-the-art GUI agent on 10 multi-hop search tasks, the agent matches task success but follows search-centric, low-branching strategies versus humans&apos; content-centric exploration.</description><pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>evaluation</category><category>trace-level analysis</category><category>user simulation</category><category>search systems</category><category>behavioral alignment</category></item><item><title>Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07277/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07277/</guid><description>Android Coach shifts online RL training from Single State Single Action to Single State Multiple Actions by learning a critic that estimates action values and integrating a process reward model with group-wise advantage estimation. It improves UI-TARS-1.5-7B by 7.5% on AndroidLab and 8.3% on AndroidWorld with 1.4x higher training efficiency than PPO and GRPO.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>reinforcement learning</category><category>reward model</category><category>training efficiency</category><category>process reward model</category><category>Android Coach</category></item><item><title>GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07429/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-07429/</guid><description>GameWorld is a standardized browser-based benchmark for multimodal game agents with 34 games and 170 tasks. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>game agent</category><category>computer-use control</category><category>semantic control</category><category>GameWorld</category></item><item><title>What&apos;s Missing in Screen-to-Action? 
Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06995/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06995/</guid><description>UILoop treats GUI reasoning as a cyclic Screen → UI elements → Action process, enabling MLLMs to explicitly learn the localization, semantic functions, and usage of key UI elements. It introduces UI Comprehension-Bench (26K samples) and achieves state-of-the-art GUI reasoning performance with improved interpretability.</description><pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>benchmark</category><category>UILoop</category><category>UI comprehension</category></item><item><title>Don&apos;t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05477/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05477/</guid><description>VeriGUI treats action-effect verification as a first-class RL objective to handle non-deterministic GUI environments with network delays, rendering lags, and system failures. A Thinking-Verification-Action-Expectation framework identifies failures; two-phase training with Robust SFT and GRPO using asymmetric verification rewards reduces failure loops. A new Robustness Benchmark built on AndroidControl evaluates failure recognition and correction.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>reinforcement learning</category><category>GRPO</category><category>action verification</category><category>self-correction</category><category>VeriGUI</category></item><item><title>Gym-Anything: Turn any Software into an Agent Environment</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06126/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06126/</guid><description>Gym-Anything converts any software into an interactive computer-use environment via multi-agent setup and audit. It produces CUA-World with 10K+ long-horizon tasks spanning medical science, astronomy, and enterprise systems, plus CUA-World-Long with tasks requiring 500+ steps, far exceeding existing benchmarks.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>dataset</category><category>Gym-Anything</category><category>CUA-World</category><category>long-horizon tasks</category></item><item><title>MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06134/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06134/</guid><description>MAESTRO extends GUI agents from task execution to decision support by maintaining a shared preference memory.
It provides Preference-Grounded GUI Adaptation (augment, sort, filter, highlight) and Preference-Guided Workflow Navigation that detects preference conflicts and proposes backtracking.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>personalization</category><category>MAESTRO</category><category>preference</category><category>GUI adaptation</category></item><item><title>WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06367/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06367/</guid><description>WebSP-Eval is the first framework evaluating web agents on user-facing website security and privacy tasks such as cookie preferences, privacy settings, and session revocation. Across 200 task instances on 28 websites, agents fail on more than 45% of tasks with stateful UI elements like toggles and checkboxes.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>security</category><category>privacy</category><category>WebSP-Eval</category></item><item><title>GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-04399/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-04399/</guid><description>GUIDE decomposes GUI agent trajectory evaluation into three sequential stages — trajectory segmentation, subtask diagnosis, and structured error analysis — mirroring the compositional structure of GUI tasks. Evaluated on 932 industrial e-commerce trajectories, AGENTREWARDBENCH, and AndroidBench, it improves accuracy by up to 5.35 points over baselines while producing diagnostic insights.</description><pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>evaluation</category><category>trajectory diagnosis</category><category>hierarchical diagnosis</category><category>error analysis</category><category>GUIDE</category></item><item><title>IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05157/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-05157/</guid><description>IntentScore is a plan-aware reward model for computer-use agents trained from 398K offline GUI interaction steps across three OSes, using contrastive alignment and margin ranking objectives.
It achieves 97.5% pairwise discrimination and, when used as a re-ranker for Agent S3 on OSWorld, improves task success rate by 6.9 points.</description><pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>reward model</category><category>model</category><category>plan-aware reward</category><category>contrastive alignment</category><category>margin ranking</category><category>OSWorld</category></item><item><title>The Art of Building Verifiers for Computer Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06240/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-06240/</guid><description>This paper presents lessons from building a Universal Verifier for web agent trajectories, based on four principles: meaningful rubrics, separated process/outcome rewards, a controllable vs. uncontrollable failure distinction, and divide-and-conquer context management. The verifier reduces false positive rates to near zero, compared to WebVoyager (45%+) and WebJudge (22%+).</description><pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>reward model</category><category>verification</category><category>CUAVerifierBench</category></item><item><title>The Tool Illusion: Rethinking Tool Use in Web Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-03465/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-03465/</guid><description>This paper presents an extensive controlled study across diverse tool sources, backbone models, tool-use frameworks, and evaluation benchmarks to determine whether tools provide consistent gains for web agents. Its findings revise some prior conclusions and complement others with broader evidence.</description><pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>empirical study</category><category>tool use</category><category>WebArena</category></item><item><title>GPA: Learning GUI Process Automation from Demonstrations</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01676/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01676/</guid><description>GPA is a vision-based GUI process automation system that enables fast and stable process replay from a single demonstration. Using Sequential Monte Carlo-based localization and readiness calibration, it achieves higher success rates with 10x faster execution than Gemini 3 Pro on long-horizon GUI tasks, running entirely locally without cloud LLMs.</description><pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>training-free</category><category>process automation</category><category>long-horizon tasks</category><category>GPA</category><category>robotic process automation</category></item><item><title>HippoCamp: Benchmarking Contextual Agents on Personal Computers</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01221/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-01221/</guid><description>HippoCamp evaluates contextual agents on personal computers by modeling individual user profiles and requiring search over massive personal file collections to support context-aware reasoning, going beyond app-isolated benchmarks.
The benchmark emphasizes multimodal file management as a first-class capability for personal-computing assistants.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>benchmark</category><category>contextual agent</category><category>personal files</category><category>multimodal file management</category><category>HippoCamp</category></item><item><title>Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00842/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00842/</guid><description>Pare models digital apps as finite state machines with stateful navigation to enable realistic active user simulation for proactive agents. Pare-Bench provides 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps to test context observation, goal inference, and intervention timing.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>proactive agents</category><category>Pare</category></item><item><title>When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00892/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00892/</guid><description>The first systematic study of interruptible agents in long-horizon web navigation. It formalizes three interruption types (addition, revision, retraction) and introduces InterruptBench derived from WebArena-Lite, showing that handling mid-task user interruptions remains challenging for current LLMs.</description><pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>interruptibility</category><category>InterruptBench</category><category>WebArena</category></item><item><title>PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-29318/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-29318/</guid><description>PSPA-Bench evaluates personalization in smartphone GUI agents with 12,855+ personalized instructions across 10 daily-use scenarios and 22 mobile apps. 
Even the strongest of 11 benchmarked agents performs poorly under personalized settings, highlighting gaps in reasoning, perception, and long-term memory.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>dataset</category><category>personalization</category><category>PSPA-Bench</category></item><item><title>Terminal Agents Suffice for Enterprise Automation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00073/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2604-00073/</guid><description>This paper shows that a coding agent equipped only with a terminal and filesystem can match or outperform GUI-driven and MCP tool-augmented agents for enterprise automation tasks across ServiceNow, GitLab, and ERPNext, arguing that simple programmatic API interfaces combined with strong foundation models suffice.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>enterprise automation</category><category>terminal agents</category><category>empirical study</category></item><item><title>WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/webarena-infinity-generating-browser-environments-with-verifiable-tasks-at-scale/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/webarena-infinity-generating-browser-environments-with-verifiable-tasks-at-scale/</guid><description>WebArena-Infinity automates the generation of high-authenticity web environments with verifiable tasks from static artifacts like user manuals, using a multi-agent pipeline of coding and browser-use agents. It produces 10 environments with 1,260 tasks and 2,070 trajectories. Agents achieve notably lower success rates than on manually built benchmarks, suggesting the generated tasks capture meaningful complexity.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>dataset</category><category>environment synthesis</category><category>verifiable rewards</category><category>reinforcement learning</category><category>WebArena</category></item><item><title>GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26266/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26266/</guid><description>GUIDE is a training-free add-on for desktop GUI agents that retrieves relevant tutorial videos, turns them into planning and grounding annotations, and injects that expertise into existing agents without changing model parameters. 
On OSWorld, it improves multiple agent families while also reducing execution steps.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>video retrieval</category><category>automatic annotation</category><category>domain bias</category><category>training-free</category><category>OSWorld</category><category>GUIDE</category></item><item><title>Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26041/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26041/</guid><description>This paper studies token pruning for historical GUI screenshots and finds that background regions still carry useful state-transition cues, random pruning preserves spatial structure surprisingly well, and allocating larger token budgets to recent screenshots keeps performance nearly unchanged while reducing cost.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>token pruning</category><category>historical screenshots</category><category>random pruning</category><category>recency effect</category><category>empirical study</category></item><item><title>Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26211/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26211/</guid><description>This paper adapts the discrete diffusion model LLaDA-V to GUI grounding and proposes a hybrid masking schedule for bounding-box prediction. Across web, desktop, and mobile benchmarks, the diffusion model outperforms its linear-masked variant and remains competitive with autoregressive VLMs.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>diffusion models</category><category>hybrid masking</category><category>bounding-box prediction</category><category>LLaDA-V</category><category>cross-platform</category></item><item><title>Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26648/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-26648/</guid><description>Vision2Web is a hierarchical benchmark for visual website development that spans static UI-to-code, interactive frontend reproduction, and full-stack website construction. 
It evaluates coding agents with workflow-based verification using a GUI agent verifier and a VLM judge, and shows that current models still struggle badly on full-stack tasks.</description><pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>website development</category><category>agent verification</category><category>UI-to-code</category><category>full-stack development</category><category>Vision2Web</category></item><item><title>GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25864/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25864/</guid><description>GUIDE studies collaborative GUI assistance rather than pure task automation, using 67.5 hours of think-aloud recordings from 120 novice users across 10 software applications. It benchmarks behavior-state detection, intent prediction, and help prediction, and shows that current multimodal models still struggle to infer what users are doing and when intervention would be useful.</description><pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>benchmark</category><category>collaborative assistance</category><category>behavior state detection</category><category>intent prediction</category><category>help prediction</category><category>think-aloud data</category></item><item><title>WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25226/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-25226/</guid><description>WebTestBench studies end-to-end automated web testing rather than ordinary task completion, decomposing the problem into checklist generation and defect detection across diverse web applications. Its WebTester baseline shows that current systems still struggle with test completeness, latent logical defects, and long-horizon reliability.</description><pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>automated testing</category><category>defect detection</category><category>checklist generation</category><category>latent logical defects</category><category>WebTestBench</category></item><item><title>CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24440/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24440/</guid><description>CUA-Suite is a large-scale desktop-agent data ecosystem centered on continuous expert video rather than sparse screenshots. 
It combines VideoCUA, UI-Vision, and GroundCUA to provide 55 hours of demonstrations, dense grounding annotations, and evaluation data across 87 professional desktop applications where current foundation action models still fail frequently.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><category>Desktop</category><category>dataset</category><category>video demonstrations</category><category>desktop workflows</category><category>grounding dataset</category><category>VideoCUA</category><category>GroundCUA</category></item><item><title>Towards Automated Crowdsourced Testing via Personified-LLM</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24160/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24160/</guid><description>PersonaTester automates crowdsourced GUI testing by injecting empirically derived tester personas into LLM agents. On 15 mobile apps, it reproduces more diverse testing behaviors than non-persona baselines and triggers more crashes and functional bugs.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>GUI testing</category><category>crowdsourced testing</category><category>persona-guided testing</category><category>bug finding</category><category>PersonaTester</category></item><item><title>UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24533/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-24533/</guid><description>UI-Voyager is a self-evolving mobile GUI agent that learns from failed trajectories instead of manual annotations. Its two-stage training combines rejection fine-tuning with group-relative self-distillation to turn successful rollouts into dense corrective supervision, yielding 81.0% Pass@1 on AndroidWorld with a 4B model.</description><pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>reinforcement learning</category><category>self-evolving</category><category>failed trajectory learning</category><category>RFT</category><category>GRSD</category><category>AndroidWorld</category></item><item><title>AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23007/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23007/</guid><description>AgentRAE is a backdoor attack against screenshot-based mobile GUI agents that uses benign-looking notification icons as triggers for remote action execution. 
Its contrastive-pretraining plus poisoning pipeline preserves clean performance, exceeds 90% attack success over ten mobile operations, and evades eight representative defenses.</description><pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>backdoor attack</category><category>notification trigger</category><category>security</category><category>contrastive learning</category><category>Remote Action Execution</category><category>AgentRAE</category></item><item><title>CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23559/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-23559/</guid><description>ReCAP is a native GUI agent specialized for interactive CAPTCHA solving. It builds a seven-type CAPTCHA environment, generates large-scale reasoning-action trajectories plus self-correction data from failed attempts, and improves success from about 30% to 80% without sacrificing general GUI-agent performance.</description><pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>CAPTCHA</category><category>reasoning-action data generation</category><category>self-correction</category><category>native GUI agent</category><category>ReCAP</category></item><item><title>Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-22529/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-22529/</guid><description>Ego2Web is a benchmark that couples egocentric first-person videos with web tasks requiring real-world visual understanding before online interaction. 
It also introduces Ego2WebJudge, an LLM-as-a-judge evaluator with about 84% agreement with humans, and shows large headroom for current agents.</description><pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>egocentric video</category><category>LLM-as-a-judge</category><category>web planning</category><category>Ego2Web</category><category>Ego2WebJudge</category></item><item><title>ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-20340/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-20340/</guid><description>ContractSkill turns draft web-agent skills into executable artifacts with explicit contracts, enabling deterministic verification, local fault localization, and patch-based repair instead of full skill rewrites, improving skill reliability on VisualWebArena and MiniWoB while preserving transfer across models.</description><pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>agent skill</category><category>contract</category><category>program repair</category><category>verification</category><category>cross-model transfer</category></item><item><title>AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-18429/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-18429/</guid><description>AndroTMem studies interaction memory in long-horizon Android GUI agents through a 1,069-task benchmark designed to require carrying forward critical intermediate state. It introduces Anchored State Memory, which stores causally linked state anchors and improves completion rates by 5%-30.16% over replay and summary baselines across 12 agents.</description><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>agent memory</category><category>long-horizon tasks</category><category>Anchored State Memory</category><category>AndroTMem-Bench</category><category>AndroTMem</category></item><item><title>OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-19191/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-19191/</guid><description>OS-Themis is a scalable critic framework for GUI reward modeling that breaks trajectories into verifiable milestones and audits the evidence chain before issuing a verdict. 
It improves training and filtering loops on AndroidWorld and introduces OmniGUIRewardBench as a cross-platform benchmark for GUI outcome rewards.</description><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>reward model</category><category>reinforcement learning</category><category>critic framework</category><category>OmniGUIRewardBench</category><category>milestone decomposition</category><category>OS-Themis</category></item><item><title>AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17441/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17441/</guid><description>AdaZoom-GUI targets two concrete GUI-grounding bottlenecks: ambiguous natural-language instructions and tiny UI elements in high-resolution screenshots. It combines instruction rewriting with a conditional second-stage zoom-in pass and reports state-of-the-art grounding performance among comparable model sizes.</description><pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>GUI grounding</category><category>instruction refinement</category><category>adaptive zoom</category><category>GRPO</category><category>bounding-box prediction</category><category>AdaZoom-GUI</category></item><item><title>FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17826/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17826/</guid><description>FailureMem is a multimodal automated program repair framework for settings where repair requires joint reasoning over code, issue text, and GUI screenshots. It combines hybrid workflow-agent control, region-level visual grounding, and a Failure Memory Bank that converts failed repair attempts into reusable guidance, improving the resolved rate over GUIRepair on SWE-bench Multimodal.</description><pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate><category>General GUI</category><category>multimodal program repair</category><category>software engineering</category><category>program repair</category><category>failure memory</category><category>SWE-bench Multimodal</category><category>FailureMem</category></item><item><title>WebPII: Benchmarking Visual PII Detection for Computer-Use Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17357/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-17357/</guid><description>WebPII is a benchmark for detecting personally identifiable information in web screenshots, centered on e-commerce interfaces where agents may expose sensitive content during browsing and form completion.
It extends the PII taxonomy to transaction-level identifiers and partially filled forms, and pairs the benchmark with WebRedact to show practical privacy-preserving deployment.</description><pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate><category>Web</category><category>benchmark</category><category>privacy</category><category>PII detection</category><category>visual redaction</category><category>e-commerce UI</category><category>WebPII</category></item><item><title>GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents</title><link>https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-15039/</link><guid isPermaLink="true">https://osu-nlp-group.github.io/GUI-Agents-Paper-List/papers/arxiv-2603-15039/</guid><description>GUI-CEval is the first comprehensive Chinese benchmark for mobile GUI agents, spanning 201 apps across four device types with a hierarchical two-level evaluation structure (atomic abilities and application-level tasks) along five dimensions (perception, planning, reflection, execution, evaluation), revealing that most MLLMs still struggle with reflective decision-making and post-action self-evaluation.</description><pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate><category>Mobile</category><category>benchmark</category><category>Chinese</category><category>hierarchical evaluation</category><category>physical-device evaluation</category><category>GUI-CEval</category></item></channel></rss>