When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
Jaylen Jones , Zhehao Zhang , Yuting Ning , Eric Fosler-Lussier , Pierre-Luc St-Charles , Yoshua Bengio , Dawn Song , Yu Su , Huan Sun
- 🏛 Institutions
- OSU , LawZero , Mila , UdeM , UC Berkeley
- 📅 Date
- February 9, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop
- 🔑 Keywords
TLDR
AutoElicit is an agentic framework that iteratively perturbs benign instructions using CUA execution feedback to surface unintended harmful behaviors while keeping inputs realistic. It elicits severe harms from frontier CUAs like Claude 4.5 and Operator in up to 72.5% of OS-domain seeds, and evaluates cross-model transferability of verified perturbations.
Related papers (24)
- Refusal-Trained LLMs Are Easily Jailbroken As Browser AgentsOctober 11, 2024 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use AgentsApril 12, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use AgentsFebruary 9, 2026 · arXiv
- macOSWorld: A Multilingual Interactive Benchmark for GUI AgentsJune 4, 2025 · NeurIPS 2025 (Poster)
- RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use AgentsMay 31, 2025 · NeurIPS 2025 (Poster)
- CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI AutomationApril 10, 2026 · arXiv
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element InjectionApril 9, 2026 · arXiv
- LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial ScenariosFebruary 3, 2026 · arXiv
- WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web AgentsJanuary 13, 2026 · arXiv
- It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web AgentsDecember 29, 2025 · arXiv
- DECEPTICON: How Dark Patterns Manipulate Web AgentsDecember 28, 2025 · arXiv
- Investigating the Impact of Dark Patterns on LLM-Based Web AgentsOctober 20, 2025 · IEEE S&P 2026
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device ControlOctober 23, 2024 · arXiv
- AdvAgent: Controllable Blackbox Red-teaming on Web AgentsOctober 22, 2024 · ICML 2025 (Poster)
- ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web AgentsOctober 9, 2024 · ICLR 2026 (Poster)
- Dissecting Adversarial Robustness of Multimodal LM AgentsJune 18, 2024 · ICLR 2025 (Poster)
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional FieldsJune 9, 2026 · arXiv
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application EnvironmentsApril 30, 2026 · arXiv
- HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration TasksApril 10, 2026 · arXiv
- Preference Redirection via Attention Concentration: An Attack on Computer Use AgentsApril 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- HippoCamp: Benchmarking Contextual Agents on Personal ComputersApril 1, 2026 · arXiv
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation AgentsMarch 9, 2026 · arXiv
- OSExpert: Computer-Use Agents Learning Professional Skills via ExplorationMarch 9, 2026 · arXiv