When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

Jaylen Jones , Zhehao Zhang , Yuting Ning , Eric Fosler-Lussier , Pierre-Luc St-Charles , Yoshua Bengio , Dawn Song , Yu Su , Huan Sun

🏛 Institutions: OSU , LawZero , Mila , UdeM , UC Berkeley
📅 Date: February 9, 2026
📑 Publisher: arXiv
💻 Env: Desktop
🔑 Keywords: safety unintended behaviors AutoElicit benchmark red teaming

TLDR

AutoElicit is an agentic framework that iteratively perturbs benign instructions using CUA execution feedback to surface unintended harmful behaviors while keeping inputs realistic. It elicits severe harms from frontier CUAs like Claude 4.5 and Operator in up to 72.5% of OS-domain seeds, and evaluates cross-model transferability of verified perturbations.

Open paper arXiv Report issue