Reliable Weak-to-Strong Monitoring of LLM Agents

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, Zifan Wang

🏛 Institutions: Scale AI, Carnegie Mellon University, Massachusetts Institute of Technology
📅 Date: August 26, 2025
📑 Publisher: ICLR 2025
💻 Env
🔑 Keywords: safety LLM agent monitoring red-teaming

TLDR

Stress-tests LLM agent monitoring systems that detect covert misbehavior, using a monitor red-teaming (MRT) workflow that varies agent and monitor awareness as well as adversarial evasion strategies. The evaluation uses SHADE-Arena for tool-calling agents and CUA-SHADE-Arena for computer-use agents.


Related papers (8)

Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming
October 21, 2025 · ICME 2026

Human-Guided Harm Recovery for Computer Use Agents
April 20, 2026 · arXiv

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
April 12, 2026 · arXiv

CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
April 10, 2026 · arXiv

Preference Redirection via Attention Concentration: An Attack on Computer Use Agents
April 9, 2026 · arXiv

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
April 9, 2026 · arXiv

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
March 4, 2026 · arXiv

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
February 9, 2026 · arXiv