GUI Agents Papers

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, Segev Shlomov

🏛 Institutions
IBM Research
📅 Date
October 9, 2024
📑 Publisher
ICLR 2026 (Poster)
💻 Env
Web
🔑 Keywords
TLDR

ST-WebAgentBench is a benchmark for enterprise-style web-agent evaluation that pairs 375 tasks with 3,057 safety and trustworthiness policies and introduces policy-aware metrics such as Completion Under Policy (CuP) and Risk Ratio. The paper shows that strong agents lose a large fraction of their nominal completion rate once policy compliance is required.
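To make the gap between nominal completion and policy-aware completion concrete, here is a minimal Python sketch. It rests on assumptions rather than the paper's exact definitions: CuP is taken to credit a task only when it is completed with zero policy violations, and Risk Ratio is taken to be violations divided by the number of policies checked. The `Episode` type, the function names, and the toy data are hypothetical and are not part of the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on one task (hypothetical record, not the benchmark's schema)."""
    completed: bool   # did the agent finish the task?
    violations: int   # number of policy violations observed during the run


def completion_rate(episodes):
    """Nominal completion: fraction of tasks finished, ignoring policies."""
    return sum(e.completed for e in episodes) / len(episodes)


def completion_under_policy(episodes):
    """Assumed CuP: a task counts only if completed with zero violations."""
    return sum(e.completed and e.violations == 0 for e in episodes) / len(episodes)


def risk_ratio(total_violations, policies_checked):
    """Assumed Risk Ratio: violations normalized by the number of policies checked."""
    return total_violations / policies_checked if policies_checked else 0.0


# Toy data, invented for illustration only.
episodes = [
    Episode(completed=True, violations=0),
    Episode(completed=True, violations=2),
    Episode(completed=False, violations=0),
    Episode(completed=True, violations=1),
]

nominal = completion_rate(episodes)
cup = completion_under_policy(episodes)
risk = risk_ratio(sum(e.violations for e in episodes), policies_checked=12)
print(nominal, cup, risk)
```

In this toy data three of the four tasks are completed, but only one is completed without a violation, so the policy-aware score is well below the nominal one. That is the kind of drop the TLDR describes.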
