ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, Segev Shlomov

🏛 Institutions: IBM Research
📅 Date: October 9, 2024
📑 Publisher: ICLR 2026 (Poster)
💻 Env: Web
🔑 Keywords: benchmark, safety, trustworthiness, policy compliance, CuP, ST-WebAgentBench

TLDR

ST-WebAgentBench is a benchmark for enterprise-style web-agent evaluation that pairs 375 tasks with 3,057 safety and trustworthiness policies and introduces policy-aware metrics such as Completion Under Policy (CuP) and Risk Ratio. The paper shows that strong agents lose a large fraction of their nominal completion rate once policy compliance is required.
