WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
Yaoyao Qian, Yuanli Wang, Jinda Zhang, Yun Zong, Meixu Chen, Hanhan Zhou, Jindan Huang, Yifan Zeng, Xinyu Hu, Chan Hee Song, Danqing Zhang
- 🏛 Institutions
- Northeastern University, Boston University, University of Victoria, University of Minnesota, George Washington University, Tufts University, Oregon State University, University of Texas at San Antonio, OSU, PathOnAI.org
- 📅 Date
- October 22, 2025
- 📑 Publisher
- NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models
- 💻 Env
- Web
TLDR
WebGraphEval evaluates web agents by converting many interaction trajectories into a unified weighted action graph instead of scoring only final success or conformity to one reference path. This graph view highlights redundancy, inefficiency, and critical decision points across agents and benchmark runs.
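
The idea of merging many trajectories into one weighted action graph can be illustrated with a minimal sketch. It is not the paper's implementation: this summary does not say how nodes, edges, or weights are defined, so the sketch assumes each trajectory is a list of action labels, that nodes are actions, and that an edge weight counts how often one action directly follows another across all runs. The function names, the `<start>`/`<end>` sentinels, and the sample trajectories are all hypothetical.

```python
from collections import Counter

# Hypothetical sentinel nodes marking where every trajectory begins and ends.
START, END = "<start>", "<end>"


def build_action_graph(trajectories):
    """Merge trajectories into one weighted directed graph.

    Returns a Counter mapping (source_action, target_action) to the number
    of times that transition was observed across all trajectories.
    Assumption: edge weight = transition frequency.
    """
    edges = Counter()
    for steps in trajectories:
        path = [START, *steps, END]
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges


def branching_points(edges):
    """Return actions that are followed by more than one distinct action."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    return {src for src, nxt in successors.items() if len(nxt) > 1}


# Made-up trajectories from three runs of the same illustrative task.
trajectories = [
    ["open_home", "search", "click_result", "add_to_cart"],
    ["open_home", "search", "scroll", "click_result", "add_to_cart"],
    ["open_home", "browse_category", "click_result", "add_to_cart"],
]

graph = build_action_graph(trajectories)
decision_points = branching_points(graph)
```

In this toy graph, heavily weighted edges mark steps shared by most runs, lightly weighted detours such as the extra `scroll` step hint at redundancy, and the branching nodes are candidates for the decision points where runs diverge.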
Related papers (24)
- Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems · April 9, 2026 · arXiv
- AI Planning Framework for LLM-Based Web Agents · March 13, 2026 · arXiv
- An Illusion of Progress? Assessing the Current State of Web Agents · April 2, 2025 · COLM 2025
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis · April 6, 2026 · arXiv
- CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents · March 11, 2026 · HEAL @ CHI 2026 Workshop
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments · February 3, 2026 · arXiv
- SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks · January 26, 2026 · ICLR 2026 (Poster)
- Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents · December 14, 2025 · arXiv
- Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents · May 17, 2025 · arXiv
- GUI Agents: A Survey · December 18, 2024 · Findings of ACL 2025
- GUI Agents for Continual Game Generation · May 27, 2026 · arXiv
- Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · April 27, 2026 · arXiv
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- The Amazing Agent Race: Strong Tool Users, Weak Navigators · April 11, 2026 · arXiv
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web · April 9, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents · April 8, 2026 · arXiv
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks · April 7, 2026 · arXiv
- The Art of Building Verifiers for Computer Use Agents · April 5, 2026 · arXiv
- The Tool Illusion: Rethinking Tool Use in Web Agents · April 3, 2026 · arXiv
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation · April 1, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale · March 2026 · Blog Post
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification · March 27, 2026 · arXiv
- WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing · March 26, 2026 · arXiv