Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong

🏛 Institutions: Google DeepMind, UNC
📅 Date: March 23, 2026
📑 Publisher: CVPR 2026
💻 Env: Web
🔑 Keywords: benchmark egocentric video LLM-as-a-judge web planning Ego2Web Ego2WebJudge

TLDR

Ego2Web is a benchmark that pairs egocentric (first-person) videos with web tasks, so an agent must understand the real-world visual scene before it interacts online. It also introduces Ego2WebJudge, an LLM-as-a-judge evaluator that reaches about 84% agreement with human judgments, and shows that current agents leave substantial headroom.


Related papers (24)

An Illusion of Progress? Assessing the Current State of Web Agents
April 2, 2025 · COLM 2025

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
April 27, 2026 · arXiv

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
April 13, 2026 · arXiv

The Amazing Agent Race: Strong Tool Users, Weak Navigators
April 11, 2026 · arXiv

ClawBench: Can AI Agents Complete Everyday Online Tasks?
April 9, 2026 · arXiv

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
April 8, 2026 · arXiv

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
April 7, 2026 · arXiv

The Art of Building Verifiers for Computer Use Agents
April 5, 2026 · arXiv

When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
April 1, 2026 · arXiv

WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale
March 2026 · Blog Post

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
March 27, 2026 · arXiv

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
March 26, 2026 · arXiv

WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
March 18, 2026 · arXiv

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
March 5, 2026 · arXiv

TimeWarp: Evaluating Web Agents by Revisiting the Past
March 5, 2026 · arXiv

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
February 19, 2026 · arXiv

PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
February 5, 2026 · arXiv

WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents
January 13, 2026 · arXiv

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
January 5, 2026 · arXiv

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
December 29, 2025 · arXiv

DECEPTICON: How Dark Patterns Manipulate Web Agents
December 28, 2025 · arXiv

VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
December 18, 2025 · arXiv

OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
December 18, 2025 · arXiv

LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents
November 28, 2025 · arXiv