TimeWarp: Evaluating Web Agents by Revisiting the Past

🏛 Institutions: University of Utah
📅 Date: March 5, 2026
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: benchmark, evolving interfaces, plan distillation, TimeWarp, TimeTraj, behavior cloning

TLDR

TimeWarp evaluates web agents under interface drift by recreating multiple historical UI versions of the same environments. The paper shows current agents are brittle to design changes and introduces TimeTraj, which distills plans across versions to improve robustness.

Related papers (24)

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

April 27, 2026 · arXiv
WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

April 13, 2026 · arXiv
The Amazing Agent Race: Strong Tool Users, Weak Navigators

April 11, 2026 · arXiv
ClawBench: Can AI Agents Complete Everyday Online Tasks?

April 9, 2026 · arXiv
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

April 8, 2026 · arXiv
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

April 7, 2026 · arXiv
The Art of Building Verifiers for Computer Use Agents

April 5, 2026 · arXiv
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

April 1, 2026 · arXiv
WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale

March 2026 · Blog Post
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

March 27, 2026 · arXiv
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

March 26, 2026 · arXiv
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

March 23, 2026 · CVPR 2026
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

March 18, 2026 · arXiv
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

March 5, 2026 · arXiv
Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

February 19, 2026 · arXiv
PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents

February 5, 2026 · arXiv
WebTrap Park: An Automated Platform for Systematic Security Evaluation of Web Agents

January 13, 2026 · arXiv
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

January 5, 2026 · arXiv
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

December 29, 2025 · arXiv
DECEPTICON: How Dark Patterns Manipulate Web Agents

December 28, 2025 · arXiv
VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

December 18, 2025 · arXiv
OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

December 18, 2025 · arXiv
LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents

November 28, 2025 · arXiv
Investigating the Impact of Dark Patterns on LLM-Based Web Agents

October 20, 2025 · IEEE S&P 2026