RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G. Roosta, Tianmin Shu
- 🏛 Institutions: JHU, Amazon
- 📅 Date: April 14, 2025
- 📑 Publisher: AAAI 2026
- 💻 Env: Web
- 🔑 Keywords
TLDR
RealWebAssist benchmarks long-horizon web assistance with sequential instructions collected from real users rather than isolated single-task prompts. Its dataset spans 1,885 instructions across 107 tasks on 66 websites and highlights challenges such as ambiguous intent, evolving user goals, routine understanding, and grounding actions to the right GUI elements.
Related papers (24)
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark · April 13, 2026 · arXiv
- WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale · March 2026 · Blog Post
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces · March 5, 2026 · arXiv
- Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents · August 3, 2025 · ICLR 2026 (Poster)
- Web-Shepherd: Advancing PRMs for Reinforcing Web Agents · May 21, 2025 · NeurIPS 2025 (Spotlight)
- AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories · April 11, 2025 · COLM 2025
- WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks · July 7, 2024 · NeurIPS 2024 Datasets and Benchmarks Track (Poster)
- GUI Action Narrator: Where and When Did That Action Take Place? · June 19, 2024 · arXiv
- WebCanvas: Benchmarking Web Agents in Online Environments · June 18, 2024 · arXiv
- GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding · June 16, 2024 · ICLR 2025 (Poster)
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web · February 29, 2024 · ECCV 2024 (Poster)
- On the Multi-turn Instruction Following for Conversational Web Agents · February 23, 2024 · ACL 2024
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue · February 8, 2024 · ICML 2024
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents · January 17, 2024 · ACL 2024
- WebVLN: Vision-and-Language Navigation on Websites · December 25, 2023 · AAAI 2024
- Mind2Web: Towards a Generalist Agent for the Web · June 9, 2023 · NeurIPS 2023 Datasets and Benchmarks Track
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents · July 31, 2022 · NeurIPS 2022
- Grounding Open-Domain Instructions to Automate Web Support Tasks · March 30, 2021 · NAACL 2021
- WebSRC: A Dataset for Web-Based Structural Reading Comprehension · January 23, 2021 · EMNLP 2021
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent · March 31, 2026 · arXiv
- SecAgent: Efficient Mobile GUI Agent with Semantic Context · March 9, 2026 · arXiv
- Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization · February 24, 2026 · arXiv
- AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild · February 12, 2026 · arXiv