Agentic Test-Time Scaling for WebAgents

Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

🏛 Institutions: UC Berkeley, ICSI, LBNL
📅 Date: February 12, 2026
📑 Publisher: arXiv
💻 Env: Web
🔑 Keywords: test-time scaling, CATTS, inference-time compute, uncertainty estimation, LLM arbiter, WebArena-Lite

TLDR

CATTS dynamically allocates test-time compute for multi-step web agents by using vote-based uncertainty signals to invoke an LLM arbiter only on contentious decisions. It improves performance on WebArena-Lite and GoBrowse while using fewer tokens than uniform scaling.
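
The gating idea in the TLDR can be illustrated with a minimal sketch. It is not the authors' implementation: the normalized vote-entropy measure, the `threshold` value, and the names `vote_entropy`, `select_action`, and `arbiter` are all assumptions made for illustration. The sketch samples several candidate actions per step, accepts the majority vote when the candidates mostly agree, and calls a (hypothetical) arbiter only when they disagree.

```python
from collections import Counter
from math import log
from typing import Callable, List, Sequence, Tuple


def vote_entropy(votes: Sequence[str]) -> float:
    """Shannon entropy of the vote distribution, normalized to [0, 1].

    Assumption: entropy is used as the uncertainty signal; the paper may
    use a different vote-based statistic.
    """
    counts = Counter(votes)
    if len(counts) <= 1:
        return 0.0  # unanimous (or single) vote: no uncertainty
    total = len(votes)
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(total)  # log(total) is the maximum for `total` votes


def select_action(
    candidates: Sequence[str],
    arbiter: Callable[[List[str]], str],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[str, bool]:
    """Return (chosen action, whether the arbiter was invoked)."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    counts = Counter(candidates)
    majority_action, _ = counts.most_common(1)[0]
    if vote_entropy(candidates) <= threshold:
        # Low disagreement: majority vote is cheap and sufficient.
        return majority_action, False
    # High disagreement: spend extra compute on the arbiter, which sees
    # the distinct candidate actions and picks one.
    return arbiter(list(counts)), True


if __name__ == "__main__":
    # Stand-in arbiter for the demo; a real one would query an LLM.
    pick_first = lambda options: options[0]
    print(select_action(["click(3)"] * 4 + ["type(5)"], pick_first))
    print(select_action(["a", "b", "c", "a", "d"], pick_first))
```

Under this sketch, compute scales with disagreement: steps where the sampled actions agree cost only the sampling itself, while the arbiter call is reserved for contentious steps.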


Related papers (24)

WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

May 22, 2025 · EMNLP 2025 (Poster)
WebATLAS: An LLM Agent with Experience-Driven Memory and Action Simulation

October 26, 2025 · NeurIPS 2025 Workshop on Language Agents and World Models
JEF-Hinter: Leveraging Offline Knowledge for Improving Web Agents Adaptation

October 5, 2025 · arXiv
Test‑Time Reinforcement Learning for GUI Grounding via Region Consistency

August 7, 2025 · AAAI 2026
ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search

May 21, 2025 · arXiv
GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models

January 26, 2026 · arXiv
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

December 5, 2025 · arXiv
Scaling Agents for Computer Use

October 2, 2025 · arXiv
GUI Agents for Continual Game Generation

May 27, 2026 · arXiv
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

April 27, 2026 · arXiv
WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

April 13, 2026 · arXiv
The Amazing Agent Race: Strong Tool Users, Weak Navigators

April 11, 2026 · arXiv
Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems

April 9, 2026 · arXiv
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

April 9, 2026 · arXiv
ClawBench: Can AI Agents Complete Everyday Online Tasks?

April 9, 2026 · arXiv
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

April 8, 2026 · arXiv
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

April 7, 2026 · arXiv
The Art of Building Verifiers for Computer Use Agents

April 5, 2026 · arXiv
The Tool Illusion: Rethinking Tool Use in Web Agents

April 3, 2026 · arXiv
When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

April 1, 2026 · arXiv
WebArena-Infinity: Generating Browser Environments with Verifiable Tasks at Scale

March 2026 · Blog Post
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

March 27, 2026 · arXiv
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

March 26, 2026 · arXiv
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

March 23, 2026 · CVPR 2026