HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, Esther Nubla, Pritika Sharma, Michael A. Pfeffer, Sanmi Koyejo, Nigam H. Shah
- 🏛 Institutions
- Stanford
- 📅 Date
- April 10, 2026
- 📑 Publisher
- arXiv
- 💻 Env
- Desktop
- 🔑 Keywords
TLDR
HealthAdminBench evaluates computer-use agents on healthcare administration via 4 realistic GUI environments (EHR, two payer portals, fax) and 135 expert-defined tasks decomposed into 1,698 subtasks. The best agent (Claude Opus 4.6 CUA) reaches only 36.3% end-to-end despite 82.8% subtask success, exposing a large gap to real-world reliability.
Related papers
- Gym-Anything: Turn any Software into an Agent EnvironmentApril 7, 2026 · arXiv
- OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive TasksJanuary 28, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the WildApril 13, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks?April 9, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI AgentsMarch 19, 2026 · arXiv
- MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World EnvironmentJanuary 28, 2026 · arXiv