HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
Suhana Bedi, Ryan Welch, Ethan Steinberg, Michael Wornow, Taeil Matthew Kim, Haroun Ahmed, Peter Sterling, Bravim Purohit, Qurat Akram, Angelic Acosta, Esther Nubla, Pritika Sharma, Michael A. Pfeffer, Sanmi Koyejo, Nigam H. Shah
- 🏛 Institutions: Stanford
- 📅 Date: April 10, 2026
- 📑 Publisher: arXiv
- 💻 Env: Desktop
TLDR
HealthAdminBench evaluates computer-use agents on healthcare administration across four realistic GUI environments (EHR, two payer portals, fax) and 135 expert-defined tasks decomposed into 1,698 subtasks. The best agent (Claude Opus 4.6 CUA) reaches only 36.3% end-to-end task success despite 82.8% subtask success, exposing a large gap to real-world reliability.
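The gap between the two figures in the TLDR is easier to see with a small worked example. The sketch below is illustrative only: it assumes a task counts as an end-to-end success only when every one of its subtasks passes, which the summary does not state, and the function names and `toy_results` data are hypothetical rather than taken from the benchmark.

```python
from typing import Dict, List


def subtask_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all subtasks that passed, pooled across tasks."""
    total = sum(len(outcomes) for outcomes in results.values())
    passed = sum(sum(outcomes) for outcomes in results.values())
    return passed / total


def end_to_end_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose subtasks ALL passed.

    Assumption: a single failed subtask fails the whole task.
    """
    solved = sum(all(outcomes) for outcomes in results.values())
    return solved / len(results)


# Hypothetical toy data: three tasks with four subtasks each.
toy_results = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, True, False],
    "task_c": [True, True, False, True],
}

print(f"subtask success:    {subtask_success_rate(toy_results):.1%}")
print(f"end-to-end success: {end_to_end_success_rate(toy_results):.1%}")
```

In this toy data, 10 of 12 subtasks pass but only 1 of 3 tasks is completed in full, so a high subtask rate can coexist with a much lower end-to-end rate when failures are spread across tasks.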
Related papers (24)
- Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields · June 9, 2026 · arXiv
- Gym-Anything: Turn any Software into an Agent Environment · April 7, 2026 · arXiv
- OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks · January 28, 2026 · arXiv
- SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking · May 24, 2026 · arXiv
- CocoaBench: Evaluating Unified Digital Agents in the Wild · April 13, 2026 · arXiv
- ClawBench: Can AI Agents Complete Everyday Online Tasks? · April 9, 2026 · arXiv
- AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents · March 19, 2026 · arXiv
- MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment · January 28, 2026 · arXiv
- LongHorizonUI: A Unified Framework for Robust long-horizon Task Automation of GUI Agent · January 26, 2026 · ICLR 2026 (Poster)
- MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments · December 22, 2025 · arXiv
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? · October 21, 2024 · EMNLP 2024 (Poster)
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments · April 30, 2026 · arXiv
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents · April 12, 2026 · arXiv
- GPA: Learning GUI Process Automation from Demonstrations · April 2, 2026 · arXiv
- HippoCamp: Benchmarking Contextual Agents on Personal Computers · April 1, 2026 · arXiv
- PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents · March 9, 2026 · arXiv
- OSExpert: Computer-Use Agents Learning Professional Skills via Exploration · March 9, 2026 · arXiv
- When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents · February 9, 2026 · arXiv
- When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents · February 9, 2026 · arXiv
- EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents · January 25, 2026 · arXiv
- MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction · January 19, 2026 · arXiv
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands · December 31, 2025 · arXiv
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks · December 18, 2025 · arXiv
- OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models · December 18, 2025 · arXiv