CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents

🏛 Institutions: Ukrainian Catholic University
📅 Date: March 11, 2026
📑 Publisher: HEAL @ CHI 2026 Workshop
💻 Env: Desktop
🔑 Keywords: evaluation VLM judge agent-as-a-judge calibration meta-evaluation CUAAudit

TLDR

CUAAudit studies vision-language models as autonomous judges of desktop-agent task success from observable interactions alone. Across multiple operating-system benchmarks, it finds that even strong VLM auditors degrade on harder environments and disagree substantially with one another, highlighting limits of model-based auditing.
