MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li

🏛 Institutions: Salesforce AI Research
📅 Date: August 20, 2025
📑 Publisher: arXiv
💻 Env
🔑 Keywords: benchmark, dataset, framework, long-horizon reasoning, unknown-tools challenge, execution-based evaluation, MCP-universe

TLDR

MCP-Universe introduces the first comprehensive benchmark for evaluating large language models (LLMs) through interaction with real-world Model Context Protocol (MCP) servers. It spans six core domains (Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching) across 11 MCP servers. Execution-based evaluators of three kinds (format, static, and dynamic) assess agent performance. Even state-of-the-art models show significant limitations: GPT-5 reaches a 43.72% success rate, Grok-4 33.33%, and Claude-4.0-Sonnet 29.44%. The benchmark highlights two challenges, long-context reasoning and handling unfamiliar tools, and provides an open-source, extensible evaluation framework with UI support to accelerate future research.
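
To make the three evaluator kinds concrete, here is a minimal Python sketch. It is not the benchmark's actual code. It assumes that a format evaluator checks the structure of the agent's answer, a static evaluator compares against a fixed ground truth, and a dynamic evaluator fetches the ground truth at evaluation time. All function names, the JSON answer shape, and the tolerance parameter are hypothetical.

```python
import json
from typing import Callable


def format_evaluator(response: str, required_keys: set) -> bool:
    """Pass if the response is a JSON object containing every required key."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def static_evaluator(answer: dict, key: str, expected: object) -> bool:
    """Pass if a field matches a ground truth that does not change over time."""
    return answer.get(key) == expected


def dynamic_evaluator(
    answer: dict, key: str, fetch_truth: Callable[[], float], tolerance: float
) -> bool:
    """Pass if a numeric field is within `tolerance` of a value fetched now.

    `fetch_truth` is a hypothetical stand-in for a live lookup, for example
    a current stock price.
    """
    return abs(answer[key] - fetch_truth()) <= tolerance


def task_succeeds(response: str, fetch_truth: Callable[[], float]) -> bool:
    """A task counts as a success only if every evaluator passes."""
    if not format_evaluator(response, {"ticker", "price"}):
        return False
    answer = json.loads(response)
    return static_evaluator(answer, "ticker", "CRM") and dynamic_evaluator(
        answer, "price", fetch_truth, tolerance=1.0
    )
```

In this sketch the format check runs first, so the other two evaluators only ever see a well-formed answer. A success rate such as those reported above would then be the fraction of tasks for which `task_succeeds` returns `True`.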
