# MCPMark MCPMark is a comprehensive evaluation suite for evaluating the agentic ability of frontier models. MCPMark includes Model Context Protocol (MCP) service in following environments - Notion - Github - Filesystem - Postgres - Playwright - Playwright-WebArena ### General Procedure MCPMark is designed to run agentic tasks in complex environment **safely**. Specifically, it sets up an isolated environment for the experiment, completing the task, and then destroy the environment without affecting existing user profile or information. ### How to Use MCPMark 1. MCPMark Installation. 2. Authorize service (for Github and Notion). 3. Configure the environment variables in `.mcp_env`. 4. Run MCPMark experiment. Please refer to [Quick Start](./quickstart.md) for details regarding how to start a sample filesystem experiment in properly, and [Task Page](./datasets/task.md) for task details. Please visit [Installation and Docker Uusage](./installation_and_docker_usage.md) information of full MCPMark setup. ### Running MCPMark MCPMark supports the following mode to run experiments (suppose the experiment is named as new_exp, and the model used are o3 and gpt-4.1 and the environment is notion), with K repetive experiments. #### MCPMark in Pip Installation ```bash # Evaluate ALL tasks python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3 --k K # Evaluate a single task group (online_resume) python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume --models o3 --k K # Evaluate one specific task (task_1 in online_resume) python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume/task_1 --models o3 --k K # Evaluate multiple models python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3,gpt-4.1 --k K ``` #### MCPMark in Docker Installation ```bash # Run all tasks for one service ./run-task.sh --mcp notion --models o3 --exp-name new_exp --tasks all # Run comprehensive benchmark across all services ./run-benchmark.sh --models o3,gpt-4.1 --exp-name new_exp --docker ``` #### Experiment Auto-Resume For re-run experiments, only unfinished tasks will be executed. Tasks that previously failed due to pipeline errors (such as State Duplication Error or MCP Network Error) will also be retried automatically. ### Results The experiment results are written to `./results/` (JSON + CSV). #### Reult Aggregation (for K > 1) MCP supports aggreated metrics of pass@1, pass@K, pass^K, avg@K. ```bash python -m src.aggregators.aggregate_results --exp-name new_exp ``` ### Model Support MCPMark supports the following models with according providers (model codes in the brackets). #### OpenAI - GPT-5 (gpt-5) - o3 (o3) #### Anthropic - Claude-4.1-Opus (claude-4.1-opus) - Claude-4-Sonnet (claude-4-sonnet) #### Google - Gemini-2.5-Pro (gemini-2.5-pro) #### Grok - Grok-4 (grok-4) #### Deepseek - DeepSeek-Chat (deepseek-chat) #### Alibaba - Qwen3-Coder (qwen-3-coder) #### Kimi - Kimi-K2 (k2) ### Want to contribute? Visit [Contributing Page](./contributing) to learn how to make contribution to MCPMark.