mcpmark / docs /introduction.md
haochengsama's picture
Add files using upload-large-folder tool
97cb846 verified
|
Raw
History Blame Contribute Delete
3.1 kB
# MCPMark
MCPMark is a comprehensive evaluation suite for evaluating the agentic ability of frontier models.
MCPMark includes Model Context Protocol (MCP) service in following environments
- Notion
- Github
- Filesystem
- Postgres
- Playwright
- Playwright-WebArena
### General Procedure
MCPMark is designed to run agentic tasks in complex environment **safely**. Specifically, it sets up an isolated environment for the experiment, completing the task, and then destroy the environment without affecting existing user profile or information.
### How to Use MCPMark
1. MCPMark Installation.
2. Authorize service (for Github and Notion).
3. Configure the environment variables in `.mcp_env`.
4. Run MCPMark experiment.
Please refer to [Quick Start](./quickstart.md) for details regarding how to start a sample filesystem experiment in properly, and [Task Page](./datasets/task.md) for task details. Please visit [Installation and Docker Uusage](./installation_and_docker_usage.md) information of full MCPMark setup.
### Running MCPMark
MCPMark supports the following mode to run experiments (suppose the experiment is named as new_exp, and the model used are o3 and gpt-4.1 and the environment is notion), with K repetive experiments.
#### MCPMark in Pip Installation
```bash
# Evaluate ALL tasks
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3 --k K
# Evaluate a single task group (online_resume)
python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume --models o3 --k K
# Evaluate one specific task (task_1 in online_resume)
python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume/task_1 --models o3 --k K
# Evaluate multiple models
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3,gpt-4.1 --k K
```
#### MCPMark in Docker Installation
```bash
# Run all tasks for one service
./run-task.sh --mcp notion --models o3 --exp-name new_exp --tasks all
# Run comprehensive benchmark across all services
./run-benchmark.sh --models o3,gpt-4.1 --exp-name new_exp --docker
```
#### Experiment Auto-Resume
For re-run experiments, only unfinished tasks will be executed. Tasks that previously failed due to pipeline errors (such as State Duplication Error or MCP Network Error) will also be retried automatically.
### Results
The experiment results are written to `./results/` (JSON + CSV).
#### Reult Aggregation (for K > 1)
MCP supports aggreated metrics of pass@1, pass@K, pass^K, avg@K.
```bash
python -m src.aggregators.aggregate_results --exp-name new_exp
```
### Model Support
MCPMark supports the following models with according providers (model codes in the brackets).
#### OpenAI
- GPT-5 (gpt-5)
- o3 (o3)
#### Anthropic
- Claude-4.1-Opus (claude-4.1-opus)
- Claude-4-Sonnet (claude-4-sonnet)
#### Google
- Gemini-2.5-Pro (gemini-2.5-pro)
#### Grok
- Grok-4 (grok-4)
#### Deepseek
- DeepSeek-Chat (deepseek-chat)
#### Alibaba
- Qwen3-Coder (qwen-3-coder)
#### Kimi
- Kimi-K2 (k2)
### Want to contribute?
Visit [Contributing Page](./contributing) to learn how to make contribution to MCPMark.