| # Installation and Docker Task Usage Guideline |
|
|
| ## Overview |
|
|
| The MCPMark setup supports installation through either pip or MCPMark Docker (recommended) after cloning the code repository. |
|
|
| ### Pip Installtion |
| ```bash |
| pip install -e . |
| ``` |
|
|
| The MCPMark Docker setup provides a simple way to run evaluation tasks in isolated containers. PostgreSQL is automatically handled when needed. |
|
|
| ## 1. Quick Start |
|
|
| ### 1.1 Docker Image |
|
|
| The official Docker image is automatically pulled from Docker Hub on first use. |
| The image is hosted at: https://hub.docker.com/r/evalsysorg/mcpmark |
|
|
| **Image Management:** |
| - The scripts automatically download the image when it's not found locally |
| - To manually update to the latest version: |
| ```bash |
| docker pull evalsysorg/mcpmark:latest |
| ``` |
| - For local development/testing, you can build your own docker: |
| ```bash |
| # Creates evalsysorg/mcpmark:latest locally |
| ./build-docker.sh |
| ``` |
|
|
| ## 2. Running MCP Experiments |
|
|
| ### 2.1 Running Individual MCP Experiment |
|
|
| The `run-task.sh` script provides simplified Docker usage: |
|
|
| ```bash |
| # Run filesystem tasks (filesystem is the default mcp service) |
| ./run-task.sh --models MODEL_NAME --k K |
| |
| # Run github/notion/postgres/playwright/playwright_webarena with specific task |
| ./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K |
| ``` |
|
|
| where *MODEL_NAME* refers to the model choice from the supported models (see [Introduction Page](./introduction.md) for more information), *EXPNAME* refers to customized experiment name, *TASK* refers to specific task or task group (see `tasks/<mcp>/<task_suite>/...` for more information), *K* refers to the time of independent experiments. |
|
|
|
|
| Additionally, the `run-benchmark.sh` script evaluates models across all MCP services: |
|
|
| ```bash |
| # Run all services with Docker (recommended) |
| ./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker |
| |
| # Run specific services |
| ./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES --docker |
| |
| # Run with parallel execution for faster results |
| ./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker --parallel |
| |
| # Run locally without Docker |
| ./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES |
| ``` |
|
|
| Here *MCPSERVICES* refers to group of MCP services, separated by comma (e.g. *filesystem,postgres*) |
|
|
| The benchmark script: |
| - Runs all or selected MCP services automatically |
| - Supports progress tracking and timing |
| - Generates summary reports and logs |
| - Supports parallel service execution |
| - Continues running even if some services fail |
| - Automatically generates performance dashboards |
|
|
| ### Manual Docker Commands |
|
|
| #### For Non-Postgres Services |
| Suppose Notion is the service: |
| ```bash |
| # Build the image first |
| ./build-docker.sh |
| |
| # Run a task |
| docker run --rm \ |
| -v $(pwd)/results:/app/results \ |
| -v $(pwd)/.mcp_env:/app/.mcp_env:ro \ |
| -v $(pwd)/notion_state.json:/app/notion_state.json:ro \ |
| evalsysorg/mcpmark:latest \ |
| python3 -m pipeline --mcp notion --models MODEL --exp-name EXPNAME --tasks TASK --k K |
| ``` |
|
|
| #### For Postgres Service |
| ```bash |
| # The run-task.sh script handles postgres automatically, but if doing manually: |
| |
| # Start postgres container |
| docker run -d \ |
| --name mcp-postgres \ |
| --network mcp-network \ |
| -e POSTGRES_DATABASE=postgres \ |
| -e POSTGRES_USER=postgres \ |
| -e POSTGRES_PASSWORD=123456 \ |
| ghcr.io/cloudnative-pg/postgresql:17-bookworm |
| |
| # Run postgres task |
| docker run --rm \ |
| --network mcp-network \ |
| -e POSTGRES_HOST=mcp-postgres \ |
| -v $(pwd)/results:/app/results \ |
| -v $(pwd)/.mcp_env:/app/.mcp_env:ro \ |
| evalsysorg/mcpmark:latest \ |
| python3 -m pipeline --mcp postgres --models MODEL --exp-name EXPNAME --tasks TASK --k K |
| |
| # Stop and remove postgres when done |
| docker stop mcp-postgres && docker rm mcp-postgres |
| ``` |
|
|
| ## Script Usage |
|
|
| ### Benchmark Runner (`run-benchmark.sh`) |
|
|
| ``` |
| ./run-benchmark.sh --models MODELS --exp-name NAME [OPTIONS] |
| |
| Required Options: |
| --models MODELS Comma-separated list of models to evaluate |
| --exp-name NAME Experiment name for organizing results |
| |
| Optional Options: |
| --docker Run tasks in Docker containers (recommended) |
| --mcps SERVICES Comma-separated list of services to test |
| Default: filesystem,notion,github,postgres,playwright |
| --parallel Run services in parallel (experimental) |
| --timeout SECONDS Timeout per task in seconds (default: 300) |
| ``` |
|
|
| ### Individual Task Runner (`run-task.sh`) |
|
|
| ``` |
| ./run-task.sh [--mcp SERVICE] [PIPELINE_ARGS] |
| |
| Options: |
| --mcp SERVICE MCP service (notion|github|filesystem|playwright|postgres) |
| Default: filesystem |
| |
| Environment Variables: |
| DOCKER_MEMORY_LIMIT Memory limit for container (default: 4g) |
| DOCKER_CPU_LIMIT CPU limit for container (default: 2) |
| DOCKER_IMAGE_VERSION Docker image tag to use (default: latest) |
| |
| All other arguments are passed directly to the pipeline command. |
| |
| Pipeline arguments (see python3 -m pipeline --help): |
| --mcp {notion,github,filesystem,playwright,postgres,playwright_webarena} |
| MCP service to use (default: filesystem) |
| --models MODELS Comma-separated list of models to evaluate (e.g., 'o3,k2,gpt-4.1') |
| --tasks TASKS Tasks to run: "all", a category name, or "category/task_name" |
| --exp-name EXP_NAME Experiment name; results are saved under results/<exp-name>/ (default: YYYY-MM-DD-HH-MM-SS) |
| --k K Number of evaluation runs for pass@k metrics (default: 1) |
| --timeout TIMEOUT Timeout in seconds for each task |
| --output-dir OUTPUT_DIR |
| Directory to save results |
| ``` |
|
|
| ## Docker Benefits |
|
|
| 1. **Efficiency**: Only starts necessary containers |
| 2. **Isolation**: Each task runs in a fresh container |
| 3. **Resource Management**: Automatic cleanup of containers and networks |
| 4. **Smart Dependencies**: PostgreSQL only starts for postgres service |
| 5. **Parallel Support**: Can run multiple services simultaneously for faster benchmarks |
| 6. **Comprehensive Testing**: Benchmark script runs all services with one command |
| 7. **Progress Tracking**: Colored output with timing and status information |
| 8. **Automatic Reporting**: Generates summary reports and performance dashboards |
|
|
| ## Common Troubleshooting |
|
|
| ### Permission Issues |
| ```bash |
| chmod +x run-task.sh |
| ``` |
|
|
| ### Docker Build Issues |
| ```bash |
| # Force rebuild with no cache |
| ./run-task.sh --build --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK |
| ``` |
|
|
| ### PostgreSQL Connection Issues |
| ```bash |
| # Check if postgres is running |
| docker ps | grep postgres |
| |
| # View postgres logs |
| docker logs mcp-postgres-task |
| ``` |
|
|
| ### Cleanup Stuck Resources |
| ```bash |
| # Stop all containers |
| docker stop $(docker ps -q) |
| |
| # Remove task network |
| docker network rm mcp-task-network |
| |
| # Remove postgres data volume (careful!) |
| docker volume rm mcp-postgres-data |
| ``` |
|
|
| ## Environment Variables |
|
|
| Create `.mcp_env` file with your credentials: |
| ```env |
| # Service credentials |
| SOURCE_NOTION_API_KEY=your-key |
| EVAL_NOTION_API_KEY=your-key |
| GITHUB_TOKEN=your-token |
| POSTGRES_PASSWORD=your-password |
| |
| # Model API keys |
| OPENAI_API_KEY=your-key |
| ANTHROPIC_API_KEY=your-key |
| # ... etc |
| ``` |
|
|
| Please refer to [Quick Start](./quickstart.md) for setting up API key for specific model. |
|
|
| ## Docker Compose Files |
|
|
| - `docker-compose.yml` - Full stack with postgres (for development/testing) |
|
|
| ## Notes |
|
|
| - Results are saved under `./results/<exp-name>/`. |
| - Each task runs in an ephemeral container. |
| - Docker image is shared across all tasks. |
| - PostgreSQL data persists in Docker volume. |
|
|