# Installation and Docker Task Usage Guideline
## Overview
MCPMark can be installed either with pip or through the MCPMark Docker image (recommended). Both options require cloning the code repository first.
### Pip Installation
```bash
pip install -e .
```
The MCPMark Docker setup provides a simple way to run evaluation tasks in isolated containers. PostgreSQL is automatically handled when needed.
## 1. Quick Start
### 1.1 Docker Image
The official Docker image is automatically pulled from Docker Hub on first use.
The image is hosted at: https://hub.docker.com/r/evalsysorg/mcpmark
**Image Management:**
- The scripts automatically download the image when it's not found locally
- To manually update to the latest version:
```bash
docker pull evalsysorg/mcpmark:latest
```
- For local development/testing, you can build the image yourself:
```bash
# Creates evalsysorg/mcpmark:latest locally
./build-docker.sh
```
## 2. Running MCP Experiments
### 2.1 Running Individual MCP Experiment
The `run-task.sh` script provides simplified Docker usage:
```bash
# Run filesystem tasks (filesystem is the default MCP service)
./run-task.sh --models MODEL_NAME --k K
# Run github/notion/postgres/playwright/playwright_webarena with specific task
./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K
```
where *MODEL_NAME* is one of the supported models (see the [Introduction Page](./introduction.md) for more information), *EXPNAME* is a custom experiment name, *TASK* is a specific task or task group (see `tasks/<mcp>/<task_suite>/...` for more information), and *K* is the number of independent runs.
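To see how these placeholders fit together, the sketch below assembles a `run-task.sh` command line from shell variables and prints it instead of running it. The helper name `build_task_cmd` and the experiment name `demo-run` are hypothetical; `gpt-4.1` and the `all` task selector are taken from the pipeline help text further down, and may not match the models or tasks available in your checkout.

```bash
#!/usr/bin/env bash
# Sketch: assemble a run-task.sh invocation from variables.
# build_task_cmd is a hypothetical helper, not part of MCPMark.

build_task_cmd() {
  local mcp="$1" model="$2" exp_name="$3" tasks="$4" k="$5"
  # Print the command only; remove 'echo' to actually run it.
  echo ./run-task.sh --mcp "$mcp" --models "$model" \
    --exp-name "$exp_name" --tasks "$tasks" --k "$k"
}

# Placeholder values for illustration.
build_task_cmd postgres gpt-4.1 demo-run all 4
```

Printing the command first is a cheap way to check the arguments before starting a long run.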
Additionally, the `run-benchmark.sh` script evaluates models across all MCP services:
```bash
# Run all services with Docker (recommended)
./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker
# Run specific services
./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES --docker
# Run with parallel execution for faster results
./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker --parallel
# Run locally without Docker
./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES
```
Here *MCPSERVICES* is a comma-separated list of MCP services (e.g. *filesystem,postgres*).
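Since a typo in the service list may only surface partway through a benchmark, it can help to validate the value beforehand. The following is a minimal sketch, assuming the list contains no spaces. `validate_mcps` and `KNOWN_SERVICES` are hypothetical names, and the service names are copied from this guide rather than read from the scripts, so `run-benchmark.sh` itself may accept a different set.

```bash
#!/usr/bin/env bash
# Sketch: validate a comma-separated service list before a long run.
# KNOWN_SERVICES mirrors the names mentioned in this guide (an assumption).
KNOWN_SERVICES="filesystem notion github postgres playwright playwright_webarena"

validate_mcps() {
  local list="$1" svc
  local -a services
  # Split on commas into an array.
  IFS=',' read -r -a services <<< "$list"
  for svc in "${services[@]}"; do
    case " $KNOWN_SERVICES " in
      *" $svc "*) ;;  # known service, keep going
      *) echo "unknown service: $svc" >&2; return 1 ;;
    esac
  done
  echo "${#services[@]} service(s) ok"
}

validate_mcps "filesystem,postgres"
```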
The benchmark script:
- Runs all or selected MCP services automatically
- Supports progress tracking and timing
- Generates summary reports and logs
- Supports parallel service execution
- Continues running even if some services fail
- Automatically generates performance dashboards
### Manual Docker Commands
#### For Non-Postgres Services
Suppose Notion is the service:
```bash
# Build the image first
./build-docker.sh
# Run a task
docker run --rm \
-v $(pwd)/results:/app/results \
-v $(pwd)/.mcp_env:/app/.mcp_env:ro \
-v $(pwd)/notion_state.json:/app/notion_state.json:ro \
evalsysorg/mcpmark:latest \
python3 -m pipeline --mcp notion --models MODEL --exp-name EXPNAME --tasks TASK --k K
```
#### For Postgres Service
```bash
# The run-task.sh script handles postgres automatically, but if doing manually:
# Create the shared network (skip if it already exists)
docker network create mcp-network
# Start postgres container
docker run -d \
--name mcp-postgres \
--network mcp-network \
-e POSTGRES_DATABASE=postgres \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=123456 \
ghcr.io/cloudnative-pg/postgresql:17-bookworm
# Run postgres task
docker run --rm \
--network mcp-network \
-e POSTGRES_HOST=mcp-postgres \
-v $(pwd)/results:/app/results \
-v $(pwd)/.mcp_env:/app/.mcp_env:ro \
evalsysorg/mcpmark:latest \
python3 -m pipeline --mcp postgres --models MODEL --exp-name EXPNAME --tasks TASK --k K
# Stop and remove postgres when done
docker stop mcp-postgres && docker rm mcp-postgres
```
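When starting postgres manually, the task container can be launched before the database accepts connections. One way to guard against this is a small retry loop, sketched below. `wait_for` is a hypothetical helper, not part of MCPMark; the commented usage line assumes the `mcp-postgres` container name from the commands above and relies on `pg_isready`, which ships with PostgreSQL.

```bash
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempts run out.
# wait_for is a hypothetical helper, not part of MCPMark.
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Run the remaining arguments as a command, discarding its output.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example (requires the mcp-postgres container to be running):
# wait_for 30 1 docker exec mcp-postgres pg_isready -U postgres
```

The function returns a non-zero status if the command never succeeds, so it can gate the task run with `&&`.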
## Script Usage
### Benchmark Runner (`run-benchmark.sh`)
```
./run-benchmark.sh --models MODELS --exp-name NAME [OPTIONS]
Required Options:
--models MODELS Comma-separated list of models to evaluate
--exp-name NAME Experiment name for organizing results
Optional Options:
--docker Run tasks in Docker containers (recommended)
--mcps SERVICES Comma-separated list of services to test
Default: filesystem,notion,github,postgres,playwright
--parallel Run services in parallel (experimental)
--timeout SECONDS Timeout per task in seconds (default: 300)
```
### Individual Task Runner (`run-task.sh`)
```
./run-task.sh [--mcp SERVICE] [PIPELINE_ARGS]
Options:
--mcp SERVICE MCP service (notion|github|filesystem|playwright|postgres)
Default: filesystem
Environment Variables:
DOCKER_MEMORY_LIMIT Memory limit for container (default: 4g)
DOCKER_CPU_LIMIT CPU limit for container (default: 2)
DOCKER_IMAGE_VERSION Docker image tag to use (default: latest)
All other arguments are passed directly to the pipeline command.
Pipeline arguments (see python3 -m pipeline --help):
--mcp {notion,github,filesystem,playwright,postgres,playwright_webarena}
MCP service to use (default: filesystem)
--models MODELS Comma-separated list of models to evaluate (e.g., 'o3,k2,gpt-4.1')
--tasks TASKS Tasks to run: "all", a category name, or "category/task_name"
--exp-name EXP_NAME Experiment name; results are saved under results/<exp-name>/ (default: YYYY-MM-DD-HH-MM-SS)
--k K Number of evaluation runs for pass@k metrics (default: 1)
--timeout TIMEOUT Timeout in seconds for each task
--output-dir OUTPUT_DIR
Directory to save results
```
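The environment variables above can be set for a single invocation by prefixing the command. The sketch below shows how the documented defaults could be resolved with shell parameter expansion. `resolve_limits` is a hypothetical helper, and the mapping to the `docker run` flags `--memory` and `--cpus` is an assumption about what `run-task.sh` does internally, not something confirmed by this guide.

```bash
#!/usr/bin/env bash
# Sketch: resolve the documented defaults when a variable is unset.
# The --memory/--cpus mapping is an assumption about run-task.sh.
resolve_limits() {
  local mem="${DOCKER_MEMORY_LIMIT:-4g}"
  local cpu="${DOCKER_CPU_LIMIT:-2}"
  local tag="${DOCKER_IMAGE_VERSION:-latest}"
  echo "--memory $mem --cpus $cpu evalsysorg/mcpmark:$tag"
}

# Uses whatever is set in the current environment, else the defaults.
resolve_limits

# Per-invocation overrides; the same prefix form works with ./run-task.sh.
DOCKER_MEMORY_LIMIT=8g DOCKER_CPU_LIMIT=4 resolve_limits
```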
## Docker Benefits
1. **Efficiency**: Only starts necessary containers
2. **Isolation**: Each task runs in a fresh container
3. **Resource Management**: Automatic cleanup of containers and networks
4. **Smart Dependencies**: PostgreSQL only starts for postgres service
5. **Parallel Support**: Can run multiple services simultaneously for faster benchmarks
6. **Comprehensive Testing**: Benchmark script runs all services with one command
7. **Progress Tracking**: Colored output with timing and status information
8. **Automatic Reporting**: Generates summary reports and performance dashboards
## Common Troubleshooting
### Permission Issues
```bash
chmod +x run-task.sh
```
### Docker Build Issues
```bash
# Force rebuild with no cache
./run-task.sh --build --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK
```
### PostgreSQL Connection Issues
```bash
# Check if postgres is running
docker ps | grep postgres
# View postgres logs (use mcp-postgres instead if you started the container manually)
docker logs mcp-postgres-task
```
### Cleanup Stuck Resources
```bash
# Stop all running containers (this affects every container on the host, not only MCPMark ones)
docker stop $(docker ps -q)
# Remove task network
docker network rm mcp-task-network
# Remove postgres data volume (careful!)
docker volume rm mcp-postgres-data
```
## Environment Variables
Create a `.mcp_env` file with your credentials:
```env
# Service credentials
SOURCE_NOTION_API_KEY=your-key
EVAL_NOTION_API_KEY=your-key
GITHUB_TOKEN=your-token
POSTGRES_PASSWORD=your-password
# Model API keys
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
# ... etc
```
Please refer to [Quick Start](./quickstart.md) for how to set up the API key for a specific model.
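A missing credential usually shows up only once a task is already running, so a quick check of the `.mcp_env` file can save time. The sketch below is one possible approach. `check_env_keys` is a hypothetical helper, and which keys are actually required depends on the services and models you run.

```bash
#!/usr/bin/env bash
# Sketch: report which expected keys are missing from an env file.
# check_env_keys is a hypothetical helper, not part of MCPMark.
check_env_keys() {
  local file="$1" key missing=0
  shift
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 2; }
  for key in "$@"; do
    # Match "KEY=" at the start of a line; commented-out keys do not count.
    if ! grep -q "^${key}=" "$file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Example (key names taken from the template above):
# check_env_keys .mcp_env GITHUB_TOKEN OPENAI_API_KEY
```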
## Docker Compose Files
- `docker-compose.yml` - Full stack with postgres (for development/testing)
## Notes
- Results are saved under `./results/<exp-name>/`.
- Each task runs in an ephemeral container.
- The Docker image is shared across all tasks.
- PostgreSQL data persists in a Docker volume.