	# Installation and Docker Task Usage Guideline

	## Overview

	After cloning the repository, install MCPMark either with pip or with the MCPMark Docker image (recommended).

	### Pip Installation
	```bash
	pip install -e .
	```

	The MCPMark Docker setup runs evaluation tasks in isolated containers. PostgreSQL is started automatically when a task needs it.

	## 1. Quick Start

	### 1.1 Docker Image

	The official Docker image is automatically pulled from Docker Hub on first use.
	The image is hosted at: https://hub.docker.com/r/evalsysorg/mcpmark

	Image Management:
	- The scripts automatically download the image when it's not found locally
	- To manually update to the latest version:
	```bash
	docker pull evalsysorg/mcpmark:latest
	```
	- For local development/testing, you can build the image yourself:
	```bash
	# Creates evalsysorg/mcpmark:latest locally
	./build-docker.sh
	```

	## 2. Running MCP Experiments

	### 2.1 Running Individual MCP Experiment

	The `run-task.sh` script provides simplified Docker usage:

	```bash
	# Run filesystem tasks (filesystem is the default mcp service)
	./run-task.sh --models MODEL_NAME --k K

	# Run github/notion/postgres/playwright/playwright_webarena with a specific task
	./run-task.sh --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK --k K
	```

	Here:
	- `MODEL_NAME` is one of the supported models (see the [Introduction Page](./introduction.md) for more information).
	- `EXPNAME` is a custom experiment name.
	- `TASK` is a specific task or task group (see `tasks/<mcp>/<task_suite>/...` for more information).
	- `K` is the number of independent runs.
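To make the argument order concrete, here is a minimal sketch that prints a `run-task.sh` command for review instead of executing it. The helper name `build_run_task_cmd` and the example values (`demo-run`, `4`) are hypothetical placeholders, not part of MCPMark.

```bash
# Hypothetical helper: print the run-task.sh command line for review.
# It does not execute anything, and values containing spaces are not handled.
build_run_task_cmd() {
  # $1=MCP service, $2=models, $3=experiment name, $4=tasks, $5=K
  printf './run-task.sh --mcp %s --models %s --exp-name %s --tasks %s --k %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values: 4 independent runs of every postgres task.
build_run_task_cmd postgres gpt-4.1 demo-run all 4
```

Printing first is a cheap way to check the flags before starting a long run; copy the printed line into the terminal once it looks right.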


	Additionally, the `run-benchmark.sh` script evaluates models across all MCP services:

	```bash
	# Run all services with Docker (recommended)
	./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker

	# Run specific services
	./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES --docker

	# Run with parallel execution for faster results
	./run-benchmark.sh --models MODEL --exp-name EXPNAME --docker --parallel

	# Run locally without Docker
	./run-benchmark.sh --models MODEL --exp-name EXPNAME --mcps MCPSERVICES
	```

	Here `MCPSERVICES` is a comma-separated list of MCP services (e.g. `filesystem,postgres`).
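A typo in the service list is easy to miss, so it can help to validate the list before a long run. The sketch below is an assumption about how such a check could look, not what `run-benchmark.sh` does internally; `KNOWN_MCPS` and `split_mcps` are hypothetical names, and the service list is copied from this guide.

```bash
# Services named in this guide; the scripts themselves are authoritative.
KNOWN_MCPS="filesystem notion github postgres playwright playwright_webarena"

# Hypothetical helper: print one service per line, fail on an unknown name.
# The body runs in a subshell, so the IFS and set -f changes do not leak out.
split_mcps() (
  set -f          # no filename globbing while splitting
  IFS=,
  set -- $1       # split the argument on commas
  for svc in "$@"; do
    case " $KNOWN_MCPS " in
      *" $svc "*) printf '%s\n' "$svc" ;;
      *) printf 'unknown MCP service: %s\n' "$svc" >&2; exit 1 ;;
    esac
  done
)

split_mcps filesystem,postgres
```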

	The benchmark script:
	- Runs all or selected MCP services automatically
	- Supports progress tracking and timing
	- Generates summary reports and logs
	- Supports parallel service execution
	- Continues running even if some services fail
	- Automatically generates performance dashboards
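The "continues running even if some services fail" behavior can be pictured as a loop that records failures instead of aborting. This is an illustrative sketch only, not the script's actual implementation; `run_service` is a hypothetical stand-in that pretends the `github` run fails.

```bash
# Hypothetical stand-in for one service run. In real use this would call
# something like ./run-task.sh --mcp "$1" ...; here github always "fails".
run_service() {
  [ "$1" != "github" ]
}

# Run every service, remember the failures, and keep going.
run_all_services() {
  failed=""
  for svc in "$@"; do
    if run_service "$svc"; then
      printf 'OK   %s\n' "$svc"
    else
      printf 'FAIL %s\n' "$svc"
      failed="$failed $svc"
    fi
  done
  # Return non-zero if anything failed, so callers can still detect it.
  [ -z "$failed" ]
}

run_all_services filesystem github postgres || printf 'failed:%s\n' "$failed"
```

The loop still reaches `postgres` after the simulated `github` failure, and the final status reflects that something went wrong.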

	### Manual Docker Commands

	#### For Non-Postgres Services
	For example, with Notion as the service:
	```bash
	# Build the image first
	./build-docker.sh

	# Run a task
	docker run --rm \
	-v $(pwd)/results:/app/results \
	-v $(pwd)/.mcp_env:/app/.mcp_env:ro \
	-v $(pwd)/notion_state.json:/app/notion_state.json:ro \
	evalsysorg/mcpmark:latest \
	python3 -m pipeline --mcp notion --models MODEL --exp-name EXPNAME --tasks TASK --k K
	```

	#### For Postgres Service
	```bash
	# The run-task.sh script handles postgres automatically, but if doing manually:

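	# Create the network shared by both containers (skip if it already exists)
	docker network create mcp-network

	# Start postgres container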
	docker run -d \
	--name mcp-postgres \
	--network mcp-network \
	-e POSTGRES_DATABASE=postgres \
	-e POSTGRES_USER=postgres \
	-e POSTGRES_PASSWORD=123456 \
	ghcr.io/cloudnative-pg/postgresql:17-bookworm

	# Run postgres task
	docker run --rm \
	--network mcp-network \
	-e POSTGRES_HOST=mcp-postgres \
	-v $(pwd)/results:/app/results \
	-v $(pwd)/.mcp_env:/app/.mcp_env:ro \
	evalsysorg/mcpmark:latest \
	python3 -m pipeline --mcp postgres --models MODEL --exp-name EXPNAME --tasks TASK --k K

	# Stop and remove postgres when done
	docker stop mcp-postgres && docker rm mcp-postgres
	```

	## Script Usage

	### Benchmark Runner (`run-benchmark.sh`)

	```
	./run-benchmark.sh --models MODELS --exp-name NAME [OPTIONS]

	Required Options:
	    --models MODELS      Comma-separated list of models to evaluate
	    --exp-name NAME      Experiment name for organizing results

	Optional Options:
	    --docker             Run tasks in Docker containers (recommended)
	    --mcps SERVICES      Comma-separated list of services to test
	                         Default: filesystem,notion,github,postgres,playwright
	    --parallel           Run services in parallel (experimental)
	    --timeout SECONDS    Timeout per task in seconds (default: 300)
	```

	### Individual Task Runner (`run-task.sh`)

	```
	./run-task.sh [--mcp SERVICE] [PIPELINE_ARGS]

	Options:
	    --mcp SERVICE        MCP service (notion|github|filesystem|playwright|postgres)
	                         Default: filesystem

	Environment Variables:
	    DOCKER_MEMORY_LIMIT  Memory limit for container (default: 4g)
	    DOCKER_CPU_LIMIT     CPU limit for container (default: 2)
	    DOCKER_IMAGE_VERSION Docker image tag to use (default: latest)

	All other arguments are passed directly to the pipeline command.

	Pipeline arguments (see python3 -m pipeline --help):
	    --mcp {notion,github,filesystem,playwright,postgres,playwright_webarena}
	                         MCP service to use (default: filesystem)
	    --models MODELS      Comma-separated list of models to evaluate (e.g., 'o3,k2,gpt-4.1')
	    --tasks TASKS        Tasks to run: "all", a category name, or "category/task_name"
	    --exp-name EXP_NAME  Experiment name; results are saved under results/<exp-name>/ (default: YYYY-MM-DD-HH-MM-SS)
	    --k K                Number of evaluation runs for pass@k metrics (default: 1)
	    --timeout TIMEOUT    Timeout in seconds for each task
	    --output-dir OUTPUT_DIR
	                         Directory to save results
	```
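The environment variables above are read with defaults when unset. The sketch below shows one way that fallback could work. The variable names and defaults come from the usage text; the mapping to `docker run --memory` and `--cpus`, and the helper names `docker_resource_flags` and `image_ref`, are assumptions for illustration.

```bash
# Hypothetical helpers: resolve the documented defaults.
# Assumption: the limits end up as docker run --memory / --cpus flags.
docker_resource_flags() {
  printf '%s\n' "--memory ${DOCKER_MEMORY_LIMIT:-4g} --cpus ${DOCKER_CPU_LIMIT:-2}"
}

image_ref() {
  printf 'evalsysorg/mcpmark:%s\n' "${DOCKER_IMAGE_VERSION:-latest}"
}

docker_resource_flags
image_ref
```

In practice the variables are set on the command line in front of the script, for example `DOCKER_MEMORY_LIMIT=8g DOCKER_CPU_LIMIT=4 ./run-task.sh --mcp notion ...`.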

	## Docker Benefits

	1. Efficiency: Only starts necessary containers
	2. Isolation: Each task runs in a fresh container
	3. Resource Management: Automatic cleanup of containers and networks
	4. Smart Dependencies: PostgreSQL only starts for postgres service
	5. Parallel Support: Can run multiple services simultaneously for faster benchmarks
	6. Comprehensive Testing: Benchmark script runs all services with one command
	7. Progress Tracking: Colored output with timing and status information
	8. Automatic Reporting: Generates summary reports and performance dashboards

	## Common Troubleshooting

	### Permission Issues
	```bash
	chmod +x run-task.sh
	```

	### Docker Build Issues
	```bash
	# Force rebuild with no cache
	./run-task.sh --build --mcp MCPSERVICE --models MODEL_NAME --exp-name EXPNAME --tasks TASK
	```

	### PostgreSQL Connection Issues
	```bash
	# Check if postgres is running
	docker ps | grep postgres

	# View postgres logs
	docker logs mcp-postgres-task
	```
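Connection errors often mean the database container is up but not yet accepting connections. A minimal sketch of a readiness wait follows, assuming `pg_isready` is available inside the postgres image and that the user is `postgres`; the helper name `wait_for_postgres` is hypothetical, and the container name is taken from the log command above.

```bash
# Hypothetical helper: poll until postgres accepts connections.
# Usage: wait_for_postgres CONTAINER [TRIES]   (one try per second)
wait_for_postgres() {
  container="$1"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # Assumption: pg_isready exists in the image and the user is "postgres".
    if docker exec "$container" pg_isready -U postgres >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Example (not executed here):
#   wait_for_postgres mcp-postgres-task 30 || docker logs mcp-postgres-task
```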

	### Cleanup Stuck Resources
	```bash
	# Stop all running containers on this host, including unrelated ones (careful!)
	docker stop $(docker ps -q)

	# Remove task network
	docker network rm mcp-task-network

	# Remove postgres data volume (careful!)
	docker volume rm mcp-postgres-data
	```

	## Environment Variables

	Create a `.mcp_env` file with your credentials:
	```env
	# Service credentials
	SOURCE_NOTION_API_KEY=your-key
	EVAL_NOTION_API_KEY=your-key
	GITHUB_TOKEN=your-token
	POSTGRES_PASSWORD=your-password

	# Model API keys
	OPENAI_API_KEY=your-key
	ANTHROPIC_API_KEY=your-key
	# ... etc
	```

	See [Quick Start](./quickstart.md) for how to set up the API key for a specific model.
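A missing or empty key usually surfaces only after a task has started, so a quick pre-flight check can save time. The sketch below assumes plain `KEY=value` lines as in the example above; `check_env_file` is a hypothetical helper, not part of MCPMark.

```bash
# Hypothetical helper: report keys that are missing or empty in an env file.
# Usage: check_env_file FILE KEY...
check_env_file() {
  file="$1"
  shift
  missing=0
  for key in "$@"; do
    # Accept only lines of the form KEY=<non-empty value>.
    if ! grep -Eq "^${key}=.+" "$file"; then
      printf 'missing: %s\n' "$key"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example (not executed here):
#   check_env_file .mcp_env GITHUB_TOKEN OPENAI_API_KEY
```

Which keys to check depends on the service and model in use; only the keys passed as arguments are verified.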

	## Docker Compose Files

	- `docker-compose.yml` - Full stack with postgres (for development/testing)

	## Notes

	- Results are saved under `./results/<exp-name>/`.
	- Each task runs in an ephemeral container.
	- The Docker image is shared across all tasks.
	- PostgreSQL data persists in a Docker volume.