	# Development Testing Scripts

	Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.

	## 🚀 Quick Start

	### Prerequisites

	1. Install dependencies:
	```bash
	pip install wandb uv
	```

	2. Setup WandB (first time only):
	```bash
	wandb login
	```

	### Option 1: Quick Test (3 generations)

	```bash
	# Terminal 1: Start eval service
	bash scripts/dev/start_eval_server.sh

	# Terminal 2: Run quick test
	bash scripts/dev/2_test_quick.sh
	```

	### Option 2: Full Experiment (50 generations)

	```bash
	# Terminal 1: Start eval service
	bash scripts/dev/start_eval_server.sh

	# Terminal 2: Run full experiment
	bash scripts/dev/3_test_full.sh
	```

	### Option 3: Ablation Study (No Eval Service)

	```bash
	# No need for eval service
	bash scripts/dev/3b_ablation_no_eval_service.sh
	```

	---

	## 📜 Script Reference

	### Core Script: `run_experiment.py`

	Universal Python script that runs experiments with configurable parameters.

	Key Features:
	- Single universal script for all experiments
	- Command-line argument parsing
	- WandB integration
	- Eval service integration
	- Automatic result directory creation
	- Error handling and validation

	Usage:
	```bash
	python scripts/dev/run_experiment.py \
	    --experiment-name "my_experiment" \
	    --num-generations 50 \
	    --use-wandb \
	    --use-eval-service
	```
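
To illustrate how the flags in this usage line could be parsed, here is a minimal `argparse` sketch. It is not the actual parser in `run_experiment.py`: only the four flag names come from the command above, while the default of 50 generations, the help strings, and the validation rule are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the four flags shown in the usage line."""
    parser = argparse.ArgumentParser(description="Run an evolution experiment")
    parser.add_argument("--experiment-name", required=True,
                        help="Prefix for the results directory")
    parser.add_argument("--num-generations", type=int, default=50,
                        help="Total generations to evolve (default is an assumption)")
    # store_true flags default to False, so omitting one disables the feature.
    parser.add_argument("--use-wandb", action="store_true")
    parser.add_argument("--use-eval-service", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (or sys.argv when None) and reject non-positive generation counts."""
    args = build_parser().parse_args(argv)
    if args.num_generations < 1:
        raise SystemExit("--num-generations must be at least 1")
    return args
```

Because both switches are `store_true` flags, a wrapper can disable a feature by expanding an empty variable in its place, which matches how `USE_WANDB=""` is used in the customization examples.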

	### Bash Wrappers

	Bash scripts that configure hyperparameters and call `run_experiment.py`.

	---

	### 1. `start_eval_server.sh` (Recommended)
	Purpose: Start the Eval Service with command-line configuration

	Configuration Variables:
	```bash
	RESULTS_DIR="/tmp/eval_service"
	PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
	TRIGGER_MODE="periodic"
	TRIGGER_INTERVAL=5
	PORT=8765
	```
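
As a rough mental model of `TRIGGER_MODE="periodic"` with `TRIGGER_INTERVAL=5`, the sketch below lists the generations at which a periodic trigger would fire. It assumes generations are counted from 1 and that the agent fires on every multiple of the interval; the service's real rule may differ.

```python
def trigger_generations(num_generations: int, interval: int) -> list[int]:
    """Generations (counted from 1) at which a periodic trigger would fire.

    Assumption: the trigger fires whenever the generation number is a
    multiple of the interval.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [g for g in range(1, num_generations + 1) if g % interval == 0]
```

Under these assumptions, a 3-generation run never reaches the first trigger, and a 50-generation run fires 10 times at interval 5 or 5 times at interval 10.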

	Usage:
	```bash
	bash scripts/dev/start_eval_server.sh
	```

	Customization:
	Edit the script directly to change parameters. All settings are at the top.

	Why this approach:
	- ✅ All config visible in one file
	- ✅ Easy to create variants (copy & edit)
	- ✅ No need to switch between files

	### 1b. `start_eval_server_config5.sh` (Alternative)
	Purpose: Start the Eval Service from a YAML config file (agent triggers every 5 generations)

	Uses: `eval_agent/ev2_service_config.yaml`

	When to use:
	- When you have a standard config you reuse
	- When you want to version control configs separately

	### 1c. `start_eval_server_config10.sh` (Alternative)
	Purpose: Start the Eval Service from a YAML config file (agent triggers every 10 generations)

	Note: This script currently only illustrates how an override would look; the YAML config does not support environment-variable overrides yet.

	---

	### 2. `2_test_quick.sh`
	Purpose: Quick 3-generation test

	Configuration Variables:
	```bash
	EXPERIMENT_NAME="quick_test"
	NUM_GENERATIONS=3
	MAX_PARALLEL_JOBS=2
	META_INTERVAL=2

	LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
	LLM_SELECTION="ucb1"
	LLM_TEMPERATURES="0.5 0.7 1.0"

	USE_EVAL_SERVICE="--use-eval-service"
	USE_WANDB="--use-wandb"
	WANDB_PROJECT="shinkaevolve-dev"
	WANDB_TAGS="quick-test eval-service"
	```

	Expected time: ~2-5 minutes

	---

	### 3. `3_test_full.sh`
	Purpose: Full 50-generation experiment

	Configuration Variables:
	```bash
	EXPERIMENT_NAME="full_50gen"
	NUM_GENERATIONS=50
	MAX_PARALLEL_JOBS=4
	META_INTERVAL=10

	LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
	LLM_SELECTION="ucb1"
	LLM_TEMPERATURES="0.5 0.7 1.0"

	USE_WANDB="--use-wandb"
	WANDB_PROJECT="shinkaevolve-experiments"
	WANDB_TAGS="full-experiment eval-service circle-packing"
	```

	Expected time: ~30-60 minutes

	---

	### 3b. `3b_ablation_no_eval_service.sh`
	Purpose: Baseline experiment without eval service

	Key Difference: `USE_EVAL_SERVICE` is NOT set

	Use Case: Ablation study to compare performance with/without eval service

	---

	### 4. `4_check_results.sh`
	Purpose: Analyze experiment results

	Shows:
	- Best program score and validation
	- Eval agent memory contents
	- Auxiliary metrics statistics
	- Database statistics
	- WandB run links

	Usage:
	```bash
	# Check most recent results
	bash scripts/dev/4_check_results.sh

	# Check specific directory
	bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000
	```
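
One way to pick the "most recent" results directory is to parse the `YYYYMMDD_HHMMSS` suffix that run directories carry (as in `full_50gen_20240203_120000`). The sketch below does that; how `4_check_results.sh` actually selects the directory is not documented here (it may use modification time instead), and the function names are hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def run_timestamp(run_dir: Path) -> Optional[datetime]:
    """Parse the trailing YYYYMMDD_HHMMSS from a run directory name."""
    parts = run_dir.name.rsplit("_", 2)
    if len(parts) != 3:
        return None
    try:
        return datetime.strptime(f"{parts[1]}_{parts[2]}", "%Y%m%d_%H%M%S")
    except ValueError:
        return None


def latest_run(results_root: Path) -> Optional[Path]:
    """Return the run directory with the newest timestamp suffix, if any."""
    dated = [(run_timestamp(p), p) for p in results_root.iterdir() if p.is_dir()]
    dated = [(ts, p) for ts, p in dated if ts is not None]
    return max(dated, key=lambda item: item[0])[1] if dated else None
```

Sorting by the parsed timestamp rather than by name keeps the choice correct when experiment names differ, since `quick_test_...` and `full_50gen_...` would otherwise sort by prefix.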

	---

	### 5. `5_cleanup.sh`
	Purpose: Clean up temporary test files

	Usage:
	```bash
	bash scripts/dev/5_cleanup.sh
	```

	---

	## 🎛️ Hyperparameter Guide

	### Common Parameters

	| Parameter | Quick Test | Full Test | Description |
	|-----------|------------|-----------|-------------|
	| `NUM_GENERATIONS` | 3 | 50 | Total generations to evolve |
	| `MAX_PARALLEL_JOBS` | 2 | 4 | Concurrent evaluation jobs |
	| `META_INTERVAL` | 2 | 10 | Meta-summarizer frequency |
	| `LLM_MODELS` | gemini-2.5-flash/pro | gemini-2.5-flash/pro | LLM models to use |
	| `LLM_SELECTION` | ucb1 | ucb1 | Dynamic LLM selection strategy |
	| `LLM_TEMPERATURES` | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |
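
For readers unfamiliar with the `ucb1` selection strategy, the following is a sketch of the textbook UCB1 bandit rule applied to choosing among models. How ShinkaEvolve defines the reward for a model is not stated in this guide, so the reward bookkeeping (a running sum of rewards in [0, 1]) and the function name are assumptions.

```python
import math


def ucb1_select(counts: dict[str, int], rewards: dict[str, float]) -> str:
    """Pick the model with the highest UCB1 score.

    counts[m]  = how many times model m has been used
    rewards[m] = sum of rewards (assumed to lie in [0, 1]) earned by model m
    """
    # Try every model once before comparing confidence bounds.
    for model, n in counts.items():
        if n == 0:
            return model
    total = sum(counts.values())

    def score(model: str) -> float:
        mean = rewards[model] / counts[model]
        bonus = math.sqrt(2.0 * math.log(total) / counts[model])
        return mean + bonus

    return max(counts, key=score)
```

The bonus term shrinks as a model is used more often, so a rarely used model such as `native-gemini-2.5-pro` still gets sampled occasionally even when `native-gemini-2.5-flash` has the higher average reward.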

	### Eval Service Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `USE_EVAL_SERVICE` | enabled | Enable eval service |
	| `EVAL_SERVICE_URL` | http://localhost:8765 | Service URL |
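
Before launching a run with the eval service enabled, a client can probe the service URL. The sketch below queries the `/api/v1/status` endpoint used elsewhere in this guide with the standard library only; the shape of the response (a JSON object) and the function name are assumptions.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(base_url: str = "http://localhost:8765",
                 timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed status payload, or None if the service is unreachable."""
    url = base_url.rstrip("/") + "/api/v1/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None
```

Returning `None` instead of raising lets a caller print a hint such as "start `start_eval_server.sh` first" rather than a stack trace.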

	### WandB Parameters

	| Parameter | Quick Test | Full Test | Description |
	|-----------|------------|-----------|-------------|
	| `USE_WANDB` | enabled | enabled | Enable WandB logging |
	| `WANDB_PROJECT` | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
	| `WANDB_TAGS` | quick-test eval-service | full-experiment eval-service circle-packing | Space-separated tags |
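
The space-separated `WANDB_TAGS` string has to become a list before it reaches WandB. The sketch below shows one way to build the keyword arguments for `wandb.init` (which accepts `project`, `name`, `tags`, and `config`); the helper names and the idea of passing settings this way are assumptions, not the code in `run_experiment.py`.

```python
def wandb_init_kwargs(project: str, tags: str, run_name: str, config: dict) -> dict:
    """Turn wrapper-style settings into keyword arguments for wandb.init."""
    return {
        "project": project,
        "name": run_name,
        "tags": tags.split(),  # "quick-test eval-service" -> two tags
        "config": dict(config),
    }


def start_run(use_wandb: bool, **settings):
    """Start a WandB run only when logging is enabled; otherwise return None."""
    if not use_wandb:
        return None
    import wandb  # imported lazily so runs without WandB need no install

    return wandb.init(**wandb_init_kwargs(**settings))
```

Importing `wandb` inside `start_run` means that a run with logging disabled works even on a machine where the package is not installed.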

	---

	## 🔧 Customization Examples

	### Example 1: Change Models

	```bash
	# In 3_test_full.sh
	LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
	LLM_SELECTION="thompson" # or "ucb1", "epsilon_greedy"
	LLM_TEMPERATURES="0.5 0.7"
	```

	### Example 2: Change WandB Project

	```bash
	# In 2_test_quick.sh
	WANDB_PROJECT="my-research-project"
	WANDB_TAGS="my-tag another-tag"
	```

	### Example 3: Change Agent Trigger Frequency

	```yaml
	# In eval_agent/ev2_service_config.yaml
	strategy:
	  trigger_interval: 10 # Change from 5 to 10
	```

	### Example 4: Run More Generations

	```bash
	# In 3_test_full.sh
	NUM_GENERATIONS=100
	```

	### Example 5: Disable WandB

	```bash
	# In 2_test_quick.sh
	USE_WANDB="" # Comment out or set empty
	```

	---

	## 📊 WandB Integration

	### What Gets Logged

	1. Metrics:
	   - Combined score per generation
	   - Best score over time
	   - Correct/incorrect programs
	   - Auxiliary metrics (if eval service enabled)

	2. System Info:
	   - Hyperparameters
	   - Model configuration
	   - Eval service status

	3. Artifacts:
	   - Best program code
	   - Evolution database
	   - Agent-generated metrics

	### Viewing Results

	After running an experiment:
	```bash
	# Get WandB URL from terminal output
	# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT
	```

	### Comparing Runs

	WandB automatically tracks all runs in the same project, allowing easy comparison:
	- Baseline vs. Eval Service
	- Different hyperparameters
	- Different models

	---

	## 🔍 Verification Checklist

	After running an experiment, check:

	- [ ] Eval Service Running (for eval service experiments)
	```bash
	curl http://localhost:8765/api/v1/status | jq
	```

	- [ ] Experiment Completed
	```bash
	bash scripts/dev/4_check_results.sh
	```

	- [ ] Best Program Valid
	```bash
	cat RESULTS_DIR/best/results/correct.json
	# Should show "correct": true
	```

	- [ ] Auxiliary Metrics Present (for eval service experiments)
	```bash
	cat RESULTS_DIR/gen_20/results/metrics.json | jq '.auxiliary'
	# Should show metrics after agent triggers
	```

	- [ ] WandB Run Logged
	- Check WandB dashboard
	- Verify metrics are being logged

	- [ ] Agent Documentation Generated (for eval service experiments)
	```bash
	cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50
	```
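
The file-based checks in this list can also be scripted. The sketch below reads the same paths the checklist inspects by hand (`best/results/correct.json`, `gen_N/results/metrics.json`, `eval_agent_memory/EVAL_AGENTS.md`) and reports each check as a boolean. The function names are hypothetical, and the JSON keys `correct` and `auxiliary` are taken from the comments and `jq` filter above.

```python
import json
from pathlib import Path


def load_json(path: Path):
    """Return parsed JSON, or None when the file is missing or invalid."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def verify_run(results_dir: Path, generation: int) -> dict:
    """Check best-program validity and auxiliary metrics for one generation."""
    correct = load_json(results_dir / "best" / "results" / "correct.json")
    metrics = load_json(results_dir / f"gen_{generation}" / "results" / "metrics.json")
    return {
        "best_program_valid": isinstance(correct, dict) and correct.get("correct") is True,
        "auxiliary_metrics_present": isinstance(metrics, dict) and bool(metrics.get("auxiliary")),
        "agent_docs_present": (results_dir / "eval_agent_memory" / "EVAL_AGENTS.md").is_file(),
    }
```

Pick a `generation` that falls after an agent trigger (the checklist uses `gen_20`), since auxiliary metrics are only expected once the agent has run.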

	---

	## 🐛 Troubleshooting

	### Error: "Eval service not running"
	Solution:
	```bash
	bash scripts/dev/start_eval_server.sh
	```

	### Error: "wandb not found"
	Solution:
	```bash
	pip install wandb
	wandb login
	```

	### Error: "Port 8765 already in use"
	Solution:
	```bash
	lsof -ti:8765 | xargs kill -9
	```
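
Before killing whatever holds the port, it can help to confirm that something is actually listening. This is a small standard-library sketch of such a check; the function name is hypothetical and it only tests whether a TCP connection to the port succeeds, not which process owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8765)` returning `True` before the eval service is started suggests a stale server from an earlier session.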

	### WandB not logging
	Solution:
	```bash
	# Re-login to WandB
	wandb login

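	# Check whether USE_WANDB is set in the bash script
	# (the variable lives inside the script, so `echo` in your shell will not show it)
	grep -n 'USE_WANDB=' scripts/dev/2_test_quick.sh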
	```

	---

	## 📁 Results Structure

	```
	examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
	├── evolution_db.sqlite         # Evolution database
	├── evolution_run.log           # Detailed logs
	├── experiment_config.yaml      # Configuration backup
	├── gen_0/
	│   ├── main.py                 # Generated code
	│   └── results/
	│       ├── metrics.json        # All metrics (primary + auxiliary)
	│       └── correct.json        # Validation status
	├── gen_1/ ... gen_N/
	├── best/                       # Best program (symlink)
	│   ├── main.py
	│   └── results/
	│       └── metrics.json
	└── eval_agent_memory/          # Agent workspace (if eval service used)
	    ├── EVAL_AGENTS.md          # Metric documentation
	    ├── auxiliary_metrics.py    # Generated metrics code
	    └── service_state.json      # Service state
```
	```
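
Given this layout, a per-generation score curve can be rebuilt from the `gen_*/results/metrics.json` files. The sketch below walks the generation directories in numeric order and collects one metric; the key name `combined_score` is an assumption based on the "Combined score per generation" item logged to WandB, and the function names are hypothetical.

```python
import json
import re
from pathlib import Path


def generation_dirs(run_dir: Path) -> list[tuple[int, Path]]:
    """Return (generation, path) pairs sorted numerically, so gen_10 follows gen_9."""
    found = []
    for child in run_dir.iterdir():
        match = re.fullmatch(r"gen_(\d+)", child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    return sorted(found)


def score_series(run_dir: Path, key: str = "combined_score") -> dict[int, float]:
    """Map generation -> metric value, skipping generations without that metric."""
    series = {}
    for gen, path in generation_dirs(run_dir):
        metrics_file = path / "results" / "metrics.json"
        if not metrics_file.is_file():
            continue
        metrics = json.loads(metrics_file.read_text(encoding="utf-8"))
        if key in metrics:
            series[gen] = metrics[key]
    return series
```

Sorting on the parsed integer avoids the lexicographic trap where `gen_10` would come before `gen_2`.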

	---

	## 💡 Best Practices

	1. Always start with the quick test to validate your setup
	2. Use WandB tags for easy filtering of experiments
	3. Run ablations to demonstrate eval service impact
	4. Check eval_agent_memory to see what metrics were generated
	5. Compare WandB runs side-by-side for insights
	6. Save important results before cleanup

	---

	## 🎯 Experiment Workflow

	1. Start eval service (Terminal 1)
	```bash
	bash scripts/dev/start_eval_server.sh
	```

	2. Run quick test to validate (Terminal 2)
	```bash
	bash scripts/dev/2_test_quick.sh
	```

	3. Check results
	```bash
	bash scripts/dev/4_check_results.sh
	```

	4. Run full experiment if test passes
	```bash
	bash scripts/dev/3_test_full.sh
	```

	5. Compare with baseline (ablation)
	```bash
	bash scripts/dev/3b_ablation_no_eval_service.sh
	```

	6. Analyze on WandB
	- Compare runs
	- Export plots
	- Share results

	---

	## 📈 Expected Results

	### Quick Test (3 generations)
	- ✅ Completes in ~2-5 minutes
	- ✅ WandB run created
	- ✅ Metrics logged per generation
	- ⚠️ Agent likely not triggered (3 generations is below the trigger interval: 5 in `start_eval_server.sh`, 10 in `start_eval_server_config10.sh`)

	### Full Test (50 generations)
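	- ✅ Completes in ~30-60 minutes
	- ✅ Agent triggers periodically: 5 times with a trigger interval of 10 (gen 10, 20, 30, 40, 50); with the default `TRIGGER_INTERVAL=5`, expect a trigger every 5 generations instead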
	- ✅ Auxiliary metrics appear in later generations
	- ✅ Metric descriptions in EVAL_AGENTS.md
	- ✅ Complete WandB run with all metrics

	---

	Need help? Check the main documentation or run with `--help`:
	```bash
	python scripts/dev/run_experiment.py --help
	```

	---

	## Frontier-CS Algorithmic Experiments

	Parallel scripts for running Frontier-CS competitive programming problems with evolution.

	### Scripts

	| Script | Description |
	|--------|-------------|
	| `run_frontier_cs_parallel_vanilla_server.sh` | Vanilla baseline via eval service (agent never triggers) |
	| `run_frontier_cs_parallel_with_agent.sh` | With eval agent (triggers every 5 generations) |
	| `run_frontier_cs.sh` | Single problem, manual run |

	### Usage

	```bash
	# Vanilla baseline - all 172 problems, 20 parallel
	bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh

	# Vanilla baseline - first 50 problems, 20 parallel
	bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50

	# Vanilla baseline - first 50 problems, 10 parallel
	bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10

	# With eval agent - all problems
	bash scripts/dev/run_frontier_cs_parallel_with_agent.sh

	# With eval agent - first 50 problems, 10 parallel
	bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10
	```
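
The two positional arguments (number of problems, then parallelism) describe a bounded worker pool. The sketch below expresses that idea with `concurrent.futures`; it is not how the bash scripts are implemented, `run_one` is a hypothetical callable standing in for a single-problem evolution run, and the defaults of 172 problems and 20 parallel jobs are taken from the comments above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def run_problems(run_one: Callable[[int], float],
                 num_problems: int = 172,
                 max_parallel: int = 20,
                 limit: Optional[int] = None) -> dict[int, float]:
    """Run problems 0..N-1 with at most max_parallel in flight at once."""
    count = num_problems if limit is None else min(limit, num_problems)
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(run_one, pid): pid for pid in range(count)}
        for future in as_completed(futures):
            # future.result() re-raises any exception from run_one.
            results[futures[future]] = future.result()
    return results
```

With this shape, `run_problems(run_one, limit=50, max_parallel=10)` mirrors the `50 10` invocation: the first 50 problems, 10 at a time.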

	### Comparing Results

	```bash
	# Compare two runs (new layout)
	python tasks/frontier_cs_entry/compare_experiments.py \
	    results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
	    results/frontier_cs_algorithmic/agent_g50_20260327_130000

	# Sort by score difference
	python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff

	# Export to CSV
	python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv
	```
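
To make the comparison concrete, here is a sketch of a per-problem score diff with CSV export. It does not reproduce `compare_experiments.py`: the row fields, the descending sort order for the difference, and the function names are all assumptions, and the per-problem scores are taken as already-loaded dictionaries.

```python
import csv
import io


def compare_scores(scores_a: dict[str, float], scores_b: dict[str, float]) -> list[dict]:
    """Rows for problems present in both runs, largest score difference first."""
    rows = [
        {"problem": pid, "a": scores_a[pid], "b": scores_b[pid],
         "diff": scores_b[pid] - scores_a[pid]}
        for pid in scores_a.keys() & scores_b.keys()
    ]
    # Ties on diff are broken by problem id so the order is deterministic.
    return sorted(rows, key=lambda row: (-row["diff"], row["problem"]))


def to_csv(rows: list[dict]) -> str:
    """Serialize comparison rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["problem", "a", "b", "diff"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Restricting the rows to problems present in both runs keeps a partially finished run from producing misleading differences.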

	### Results Directory Structure

	```
	results/frontier_cs_algorithmic/
	  vanilla_g50_20260327_120000/    # one run
	    p0/                           # per-problem results
	      evolution_db.sqlite
	      gen_0/ gen_1/ ... gen_49/
	    p1/
	    ...
	  agent_g50_20260327_130000/      # another run
	    p0/
	    p1/
	    ...
	```

	### Prerequisites

	- Docker running with go-judge service on port 8081
	- `tasks/Frontier-CS/` checked out with problems and solutions