# Development Testing Scripts

Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.

## 🚀 Quick Start

### Prerequisites

1. **Install dependencies:**

   ```bash
   pip install wandb uv
   ```

2. **Setup WandB (first time only):**

   ```bash
   wandb login
   ```

### Option 1: Quick Test (3 generations)

```bash
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh

# Terminal 2: Run quick test
bash scripts/dev/2_test_quick.sh
```

### Option 2: Full Experiment (50 generations)

```bash
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh

# Terminal 2: Run full experiment
bash scripts/dev/3_test_full.sh
```

### Option 3: Ablation Study (No Eval Service)

```bash
# No need for eval service
bash scripts/dev/3b_ablation_no_eval_service.sh
```

---

## 📜 Script Reference

### Core Script: `run_experiment.py`

Universal Python script that runs experiments with configurable parameters.

**Key Features:**

- Single universal script for all experiments
- Command-line argument parsing
- WandB integration
- Eval service integration
- Automatic result directory creation
- Error handling and validation

**Usage:**

```bash
python scripts/dev/run_experiment.py \
  --experiment-name "my_experiment" \
  --num-generations 50 \
  --use-wandb \
  --use-eval-service
```

### Bash Wrappers

Bash scripts that configure hyperparameters and call `run_experiment.py`.

---

### 1. `start_eval_server.sh` (Recommended)

**Purpose:** Start the Eval Service with command-line configuration

**Configuration Variables:**

```bash
RESULTS_DIR="/tmp/eval_service"
PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
TRIGGER_MODE="periodic"
TRIGGER_INTERVAL=5
PORT=8765
```

**Usage:**

```bash
bash scripts/dev/start_eval_server.sh
```

**Customization:** Edit the script directly to change parameters. All settings are at the top.

**Why this approach:**

- ✅ All config visible in one file
- ✅ Easy to create variants (copy & edit)
- ✅ No need to switch between files

### 1b. `start_eval_server_config5.sh` (Alternative)

**Purpose:** Start using a YAML config file (trigger every 5 generations)

**Uses:** `eval_agent/ev2_service_config.yaml`

**When to use:**

- When you have a standard config you reuse
- When you want to version-control configs separately

### 1c. `start_eval_server_config10.sh` (Alternative)

**Purpose:** Start using a YAML config file (trigger every 10 generations)

**Note:** The script shows how an override would look, but YAML configs don't support environment-variable overrides yet.

---

### 2. `2_test_quick.sh`

**Purpose:** Quick 3-generation test

**Configuration Variables:**

```bash
EXPERIMENT_NAME="quick_test"
NUM_GENERATIONS=3
MAX_PARALLEL_JOBS=2
META_INTERVAL=2
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_EVAL_SERVICE="--use-eval-service"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-dev"
WANDB_TAGS="quick-test eval-service"
```

**Expected time:** ~2-5 minutes

---

### 3. `3_test_full.sh`

**Purpose:** Full 50-generation experiment

**Configuration Variables:**

```bash
EXPERIMENT_NAME="full_50gen"
NUM_GENERATIONS=50
MAX_PARALLEL_JOBS=4
META_INTERVAL=10
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-experiments"
WANDB_TAGS="full-experiment eval-service circle-packing"
```

**Expected time:** ~30-60 minutes

---

### 3b. `3b_ablation_no_eval_service.sh`

**Purpose:** Baseline experiment without the eval service

**Key Difference:** `USE_EVAL_SERVICE` is NOT set

**Use Case:** Ablation study to compare performance with/without the eval service

---

### 4. `4_check_results.sh`

**Purpose:** Analyze experiment results

**Shows:**

- Best program score and validation
- Eval agent memory contents
- Auxiliary metrics statistics
- Database statistics
- WandB run links

**Usage:**

```bash
# Check most recent results
bash scripts/dev/4_check_results.sh

# Check specific directory
bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000
```

---

### 5. `5_cleanup.sh`

**Purpose:** Clean up temporary test files

**Usage:**

```bash
bash scripts/dev/5_cleanup.sh
```

---

## 🎛️ Hyperparameter Guide

### Common Parameters

| Parameter | Quick Test | Full Test | Description |
|-----------|-----------|-----------|-------------|
| `NUM_GENERATIONS` | 3 | 50 | Total generations to evolve |
| `MAX_PARALLEL_JOBS` | 2 | 4 | Concurrent evaluation jobs |
| `META_INTERVAL` | 2 | 10 | Meta-summarizer frequency |
| `LLM_MODELS` | gemini-2.5-flash/pro | gemini-2.5-flash/pro | LLM models to use |
| `LLM_SELECTION` | ucb1 | ucb1 | Dynamic LLM selection strategy |
| `LLM_TEMPERATURES` | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |

### Eval Service Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `USE_EVAL_SERVICE` | enabled | Enable eval service |
| `EVAL_SERVICE_URL` | http://localhost:8765 | Service URL |

### WandB Parameters

| Parameter | Quick Test | Full Test | Description |
|-----------|-----------|-----------|-------------|
| `USE_WANDB` | enabled | enabled | Enable WandB logging |
| `WANDB_PROJECT` | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
| `WANDB_TAGS` | quick-test | full-experiment eval-service | Space-separated tags |

---

## 🔧 Customization Examples

### Example 1: Change Models

```bash
# In 3_test_full.sh
LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
LLM_SELECTION="thompson"  # or "ucb1", "epsilon_greedy"
LLM_TEMPERATURES="0.5 0.7"
```

### Example 2: Change WandB Project

```bash
# In 2_test_quick.sh
WANDB_PROJECT="my-research-project"
WANDB_TAGS="my-tag another-tag"
```

### Example 3: Change Agent Trigger Frequency

```yaml
# In eval_agent/ev2_service_config.yaml
strategy:
  trigger_interval: 10  # Change from 5 to 10
```

### Example 4: Run More Generations

```bash
# In 3_test_full.sh
NUM_GENERATIONS=100
```

### Example 5: Disable WandB

```bash
# In 2_test_quick.sh
USE_WANDB=""  # Comment out or set empty
```

---

## 📊 WandB Integration

### What Gets Logged

1. **Metrics:**
   - Combined score per generation
   - Best score over time
   - Correct/incorrect programs
   - Auxiliary metrics (if eval service enabled)

2. **System Info:**
   - Hyperparameters
   - Model configuration
   - Eval service status

3. **Artifacts:**
   - Best program code
   - Evolution database
   - Agent-generated metrics

### Viewing Results

After running an experiment:

```bash
# Get WandB URL from terminal output
# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT
```

### Comparing Runs

WandB automatically tracks all runs in the same project, allowing easy comparison:

- Baseline vs. Eval Service
- Different hyperparameters
- Different models

---

## 🔍 Verification Checklist

After running an experiment, check:

- [ ] **Eval Service Running** (for eval service experiments)

  ```bash
  curl http://localhost:8765/api/v1/status | jq
  ```

- [ ] **Experiment Completed**

  ```bash
  bash scripts/dev/4_check_results.sh
  ```

- [ ] **Best Program Valid**

  ```bash
  cat RESULTS_DIR/best/results/correct.json
  # Should show "correct": true
  ```

- [ ] **Auxiliary Metrics Present** (for eval service experiments)

  ```bash
  cat RESULTS_DIR/gen_20/results/metrics.json | jq '.auxiliary'
  # Should show metrics after agent triggers
  ```

- [ ] **WandB Run Logged**
  - Check WandB dashboard
  - Verify metrics are being logged

- [ ] **Agent Documentation Generated** (for eval service experiments)

  ```bash
  cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50
  ```

---

## 🐛 Troubleshooting

### Error: "Eval service not running"

**Solution:**

```bash
bash scripts/dev/start_eval_server.sh
```

### Error: "wandb not found"

**Solution:**

```bash
pip install wandb
wandb login
```

### Error: "Port 8765 already in use"

**Solution:**

```bash
lsof -ti:8765 | xargs kill -9
```

### WandB not logging

**Solution:**

```bash
# Re-login to WandB
wandb login

# Check if USE_WANDB is set in the bash script
echo $USE_WANDB
```

---

## 📁 Results Structure

```
examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
├── evolution_db.sqlite        # Evolution database
├── evolution_run.log          # Detailed logs
├── experiment_config.yaml     # Configuration backup
├── gen_0/
│   ├── main.py                # Generated code
│   └── results/
│       ├── metrics.json       # All metrics (primary + auxiliary)
│       └── correct.json       # Validation status
├── gen_1/ ... gen_N/
├── best/                      # Best program (symlink)
│   ├── main.py
│   └── results/
│       └── metrics.json
└── eval_agent_memory/         # Agent workspace (if eval service used)
    ├── EVAL_AGENTS.md         # Metric documentation
    ├── auxiliary_metrics.py   # Generated metrics code
    └── service_state.json     # Service state
```

---

## 💡 Best Practices

1. **Always start with the quick test** to validate setup
2. **Use WandB tags** for easy filtering of experiments
3. **Run ablations** to demonstrate eval service impact
4. **Check `eval_agent_memory`** to see what metrics were generated
5. **Compare WandB runs** side-by-side for insights
6. **Save important results** before cleanup

---

## 🎯 Experiment Workflow

1. **Start eval service** (Terminal 1)

   ```bash
   bash scripts/dev/start_eval_server.sh
   ```

2. **Run quick test** to validate (Terminal 2)

   ```bash
   bash scripts/dev/2_test_quick.sh
   ```

3. **Check results**

   ```bash
   bash scripts/dev/4_check_results.sh
   ```

4. **Run full experiment** if the test passes

   ```bash
   bash scripts/dev/3_test_full.sh
   ```

5. **Compare with baseline** (ablation)

   ```bash
   bash scripts/dev/3b_ablation_no_eval_service.sh
   ```

6. **Analyze on WandB**
   - Compare runs
   - Export plots
   - Share results

---

## 📈 Expected Results

### Quick Test (3 generations)

- ✅ Completes in ~5 minutes
- ✅ WandB run created
- ✅ Metrics logged per generation
- ⚠️ Agent likely not triggered (needs 10+ generations)

### Full Test (50 generations)

- ✅ Completes in ~1 hour
- ✅ Agent triggers 5 times (gen 10, 20, 30, 40, 50)
- ✅ Auxiliary metrics appear in later generations
- ✅ Metric descriptions in EVAL_AGENTS.md
- ✅ Complete WandB run with all metrics

---

**Need help?** Check the main documentation or run with `--help`:

```bash
python scripts/dev/run_experiment.py --help
```

---

## Frontier-CS Algorithmic Experiments

Parallel scripts for running Frontier-CS competitive programming problems with evolution.
### Scripts

| Script | Description |
|--------|-------------|
| `run_frontier_cs_parallel_vanilla_server.sh` | Vanilla baseline via eval service (agent never triggers) |
| `run_frontier_cs_parallel_with_agent.sh` | With eval agent (triggers every 5 generations) |
| `run_frontier_cs.sh` | Single problem, manual run |

### Usage

```bash
# Vanilla baseline - all 172 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh

# Vanilla baseline - first 50 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50

# Vanilla baseline - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10

# With eval agent - all problems
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh

# With eval agent - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10
```

### Comparing Results

```bash
# Compare two runs (new layout)
python tasks/frontier_cs_entry/compare_experiments.py \
  results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
  results/frontier_cs_algorithmic/agent_g50_20260327_130000

# Sort by score difference
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff

# Export to CSV
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv
```

### Results Directory Structure

```
results/frontier_cs_algorithmic/
  vanilla_g50_20260327_120000/   # one run
    p0/                          # per-problem results
      evolution_db.sqlite
      gen_0/ gen_1/ ... gen_49/
    p1/
    ...
  agent_g50_20260327_130000/     # another run
    p0/
    p1/
    ...
```

### Prerequisites

- Docker running with go-judge service on port 8081
- `tasks/Frontier-CS/` checked out with problems and solutions