| # Development Testing Scripts |
|
|
| Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging. |
|
|
| ## Quick Start |
|
|
| ### Prerequisites |
|
|
| 1. **Install dependencies:** |
| ```bash |
| pip install wandb uv |
| ``` |
|
|
| 2. **Setup WandB (first time only):** |
| ```bash |
| wandb login |
| ``` |
|
|
| ### Option 1: Quick Test (3 generations) |
|
|
| ```bash |
| # Terminal 1: Start eval service |
| bash scripts/dev/start_eval_server.sh |
| |
| # Terminal 2: Run quick test |
| bash scripts/dev/2_test_quick.sh |
| ``` |
|
|
| ### Option 2: Full Experiment (50 generations) |
|
|
| ```bash |
| # Terminal 1: Start eval service |
| bash scripts/dev/start_eval_server.sh |
| |
| # Terminal 2: Run full experiment |
| bash scripts/dev/3_test_full.sh |
| ``` |
|
|
| ### Option 3: Ablation Study (No Eval Service) |
|
|
| ```bash |
| # No need for eval service |
| bash scripts/dev/3b_ablation_no_eval_service.sh |
| ``` |
|
|
| --- |
|
|
| ## Script Reference |
|
|
| ### Core Script: `run_experiment.py` |
| |
| Universal Python script that runs experiments with configurable parameters. |
| |
| **Key Features:** |
| - Single universal script for all experiments |
| - Command-line argument parsing |
| - WandB integration |
| - Eval service integration |
| - Automatic result directory creation |
| - Error handling and validation |
| |
| **Usage:** |
| ```bash |
| python scripts/dev/run_experiment.py \ |
| --experiment-name "my_experiment" \ |
| --num-generations 50 \ |
| --use-wandb \ |
| --use-eval-service |
| ``` |
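
The flags in the usage example could be parsed along these lines. This is a minimal sketch, not the actual contents of `run_experiment.py`: only the four flags shown above come from this README, while the defaults and the `build_parser` helper name are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags shown in the usage example."""
    parser = argparse.ArgumentParser(description="Run an evolution experiment.")
    # Defaults are assumptions for illustration, not the script's real defaults.
    parser.add_argument("--experiment-name", default="experiment")
    parser.add_argument("--num-generations", type=int, default=3)
    # Boolean switches: present means enabled, absent means disabled.
    parser.add_argument("--use-wandb", action="store_true")
    parser.add_argument("--use-eval-service", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--experiment-name", "my_experiment", "--num-generations", "50", "--use-wandb"]
)
print(args.experiment_name, args.num_generations, args.use_wandb, args.use_eval_service)
# prints: my_experiment 50 True False
```

Modelling `--use-wandb` and `--use-eval-service` as `store_true` switches matches how the wrappers below enable or disable them by including or omitting the flag.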
| |
| ### Bash Wrappers |
|
|
| Bash scripts that configure hyperparameters and call `run_experiment.py`. |
|
|
| --- |
|
|
| ### 1. `start_eval_server.sh` (Recommended) |
| **Purpose:** Start the Eval Service with command-line configuration |
|
|
| **Configuration Variables:** |
| ```bash |
| RESULTS_DIR="/tmp/eval_service" |
| PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py" |
| TRIGGER_MODE="periodic" |
| TRIGGER_INTERVAL=5 |
| PORT=8765 |
| ``` |
|
|
| **Usage:** |
| ```bash |
| bash scripts/dev/start_eval_server.sh |
| ``` |
|
|
| **Customization:** |
| Edit the script directly to change parameters. All settings are at the top. |
|
|
| **Why this approach:** |
| - ✅ All config visible in one file |
| - ✅ Easy to create variants (copy & edit) |
| - ✅ No need to switch between files |
|
|
| ### 1b. `start_eval_server_config5.sh` (Alternative) |
| **Purpose:** Start the Eval Service from a YAML config file (agent triggers every 5 generations) |
| |
| **Uses:** `eval_agent/ev2_service_config.yaml` |
|
|
| **When to use:** |
| - When you have a standard config you reuse |
| - When you want to version control configs separately |
|
|
| ### 1c. `start_eval_server_config10.sh` (Alternative) |
| **Purpose:** Start the Eval Service from a YAML config file (agent triggers every 10 generations) |
| |
| **Note:** The script currently only shows how an override would be passed; the YAML config does not yet support environment-variable overrides. |
| |
| --- |
| |
| ### 2. `2_test_quick.sh` |
| **Purpose:** Quick 3-generation test |
| |
| **Configuration Variables:** |
| ```bash |
| EXPERIMENT_NAME="quick_test" |
| NUM_GENERATIONS=3 |
| MAX_PARALLEL_JOBS=2 |
| META_INTERVAL=2 |
| |
| LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro" |
| LLM_SELECTION="ucb1" |
| LLM_TEMPERATURES="0.5 0.7 1.0" |
|
|
| USE_EVAL_SERVICE="--use-eval-service" |
| USE_WANDB="--use-wandb" |
| WANDB_PROJECT="shinkaevolve-dev" |
| WANDB_TAGS="quick-test eval-service" |
| ``` |
| |
| **Expected time:** ~2-5 minutes |
| |
| --- |
| |
| ### 3. `3_test_full.sh` |
| **Purpose:** Full 50-generation experiment |
| |
| **Configuration Variables:** |
| ```bash |
| EXPERIMENT_NAME="full_50gen" |
| NUM_GENERATIONS=50 |
| MAX_PARALLEL_JOBS=4 |
| META_INTERVAL=10 |
| |
| LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro" |
| LLM_SELECTION="ucb1" |
| LLM_TEMPERATURES="0.5 0.7 1.0" |
|
|
| USE_WANDB="--use-wandb" |
| WANDB_PROJECT="shinkaevolve-experiments" |
| WANDB_TAGS="full-experiment eval-service circle-packing" |
| ``` |
| |
| **Expected time:** ~30-60 minutes |
| |
| --- |
| |
| ### 3b. `3b_ablation_no_eval_service.sh` |
| **Purpose:** Baseline experiment without eval service |
| |
| **Key Difference:** `USE_EVAL_SERVICE` is NOT set |
| |
| **Use Case:** Ablation study to compare performance with/without eval service |
| |
| --- |
| |
| ### 4. `4_check_results.sh` |
| **Purpose:** Analyze experiment results |
| |
| **Shows:** |
| - Best program score and validation |
| - Eval agent memory contents |
| - Auxiliary metrics statistics |
| - Database statistics |
| - WandB run links |
| |
| **Usage:** |
| ```bash |
| # Check most recent results |
| bash scripts/dev/4_check_results.sh |
| |
| # Check specific directory |
| bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000 |
| ``` |
| |
| --- |
| |
| ### 5. `5_cleanup.sh` |
| **Purpose:** Clean up temporary test files |
|
|
| **Usage:** |
| ```bash |
| bash scripts/dev/5_cleanup.sh |
| ``` |
|
|
| --- |
|
|
| ## Hyperparameter Guide |
|
|
| ### Common Parameters |
|
|
| | Parameter | Quick Test | Full Test | Description | |
| |-----------|-----------|-----------|-------------| |
| | `NUM_GENERATIONS` | 3 | 50 | Total generations to evolve | |
| | `MAX_PARALLEL_JOBS` | 2 | 4 | Concurrent evaluation jobs | |
| | `META_INTERVAL` | 2 | 10 | Meta-summarizer frequency | |
| | `LLM_MODELS` | native-gemini-2.5-flash native-gemini-2.5-pro | native-gemini-2.5-flash native-gemini-2.5-pro | Space-separated LLM models to use | |
| | `LLM_SELECTION` | ucb1 | ucb1 | Dynamic LLM selection strategy | |
| | `LLM_TEMPERATURES` | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures | |
|
|
| ### Eval Service Parameters |
|
|
| | Parameter | Default | Description | |
| |-----------|---------|-------------| |
| | `USE_EVAL_SERVICE` | enabled | Enable eval service | |
| | `EVAL_SERVICE_URL` | http://localhost:8765 | Service URL | |
|
|
| ### WandB Parameters |
|
|
| | Parameter | Quick Test | Full Test | Description | |
| |-----------|-----------|-----------|-------------| |
| | `USE_WANDB` | enabled | enabled | Enable WandB logging | |
| | `WANDB_PROJECT` | shinkaevolve-dev | shinkaevolve-experiments | WandB project | |
| | `WANDB_TAGS` | quick-test eval-service | full-experiment eval-service circle-packing | Space-separated tags | |
|
|
| --- |
|
|
| ## Customization Examples |
|
|
| ### Example 1: Change Models |
|
|
| ```bash |
| # In 3_test_full.sh |
| LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022" |
| LLM_SELECTION="thompson" # or "ucb1", "epsilon_greedy" |
| LLM_TEMPERATURES="0.5 0.7" |
| ``` |
|
|
| ### Example 2: Change WandB Project |
|
|
| ```bash |
| # In 2_test_quick.sh |
| WANDB_PROJECT="my-research-project" |
| WANDB_TAGS="my-tag another-tag" |
| ``` |
|
|
| ### Example 3: Change Agent Trigger Frequency |
|
|
| ```yaml |
| # In eval_agent/ev2_service_config.yaml |
| strategy: |
| trigger_interval: 10 # Change from 5 to 10 |
| ``` |
|
|
| ### Example 4: Run More Generations |
|
|
| ```bash |
| # In 3_test_full.sh |
| NUM_GENERATIONS=100 |
| ``` |
|
|
| ### Example 5: Disable WandB |
|
|
| ```bash |
| # In 2_test_quick.sh |
| USE_WANDB="" # Comment out or set empty |
| ``` |
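
Setting the variable to an empty string presumably works because the wrapper expands it as part of the `run_experiment.py` command line, so an empty value adds nothing. The Python sketch below illustrates that idea. `build_command` is a hypothetical helper, and the assumption that the wrappers pass `USE_WANDB` and `USE_EVAL_SERVICE` straight through is not confirmed by this README.

```python
import shlex


def build_command(experiment_name, num_generations, use_wandb="", use_eval_service=""):
    """Assemble an argv list for run_experiment.py from wrapper-style settings."""
    cmd = [
        "python", "scripts/dev/run_experiment.py",
        "--experiment-name", experiment_name,
        "--num-generations", str(num_generations),
    ]
    # Each flag variable holds either the flag itself or an empty string;
    # shlex.split("") returns [], so an empty value contributes nothing.
    for flag in (use_wandb, use_eval_service):
        cmd.extend(shlex.split(flag))
    return cmd


with_wandb = build_command("quick_test", 3, use_wandb="--use-wandb")
without_wandb = build_command("quick_test", 3, use_wandb="")
print(shlex.join(with_wandb))
print(shlex.join(without_wandb))
```

The second command differs from the first only by the missing `--use-wandb` flag, which is the effect Example 5 relies on.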
|
|
| --- |
|
|
| ## WandB Integration |
|
|
| ### What Gets Logged |
|
|
| 1. **Metrics:** |
| - Combined score per generation |
| - Best score over time |
| - Correct/incorrect programs |
| - Auxiliary metrics (if eval service enabled) |
|
|
| 2. **System Info:** |
| - Hyperparameters |
| - Model configuration |
| - Eval service status |
|
|
| 3. **Artifacts:** |
| - Best program code |
| - Evolution database |
| - Agent-generated metrics |
|
|
| ### Viewing Results |
|
|
| After running an experiment: |
| ```bash |
| # Get WandB URL from terminal output |
| # Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT |
| ``` |
|
|
| ### Comparing Runs |
|
|
| WandB automatically tracks all runs in the same project, allowing easy comparison: |
| - Baseline vs. Eval Service |
| - Different hyperparameters |
| - Different models |
|
|
| --- |
|
|
| ## Verification Checklist |
|
|
| After running an experiment, check: |
|
|
| - [ ] **Eval Service Running** (for eval service experiments) |
| ```bash |
| curl http://localhost:8765/api/v1/status | jq |
| ``` |
|
|
| - [ ] **Experiment Completed** |
| ```bash |
| bash scripts/dev/4_check_results.sh |
| ``` |
|
|
| - [ ] **Best Program Valid** |
| ```bash |
| cat RESULTS_DIR/best/results/correct.json |
| # Should show "correct": true |
| ``` |
|
|
| - [ ] **Auxiliary Metrics Present** (for eval service experiments) |
| ```bash |
| cat RESULTS_DIR/gen_20/results/metrics.json | jq '.auxiliary' |
| # Should show metrics after agent triggers |
| ``` |
|
|
| - [ ] **WandB Run Logged** |
| - Check WandB dashboard |
| - Verify metrics are being logged |
|
|
| - [ ] **Agent Documentation Generated** (for eval service experiments) |
| ```bash |
| cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50 |
| ``` |
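
The file-based items in this checklist can also be checked in one pass. The sketch below is an illustration, not part of the shipped scripts: `check_results` is a hypothetical helper, and it assumes that `correct.json` and `metrics.json` each contain a top-level JSON object with the `correct` and `auxiliary` keys shown in the commands above.

```python
import json
from pathlib import Path


def check_results(results_dir, generation=20):
    """Return a pass/fail map for the file-based checklist items."""
    root = Path(results_dir)
    checks = {}

    # Best program valid: correct.json should contain "correct": true.
    correct_file = root / "best" / "results" / "correct.json"
    checks["best_program_valid"] = (
        correct_file.is_file()
        and json.loads(correct_file.read_text()).get("correct") is True
    )

    # Auxiliary metrics present: metrics.json should have a non-empty "auxiliary" entry.
    metrics_file = root / f"gen_{generation}" / "results" / "metrics.json"
    checks["auxiliary_metrics_present"] = (
        metrics_file.is_file()
        and bool(json.loads(metrics_file.read_text()).get("auxiliary"))
    )

    # Agent documentation generated: EVAL_AGENTS.md exists and is not empty.
    doc_file = root / "eval_agent_memory" / "EVAL_AGENTS.md"
    checks["agent_docs_generated"] = doc_file.is_file() and doc_file.stat().st_size > 0
    return checks
```

Calling `check_results("path/to/results_dir")` returns a dictionary such as `{"best_program_valid": True, ...}`; the service status and WandB items still need the manual checks above.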
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### Error: "Eval service not running" |
| **Solution:** |
| ```bash |
| bash scripts/dev/start_eval_server.sh |
| ``` |
|
|
| ### Error: "wandb not found" |
| **Solution:** |
| ```bash |
| pip install wandb |
| wandb login |
| ``` |
|
|
| ### Error: "Port 8765 already in use" |
| **Solution:** |
| ```bash |
| lsof -ti:8765 | xargs kill -9 |
| ``` |
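
Before killing anything, it can help to confirm that something is actually listening on the port. The snippet below is a small sketch using only the Python standard library. `port_in_use` is a hypothetical helper name, and it only detects TCP listeners reachable on the given host.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


print("port 8765 in use:", port_in_use(8765))
```

If this reports `False`, the "already in use" error likely comes from a different port or host than the one being checked.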
|
|
| ### WandB not logging |
| **Solution:** |
| ```bash |
| # Re-login to WandB |
| wandb login |
| |
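| # Check that USE_WANDB is set in the wrapper script |
| # (the variable lives inside the script, so `echo` in your own shell shows nothing) |
| grep -n 'USE_WANDB' scripts/dev/2_test_quick.sh |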
| ``` |
|
|
| --- |
|
|
| ## Results Structure |
|
|
| ``` |
| examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/ |
| ├── evolution_db.sqlite          # Evolution database |
| ├── evolution_run.log            # Detailed logs |
| ├── experiment_config.yaml       # Configuration backup |
| ├── gen_0/ |
| │   ├── main.py                  # Generated code |
| │   └── results/ |
| │       ├── metrics.json         # All metrics (primary + auxiliary) |
| │       └── correct.json         # Validation status |
| ├── gen_1/ ... gen_N/ |
| ├── best/                        # Best program (symlink) |
| │   ├── main.py |
| │   └── results/ |
| │       └── metrics.json |
| └── eval_agent_memory/           # Agent workspace (if eval service used) |
|     ├── EVAL_AGENTS.md           # Metric documentation |
|     ├── auxiliary_metrics.py     # Generated metrics code |
|     └── service_state.json       # Service state |
| ``` |
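
When walking this layout from a script, the `gen_N` directories need numeric rather than alphabetical ordering. The sketch below shows one way to do that. `generation_dirs` is a hypothetical helper, and it assumes only the `gen_<number>` naming shown in the tree above.

```python
import re
from pathlib import Path


def generation_dirs(results_dir):
    """Return the gen_N directories under results_dir, ordered by N."""
    pattern = re.compile(r"gen_(\d+)$")
    found = []
    for child in Path(results_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort numerically so gen_10 comes after gen_9, not after gen_1.
    return [path for _, path in sorted(found)]
```

For example, `[p.name for p in generation_dirs(run_dir)]` yields `gen_0`, `gen_1`, and so on in order, skipping `best/` and `eval_agent_memory/`.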
|
|
| --- |
|
|
| ## Best Practices |
|
|
| 1. **Always start with the quick test** to validate your setup |
| 2. **Use WandB tags** for easy filtering of experiments |
| 3. **Run ablations** to demonstrate eval service impact |
| 4. **Check eval_agent_memory** to see what metrics were generated |
| 5. **Compare WandB runs** side-by-side for insights |
| 6. **Save important results** before cleanup |
|
|
| --- |
|
|
| ## Experiment Workflow |
|
|
| 1. **Start eval service** (Terminal 1) |
| ```bash |
| bash scripts/dev/start_eval_server.sh |
| ``` |
|
|
| 2. **Run quick test** to validate (Terminal 2) |
| ```bash |
| bash scripts/dev/2_test_quick.sh |
| ``` |
|
|
| 3. **Check results** |
| ```bash |
| bash scripts/dev/4_check_results.sh |
| ``` |
|
|
| 4. **Run full experiment** if test passes |
| ```bash |
| bash scripts/dev/3_test_full.sh |
| ``` |
|
|
| 5. **Compare with baseline** (ablation) |
| ```bash |
| bash scripts/dev/3b_ablation_no_eval_service.sh |
| ``` |
|
|
| 6. **Analyze on WandB** |
| - Compare runs |
| - Export plots |
| - Share results |
|
|
| --- |
|
|
| ## Expected Results |
|
|
| ### Quick Test (3 generations) |
| - ✅ Completes in ~2-5 minutes |
| - ✅ WandB run created |
| - ✅ Metrics logged per generation |
| - ⚠️ Agent likely not triggered (3 generations is below the trigger interval) |
|
|
| ### Full Test (50 generations) |
| - ✅ Completes in ~30-60 minutes |
| - ✅ Agent triggers 5 times (gen 10, 20, 30, 40, 50) with a trigger interval of 10; the default `TRIGGER_INTERVAL=5` triggers every 5 generations instead |
| - ✅ Auxiliary metrics appear in later generations |
| - ✅ Metric descriptions in EVAL_AGENTS.md |
| - ✅ Complete WandB run with all metrics |
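
The trigger counts above follow from simple arithmetic on the trigger interval. The sketch below assumes a periodic trigger that fires whenever the generation number is a multiple of the interval, counting from 1. The eval service's actual rule may differ, and `trigger_generations` is a hypothetical helper.

```python
def trigger_generations(num_generations, trigger_interval):
    """Generations at which a periodic trigger would fire, counting from 1."""
    return list(range(trigger_interval, num_generations + 1, trigger_interval))


print(trigger_generations(50, 10))  # [10, 20, 30, 40, 50]
print(trigger_generations(50, 5))   # [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
print(trigger_generations(3, 5))    # []
```

Under this assumption, a 3-generation quick test never reaches the first trigger, while a 50-generation run sees 5 triggers at interval 10 or 10 triggers at interval 5.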
| |
| --- |
| |
| **Need help?** Check the main documentation or run with `--help`: |
| ```bash |
| python scripts/dev/run_experiment.py --help |
| ``` |
| |
| --- |
| |
| ## Frontier-CS Algorithmic Experiments |
| |
| Parallel scripts for running Frontier-CS competitive programming problems with evolution. |
| |
| ### Scripts |
| |
| | Script | Description | |
| |--------|-------------| |
| | `run_frontier_cs_parallel_vanilla_server.sh` | Vanilla baseline via eval service (agent never triggers) | |
| | `run_frontier_cs_parallel_with_agent.sh` | With eval agent (triggers every 5 generations) | |
| | `run_frontier_cs.sh` | Single problem, manual run | |
| |
| ### Usage |
| |
| ```bash |
| # Vanilla baseline - all 172 problems, 20 parallel |
| bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh |
| |
| # Vanilla baseline - first 50 problems, 20 parallel |
| bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 |
|
|
| # Vanilla baseline - first 50 problems, 10 parallel |
| bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10 |
| |
| # With eval agent - all problems |
| bash scripts/dev/run_frontier_cs_parallel_with_agent.sh |
|
|
| # With eval agent - first 50 problems, 10 parallel |
| bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10 |
| ``` |
| |
| ### Comparing Results |
| |
| ```bash |
| # Compare two runs (new layout) |
| python tasks/frontier_cs_entry/compare_experiments.py \ |
| results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \ |
| results/frontier_cs_algorithmic/agent_g50_20260327_130000 |
| |
| # Sort by score difference |
| python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff |
| |
| # Export to CSV |
| python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv |
| ``` |
| |
| ### Results Directory Structure |
| |
| ``` |
| results/frontier_cs_algorithmic/ |
| vanilla_g50_20260327_120000/ # one run |
| p0/ # per-problem results |
| evolution_db.sqlite |
| gen_0/ gen_1/ ... gen_49/ |
| p1/ |
| ... |
| agent_g50_20260327_130000/ # another run |
| p0/ |
| p1/ |
| ... |
| ``` |
| |
| ### Prerequisites |
|
|
| - Docker running with go-judge service on port 8081 |
| - `tasks/Frontier-CS/` checked out with problems and solutions |
|
|