Development Testing Scripts
Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.
Quick Start
Prerequisites
Install dependencies:
pip install wandb uv
Setup WandB (first time only):
wandb login
Option 1: Quick Test (3 generations)
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh
# Terminal 2: Run quick test
bash scripts/dev/2_test_quick.sh
Option 2: Full Experiment (50 generations)
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh
# Terminal 2: Run full experiment
bash scripts/dev/3_test_full.sh
Option 3: Ablation Study (No Eval Service)
# No need for eval service
bash scripts/dev/3b_ablation_no_eval_service.sh
Script Reference
Core Script: run_experiment.py
Universal Python script that runs experiments with configurable parameters.
Key Features:
- Single universal script for all experiments
- Command-line argument parsing
- WandB integration
- Eval service integration
- Automatic result directory creation
- Error handling and validation
Usage:
python scripts/dev/run_experiment.py \
--experiment-name "my_experiment" \
--num-generations 50 \
--use-wandb \
--use-eval-service
Bash Wrappers
Bash scripts that configure hyperparameters and call run_experiment.py.
1. start_eval_server.sh (Recommended)
Purpose: Start the Eval Service with command-line configuration
Configuration Variables:
RESULTS_DIR="/tmp/eval_service"
PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
TRIGGER_MODE="periodic"
TRIGGER_INTERVAL=5
PORT=8765
Usage:
bash scripts/dev/start_eval_server.sh
Customization: Edit the script directly to change parameters. All settings are at the top.
Why this approach:
- ✅ All config visible in one file
- ✅ Easy to create variants (copy and edit; see the example below)
- ✅ No need to switch between files
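For example, one way to spin off a variant with a different trigger interval (the variant file name here is hypothetical; `TRIGGER_INTERVAL=5` is the default shown above):

```bash
# Copy the wrapper and change one setting (variant name is hypothetical)
cp scripts/dev/start_eval_server.sh scripts/dev/start_eval_server_every10.sh
sed -i 's/^TRIGGER_INTERVAL=5/TRIGGER_INTERVAL=10/' scripts/dev/start_eval_server_every10.sh
bash scripts/dev/start_eval_server_every10.sh
```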
1b. start_eval_server_config5.sh (Alternative)
Purpose: Start using YAML config file (trigger every 5 gens)
Uses: eval_agent/ev2_service_config.yaml
When to use:
- When you have a standard config you reuse
- When you want to version control configs separately
1c. start_eval_server_config10.sh (Alternative)
Purpose: Start using YAML config file (trigger every 10 gens)
Note: this script demonstrates how an override would look, but the YAML config path does not support environment-variable overrides yet
2. 2_test_quick.sh
Purpose: Quick 3-generation test
Configuration Variables:
EXPERIMENT_NAME="quick_test"
NUM_GENERATIONS=3
MAX_PARALLEL_JOBS=2
META_INTERVAL=2
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_EVAL_SERVICE="--use-eval-service"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-dev"
WANDB_TAGS="quick-test eval-service"
Expected time: ~2-5 minutes
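Under the hood, the wrapper forwards these variables to run_experiment.py. A rough sketch of the command it builds is below; flags beyond those shown in the Usage section above (e.g. `--wandb-project`) are assumptions, so check the script for the exact interface:

```bash
# Sketch of the command 2_test_quick.sh assembles.
# USE_EVAL_SERVICE and USE_WANDB hold the flags themselves, so they are
# expanded unquoted and disappear when set to empty strings.
python scripts/dev/run_experiment.py \
  --experiment-name "$EXPERIMENT_NAME" \
  --num-generations "$NUM_GENERATIONS" \
  $USE_EVAL_SERVICE \
  $USE_WANDB \
  --wandb-project "$WANDB_PROJECT"   # flag name is an assumption
```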
3. 3_test_full.sh
Purpose: Full 50-generation experiment
Configuration Variables:
EXPERIMENT_NAME="full_50gen"
NUM_GENERATIONS=50
MAX_PARALLEL_JOBS=4
META_INTERVAL=10
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-experiments"
WANDB_TAGS="full-experiment eval-service circle-packing"
Expected time: ~30-60 minutes
3b. 3b_ablation_no_eval_service.sh
Purpose: Baseline experiment without eval service
Key Difference: USE_EVAL_SERVICE is NOT set
Use Case: Ablation study to compare performance with/without eval service
4. 4_check_results.sh
Purpose: Analyze experiment results
Shows:
- Best program score and validation
- Eval agent memory contents
- Auxiliary metrics statistics
- Database statistics
- WandB run links
Usage:
# Check most recent results
bash scripts/dev/4_check_results.sh
# Check specific directory
bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000
5. 5_cleanup.sh
Purpose: Clean up temporary test files
Usage:
bash scripts/dev/5_cleanup.sh
Hyperparameter Guide
Common Parameters
| Parameter | Quick Test | Full Test | Description |
|---|---|---|---|
| NUM_GENERATIONS | 3 | 50 | Total generations to evolve |
| MAX_PARALLEL_JOBS | 2 | 4 | Concurrent evaluation jobs |
| META_INTERVAL | 2 | 10 | Meta-summarizer frequency |
| LLM_MODELS | gemini-2.5-flash/pro | gemini-2.5-flash/pro | LLM models to use |
| LLM_SELECTION | ucb1 | ucb1 | Dynamic LLM selection strategy |
| LLM_TEMPERATURES | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |
Eval Service Parameters
| Parameter | Default | Description |
|---|---|---|
| USE_EVAL_SERVICE | enabled | Enable eval service |
| EVAL_SERVICE_URL | http://localhost:8765 | Service URL |
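If the service runs on another host or port, and assuming the wrappers honor an exported EVAL_SERVICE_URL (verify this against the scripts), an override could look like:

```bash
# Hypothetical override; assumes the wrapper reads EVAL_SERVICE_URL from the environment
EVAL_SERVICE_URL="http://eval-host:8765" bash scripts/dev/2_test_quick.sh
```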
WandB Parameters
| Parameter | Quick Test | Full Test | Description |
|---|---|---|---|
| USE_WANDB | enabled | enabled | Enable WandB logging |
| WANDB_PROJECT | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
| WANDB_TAGS | quick-test | full-experiment eval-service | Space-separated tags |
Customization Examples
Example 1: Change Models
# In 3_test_full.sh
LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
LLM_SELECTION="thompson" # or "ucb1", "epsilon_greedy"
LLM_TEMPERATURES="0.5 0.7"
Example 2: Change WandB Project
# In 2_test_quick.sh
WANDB_PROJECT="my-research-project"
WANDB_TAGS="my-tag another-tag"
Example 3: Change Agent Trigger Frequency
# In eval_agent/ev2_service_config.yaml
strategy:
trigger_interval: 10 # Change from 5 to 10
Example 4: Run More Generations
# In 3_test_full.sh
NUM_GENERATIONS=100
Example 5: Disable WandB
# In 2_test_quick.sh
USE_WANDB="" # Comment out or set empty
WandB Integration
What Gets Logged
Metrics:
- Combined score per generation
- Best score over time
- Correct/incorrect programs
- Auxiliary metrics (if eval service enabled)
System Info:
- Hyperparameters
- Model configuration
- Eval service status
Artifacts:
- Best program code
- Evolution database
- Agent-generated metrics
Viewing Results
After running an experiment:
# Get WandB URL from terminal output
# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT
Comparing Runs
WandB automatically tracks all runs in the same project, allowing easy comparison:
- Baseline vs. Eval Service
- Different hyperparameters
- Different models
Verification Checklist
After running an experiment, check:
Eval Service Running (for eval service experiments)
curl http://localhost:8765/api/v1/status | jq
Experiment Completed
bash scripts/dev/4_check_results.sh
Best Program Valid
cat RESULTS_DIR/best/results/correct.json  # Should show "correct": true
Auxiliary Metrics Present (for eval service experiments)
cat RESULTS_DIR/gen_20/results/metrics.json | jq '.auxiliary'  # Should show metrics after agent triggers
WandB Run Logged
- Check WandB dashboard
- Verify metrics are being logged
Agent Documentation Generated (for eval service experiments)
cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50
Troubleshooting
Error: "Eval service not running"
Solution:
bash scripts/dev/start_eval_server.sh
Error: "wandb not found"
Solution:
pip install wandb
wandb login
Error: "Port 8765 already in use"
Solution:
lsof -ti:8765 | xargs kill -9
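To see which process is actually holding the port before resorting to kill -9:

```bash
# Inspect the listener on port 8765 first; kill only if it is a stale eval service
lsof -i :8765
```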
WandB not logging
Solution:
# Re-login to WandB
wandb login
# Check if USE_WANDB is set in bash script
echo $USE_WANDB
Results Structure
examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
├── evolution_db.sqlite          # Evolution database
├── evolution_run.log            # Detailed logs
├── experiment_config.yaml       # Configuration backup
├── gen_0/
│   ├── main.py                  # Generated code
│   └── results/
│       ├── metrics.json         # All metrics (primary + auxiliary)
│       └── correct.json         # Validation status
├── gen_1/ ... gen_N/
├── best/                        # Best program (symlink)
│   ├── main.py
│   └── results/
│       └── metrics.json
└── eval_agent_memory/           # Agent workspace (if eval service used)
    ├── EVAL_AGENTS.md           # Metric documentation
    ├── auxiliary_metrics.py     # Generated metrics code
    └── service_state.json       # Service state
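A quick way to inspect a finished run from the shell, assuming jq is installed (the timestamped directory name below is illustrative):

```bash
# Inspect the best program of a finished run (directory name illustrative)
RUN=examples/circle_packing/results/full_50gen_20240203_120000
jq . "$RUN/best/results/metrics.json"       # primary + auxiliary metrics
cat "$RUN/best/results/correct.json"        # should show "correct": true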
Best Practices
- Always start with quick test to validate setup
- Use WandB tags for easy filtering of experiments
- Run ablations to demonstrate eval service impact
- Check eval_agent_memory to see what metrics were generated
- Compare WandB runs side-by-side for insights
- Save important results before cleanup (see the archiving example below)
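For the last point, one simple way to archive a run before 5_cleanup.sh removes temporary files (destination path and run name are illustrative):

```bash
# Archive a results directory before running 5_cleanup.sh (paths illustrative)
mkdir -p ~/experiment_archives
tar czf ~/experiment_archives/full_50gen.tar.gz \
  examples/circle_packing/results/full_50gen_*
```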
Experiment Workflow
1. Start eval service (Terminal 1)
   bash scripts/dev/start_eval_server.sh
2. Run quick test to validate (Terminal 2)
   bash scripts/dev/2_test_quick.sh
3. Check results
   bash scripts/dev/4_check_results.sh
4. Run full experiment if the test passes
   bash scripts/dev/3_test_full.sh
5. Compare with baseline (ablation)
   bash scripts/dev/3b_ablation_no_eval_service.sh
6. Analyze on WandB
   - Compare runs
   - Export plots
   - Share results
Expected Results
Quick Test (3 generations)
- ✅ Completes in ~5 minutes
- ✅ WandB run created
- ✅ Metrics logged per generation
- ⚠️ Agent likely not triggered (needs 10+ generations)
Full Test (50 generations)
- ✅ Completes in ~1 hour
- ✅ Agent triggers 5 times (gen 10, 20, 30, 40, 50)
- ✅ Auxiliary metrics appear in later generations
- ✅ Metric descriptions in EVAL_AGENTS.md
- ✅ Complete WandB run with all metrics
Need help? Check the main documentation or run with --help:
python scripts/dev/run_experiment.py --help
Frontier-CS Algorithmic Experiments
Parallel scripts for running Frontier-CS competitive programming problems with evolution.
Scripts
| Script | Description |
|---|---|
| run_frontier_cs_parallel_vanilla_server.sh | Vanilla baseline via eval service (agent never triggers) |
| run_frontier_cs_parallel_with_agent.sh | With eval agent (triggers every 5 generations) |
| run_frontier_cs.sh | Single problem, manual run |
Usage
# Vanilla baseline - all 172 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh
# Vanilla baseline - first 50 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50
# Vanilla baseline - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10
# With eval agent - all problems
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh
# With eval agent - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10
Comparing Results
# Compare two runs (new layout)
python tasks/frontier_cs_entry/compare_experiments.py \
results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
results/frontier_cs_algorithmic/agent_g50_20260327_130000
# Sort by score difference
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff
# Export to CSV
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv
Results Directory Structure
results/frontier_cs_algorithmic/
vanilla_g50_20260327_120000/ # one run
p0/ # per-problem results
evolution_db.sqlite
gen_0/ gen_1/ ... gen_49/
p1/
...
agent_g50_20260327_130000/ # another run
p0/
p1/
...
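A small sketch for eyeballing how far a run got, assuming one p\<N\> directory per problem as in the layout above (run name illustrative):

```bash
# Count problems started and generations finished (run name illustrative)
RUN=results/frontier_cs_algorithmic/vanilla_g50_20260327_120000
ls -d "$RUN"/p*/ | wc -l        # problems started
ls -d "$RUN"/p0/gen_*/ | wc -l  # generations completed for problem p0
```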
Prerequisites
- Docker running with go-judge service on port 8081
- tasks/Frontier-CS/ checked out with problems and solutions
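Before launching runs, a quick sanity check that go-judge is reachable (go-judge exposes a /version endpoint; adjust if your deployment differs):

```bash
# Verify the go-judge REST service answers on port 8081
curl -s http://localhost:8081/version || echo "go-judge not responding on 8081"
```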