Development Testing Scripts
Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.
Quick Start
Prerequisites
Install dependencies:
pip install wandb uv
Setup WandB (first time only):
wandb login
Option 1: Quick Test (3 generations)
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh
# Terminal 2: Run quick test
bash scripts/dev/2_test_quick.sh
Option 2: Full Experiment (50 generations)
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh
# Terminal 2: Run full experiment
bash scripts/dev/3_test_full.sh
Option 3: Ablation Study (No Eval Service)
# No need for eval service
bash scripts/dev/3b_ablation_no_eval_service.sh
Script Reference
Core Script: run_experiment.py
Universal Python script that runs experiments with configurable parameters.
Key Features:
- Single universal script for all experiments
- Command-line argument parsing
- WandB integration
- Eval service integration
- Automatic result directory creation
- Error handling and validation
Usage:
python scripts/dev/run_experiment.py \
--experiment-name "my_experiment" \
--num-generations 50 \
--use-wandb \
--use-eval-service
Bash Wrappers
Bash scripts that configure hyperparameters and call run_experiment.py.
1. start_eval_server.sh (Recommended)
Purpose: Start the Eval Service with command-line configuration
Configuration Variables:
RESULTS_DIR="/tmp/eval_service"
PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
TRIGGER_MODE="periodic"
TRIGGER_INTERVAL=5
PORT=8765
Usage:
bash scripts/dev/start_eval_server.sh
Customization: Edit the script directly to change parameters. All settings are at the top.
Why this approach:
- ✅ All config visible in one file
- ✅ Easy to create variants (copy and edit; see the example below)
- ✅ No need to switch between files
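For example, one way to spin off a variant with a different trigger interval (the variant file name here is hypothetical; `TRIGGER_INTERVAL=5` is the default shown above):

```bash
# Copy the wrapper and change one setting (variant name is hypothetical)
cp scripts/dev/start_eval_server.sh scripts/dev/start_eval_server_every10.sh
sed -i 's/^TRIGGER_INTERVAL=5/TRIGGER_INTERVAL=10/' scripts/dev/start_eval_server_every10.sh
bash scripts/dev/start_eval_server_every10.sh
```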
1b. start_eval_server_config5.sh (Alternative)
Purpose: Start using YAML config file (trigger every 5 gens)
Uses: eval_agent/ev2_service_config.yaml
When to use:
- When you have a standard config you reuse
- When you want to version control configs separately
1c. start_eval_server_config10.sh (Alternative)
Purpose: Start using YAML config file (trigger every 10 gens)
Note: this script demonstrates how an override would look, but the YAML config path does not support environment-variable overrides yet
2. 2_test_quick.sh
Purpose: Quick 3-generation test
Configuration Variables:
EXPERIMENT_NAME="quick_test"
NUM_GENERATIONS=3
MAX_PARALLEL_JOBS=2
META_INTERVAL=2
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_EVAL_SERVICE="--use-eval-service"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-dev"
WANDB_TAGS="quick-test eval-service"
Expected time: ~2-5 minutes
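Under the hood, the wrapper forwards these variables to run_experiment.py. A rough sketch of the command it builds is below; flags beyond those shown in the Usage section above (e.g. `--wandb-project`) are assumptions, so check the script for the exact interface:

```bash
# Sketch of the command 2_test_quick.sh assembles.
# USE_EVAL_SERVICE and USE_WANDB hold the flags themselves, so they are
# expanded unquoted and disappear when set to empty strings.
python scripts/dev/run_experiment.py \
  --experiment-name "$EXPERIMENT_NAME" \
  --num-generations "$NUM_GENERATIONS" \
  $USE_EVAL_SERVICE \
  $USE_WANDB \
  --wandb-project "$WANDB_PROJECT"   # flag name is an assumption
```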
3. 3_test_full.sh
Purpose: Full 50-generation experiment
Configuration Variables:
EXPERIMENT_NAME="full_50gen"
NUM_GENERATIONS=50
MAX_PARALLEL_JOBS=4
META_INTERVAL=10
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-experiments"
WANDB_TAGS="full-experiment eval-service circle-packing"
Expected time: ~30-60 minutes
3b. 3b_ablation_no_eval_service.sh
Purpose: Baseline experiment without eval service
Key Difference: USE_EVAL_SERVICE is NOT set
Use Case: Ablation study to compare performance with/without eval service
4. 4_check_results.sh
Purpose: Analyze experiment results
Shows:
- Best program score and validation
- Eval agent memory contents
- Auxiliary metrics statistics
- Database statistics
- WandB run links
Usage:
# Check most recent results
bash scripts/dev/4_check_results.sh
# Check specific directory
bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000
5. 5_cleanup.sh
Purpose: Clean up temporary test files
Usage:
bash scripts/dev/5_cleanup.sh
Hyperparameter Guide
Common Parameters
| Parameter | Quick Test | Full Test | Description |
|---|---|---|---|
| NUM_GENERATIONS | 3 | 50 | Total generations to evolve |
| MAX_PARALLEL_JOBS | 2 | 4 | Concurrent evaluation jobs |
| META_INTERVAL | 2 | 10 | Meta-summarizer frequency |
| LLM_MODELS | gemini-2.5-flash/pro | gemini-2.5-flash/pro | LLM models to use |
| LLM_SELECTION | ucb1 | ucb1 | Dynamic LLM selection strategy |
| LLM_TEMPERATURES | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |
Eval Service Parameters
| Parameter | Default | Description |
|---|---|---|
| USE_EVAL_SERVICE | enabled | Enable eval service |
| EVAL_SERVICE_URL | http://localhost:8765 | Service URL |
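If the service runs on another host or port, and assuming the wrappers honor an exported EVAL_SERVICE_URL (verify this against the scripts), an override could look like:

```bash
# Hypothetical override; assumes the wrapper reads EVAL_SERVICE_URL from the environment
EVAL_SERVICE_URL="http://eval-host:8765" bash scripts/dev/2_test_quick.sh
```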
WandB Parameters
| Parameter | Quick Test | Full Test | Description |
|---|---|---|---|
| USE_WANDB | enabled | enabled | Enable WandB logging |
| WANDB_PROJECT | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
| WANDB_TAGS | quick-test | full-experiment eval-service | Space-separated tags |
Customization Examples
Example 1: Change Models
# In 3_test_full.sh
LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
LLM_SELECTION="thompson" # or "ucb1", "epsilon_greedy"
LLM_TEMPERATURES="0.5 0.7"
Example 2: Change WandB Project
# In 2_test_quick.sh
WANDB_PROJECT="my-research-project"
WANDB_TAGS="my-tag another-tag"
Example 3: Change Agent Trigger Frequency
# In eval_agent/ev2_service_config.yaml
strategy:
trigger_interval: 10 # Change from 5 to 10
Example 4: Run More Generations
# In 3_test_full.sh
NUM_GENERATIONS=100
Example 5: Disable WandB
# In 2_test_quick.sh
USE_WANDB="" # Comment out or set empty
WandB Integration
What Gets Logged
Metrics:
- Combined score per generation
- Best score over time
- Correct/incorrect programs
- Auxiliary metrics (if eval service enabled)
System Info:
- Hyperparameters
- Model configuration
- Eval service status
Artifacts:
- Best program code
- Evolution database
- Agent-generated metrics
Viewing Results
After running an experiment:
# Get WandB URL from terminal output
# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT
Comparing Runs
WandB automatically tracks all runs in the same project, allowing easy comparison:
- Baseline vs. Eval Service
- Different hyperparameters
- Different models
Verification Checklist
After running an experiment, check:
Eval Service Running (for eval service experiments)
curl http://localhost:8765/api/v1/status | jq
Experiment Completed
bash scripts/dev/4_check_results.sh
Best Program Valid
cat RESULTS_DIR/best/results/correct.json  # Should show "correct": true
Auxiliary Metrics Present (for eval service experiments)
cat RESULTS_DIR/gen_20/results/metrics.json | jq '.auxiliary'  # Should show metrics after agent triggers
WandB Run Logged
- Check WandB dashboard
- Verify metrics are being logged
Agent Documentation Generated (for eval service experiments)
cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50
Troubleshooting
Error: "Eval service not running"
Solution:
bash scripts/dev/start_eval_server.sh
Error: "wandb not found"
Solution:
pip install wandb
wandb login
Error: "Port 8765 already in use"
Solution:
lsof -ti:8765 | xargs kill -9
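To see which process is actually holding the port before resorting to kill -9:

```bash
# Inspect the listener on port 8765 first; kill only if it is a stale eval service
lsof -i :8765
```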
WandB not logging
Solution:
# Re-login to WandB
wandb login
# Check if USE_WANDB is set in bash script
echo $USE_WANDB
Results Structure
examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
├── evolution_db.sqlite          # Evolution database
├── evolution_run.log            # Detailed logs
├── experiment_config.yaml       # Configuration backup
├── gen_0/
│   ├── main.py                  # Generated code
│   └── results/
│       ├── metrics.json         # All metrics (primary + auxiliary)
│       └── correct.json         # Validation status
├── gen_1/ ... gen_N/
├── best/                        # Best program (symlink)
│   ├── main.py
│   └── results/
│       └── metrics.json
└── eval_agent_memory/           # Agent workspace (if eval service used)
    ├── EVAL_AGENTS.md           # Metric documentation
    ├── auxiliary_metrics.py     # Generated metrics code
    └── service_state.json       # Service state
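A quick way to inspect a finished run from the shell, assuming jq is installed (the timestamped directory name below is illustrative):

```bash
# Inspect the best program of a finished run (directory name illustrative)
RUN=examples/circle_packing/results/full_50gen_20240203_120000
jq . "$RUN/best/results/metrics.json"       # primary + auxiliary metrics
cat "$RUN/best/results/correct.json"        # should show "correct": true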
Best Practices
- Always start with quick test to validate setup
- Use WandB tags for easy filtering of experiments
- Run ablations to demonstrate eval service impact
- Check eval_agent_memory to see what metrics were generated
- Compare WandB runs side-by-side for insights
- Save important results before cleanup (see the archiving example below)
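For the last point, one simple way to archive a run before 5_cleanup.sh removes temporary files (destination path and run name are illustrative):

```bash
# Archive a results directory before running 5_cleanup.sh (paths illustrative)
mkdir -p ~/experiment_archives
tar czf ~/experiment_archives/full_50gen.tar.gz \
  examples/circle_packing/results/full_50gen_*
```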
Experiment Workflow
1. Start eval service (Terminal 1)
   bash scripts/dev/start_eval_server.sh
2. Run quick test to validate (Terminal 2)
   bash scripts/dev/2_test_quick.sh
3. Check results
   bash scripts/dev/4_check_results.sh
4. Run full experiment if the test passes
   bash scripts/dev/3_test_full.sh
5. Compare with baseline (ablation)
   bash scripts/dev/3b_ablation_no_eval_service.sh
6. Analyze on WandB
   - Compare runs
   - Export plots
   - Share results
Expected Results
Quick Test (3 generations)
- ✅ Completes in ~5 minutes
- ✅ WandB run created
- ✅ Metrics logged per generation
- ⚠️ Agent likely not triggered (needs 10+ generations)
Full Test (50 generations)
- ✅ Completes in ~1 hour
- ✅ Agent triggers 5 times (gen 10, 20, 30, 40, 50)
- ✅ Auxiliary metrics appear in later generations
- ✅ Metric descriptions in EVAL_AGENTS.md
- ✅ Complete WandB run with all metrics
Need help? Check the main documentation or run with --help:
python scripts/dev/run_experiment.py --help
Frontier-CS Algorithmic Experiments
Parallel scripts for running Frontier-CS competitive programming problems with evolution.
Scripts
| Script | Description |
|---|---|
| run_frontier_cs_parallel_vanilla_server.sh | Vanilla baseline via eval service (agent never triggers) |
| run_frontier_cs_parallel_with_agent.sh | With eval agent (triggers every 5 generations) |
| run_frontier_cs.sh | Single problem, manual run |
Usage
# Vanilla baseline - all 172 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh
# Vanilla baseline - first 50 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50
# Vanilla baseline - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10
# With eval agent - all problems
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh
# With eval agent - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10
Comparing Results
# Compare two runs (new layout)
python tasks/frontier_cs_entry/compare_experiments.py \
results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
results/frontier_cs_algorithmic/agent_g50_20260327_130000
# Sort by score difference
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff
# Export to CSV
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv
Results Directory Structure
results/frontier_cs_algorithmic/
vanilla_g50_20260327_120000/ # one run
p0/ # per-problem results
evolution_db.sqlite
gen_0/ gen_1/ ... gen_49/
p1/
...
agent_g50_20260327_130000/ # another run
p0/
p1/
...
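A small sketch for eyeballing how far a run got, assuming one p\<N\> directory per problem as in the layout above (run name illustrative):

```bash
# Count problems started and generations finished (run name illustrative)
RUN=results/frontier_cs_algorithmic/vanilla_g50_20260327_120000
ls -d "$RUN"/p*/ | wc -l        # problems started
ls -d "$RUN"/p0/gen_*/ | wc -l  # generations completed for problem p0
```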
Prerequisites
- Docker running with go-judge service on port 8081
- tasks/Frontier-CS/ checked out with problems and solutions
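Before launching runs, a quick sanity check that go-judge is reachable (go-judge exposes a /version endpoint; adjust if your deployment differs):

```bash
# Verify the go-judge REST service answers on port 8081
curl -s http://localhost:8081/version || echo "go-judge not responding on 8081"
```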