
Development Testing Scripts

Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.

🚀 Quick Start

Prerequisites

  1. Install dependencies:

    pip install wandb uv
    
  2. Set up WandB (first time only):

    wandb login
    

Option 1: Quick Test (3 generations)

# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh

# Terminal 2: Run quick test
bash scripts/dev/2_test_quick.sh

Option 2: Full Experiment (50 generations)

# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh

# Terminal 2: Run full experiment
bash scripts/dev/3_test_full.sh

Option 3: Ablation Study (No Eval Service)

# No need for eval service
bash scripts/dev/3b_ablation_no_eval_service.sh

📜 Script Reference

Core Script: run_experiment.py

Universal Python script that runs experiments with configurable parameters.

Key Features:

  • Single universal script for all experiments
  • Command-line argument parsing
  • WandB integration
  • Eval service integration
  • Automatic result directory creation
  • Error handling and validation

Usage:

python scripts/dev/run_experiment.py \
    --experiment-name "my_experiment" \
    --num-generations 50 \
    --use-wandb \
    --use-eval-service
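
The flags above map onto ordinary argparse options. A minimal sketch of how such a CLI might be wired (illustrative only, not the script's actual source; the default values are assumptions):

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run a ShinkaEvolve experiment.")
    parser.add_argument("--experiment-name", required=True,
                        help="Name used for the results directory")
    parser.add_argument("--num-generations", type=int, default=3,
                        help="Total generations to evolve")
    parser.add_argument("--use-wandb", action="store_true",
                        help="Enable WandB logging")
    parser.add_argument("--use-eval-service", action="store_true",
                        help="Enable the eval service integration")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```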

Bash Wrappers

Bash scripts that configure hyperparameters and call run_experiment.py.


1. start_eval_server.sh (Recommended)

Purpose: Start the Eval Service with command-line configuration

Configuration Variables:

RESULTS_DIR="/tmp/eval_service"
PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
TRIGGER_MODE="periodic"
TRIGGER_INTERVAL=5
PORT=8765

Usage:

bash scripts/dev/start_eval_server.sh

Customization: Edit the script directly to change parameters. All settings are at the top.

Why this approach:

  • ✅ All config visible in one file
  • ✅ Easy to create variants (copy & edit)
  • ✅ No need to switch between files

1b. start_eval_server_config5.sh (Alternative)

Purpose: Start the Eval Service from a YAML config file (trigger every 5 generations)

Uses: eval_agent/ev2_service_config.yaml

When to use:

  • When you have a standard config you reuse
  • When you want to version control configs separately

1c. start_eval_server_config10.sh (Alternative)

Purpose: Start the Eval Service from a YAML config file (trigger every 10 generations)

Note: The script demonstrates how an override would look, but the YAML config doesn't support environment-variable overrides yet


2. 2_test_quick.sh

Purpose: Quick 3-generation test

Configuration Variables:

EXPERIMENT_NAME="quick_test"
NUM_GENERATIONS=3
MAX_PARALLEL_JOBS=2
META_INTERVAL=2

LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"

USE_EVAL_SERVICE="--use-eval-service"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-dev"
WANDB_TAGS="quick-test eval-service"

Expected time: ~2-5 minutes


3. 3_test_full.sh

Purpose: Full 50-generation experiment

Configuration Variables:

EXPERIMENT_NAME="full_50gen"
NUM_GENERATIONS=50
MAX_PARALLEL_JOBS=4
META_INTERVAL=10

LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"

USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-experiments"
WANDB_TAGS="full-experiment eval-service circle-packing"

Expected time: ~30-60 minutes


3b. 3b_ablation_no_eval_service.sh

Purpose: Baseline experiment without eval service

Key Difference: USE_EVAL_SERVICE is NOT set

Use Case: Ablation study to compare performance with/without eval service


4. 4_check_results.sh

Purpose: Analyze experiment results

Shows:

  • Best program score and validation
  • Eval agent memory contents
  • Auxiliary metrics statistics
  • Database statistics
  • WandB run links

Usage:

# Check most recent results
bash scripts/dev/4_check_results.sh

# Check specific directory
bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000

5. 5_cleanup.sh

Purpose: Clean up temporary test files

Usage:

bash scripts/dev/5_cleanup.sh

🎛️ Hyperparameter Guide

Common Parameters

| Parameter | Quick Test | Full Test | Description |
|---|---|---|---|
| NUM_GENERATIONS | 3 | 50 | Total generations to evolve |
| MAX_PARALLEL_JOBS | 2 | 4 | Concurrent evaluation jobs |
| META_INTERVAL | 2 | 10 | Meta-summarizer frequency |
| LLM_MODELS | gemini-2.5-flash/pro | gemini-2.5-flash/pro | LLM models to use |
| LLM_SELECTION | ucb1 | ucb1 | Dynamic LLM selection strategy |
| LLM_TEMPERATURES | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |

Eval Service Parameters

| Parameter | Default | Description |
|---|---|---|
| USE_EVAL_SERVICE | enabled | Enable eval service |
| EVAL_SERVICE_URL | http://localhost:8765 | Service URL |
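
For a programmatic readiness check before launching an experiment, a minimal sketch like the following polls the status endpoint used in the verification checklist below; the retry count and timeout are arbitrary choices, not project settings:

```python
import time
import urllib.error
import urllib.request

STATUS_URL = "http://localhost:8765/api/v1/status"  # default port from start_eval_server.sh

def wait_for_eval_service(retries: int = 10, delay: float = 1.0) -> bool:
    """Return True once the eval service responds, False if all retries fail."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(STATUS_URL, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, TimeoutError):
            time.sleep(delay)  # service not up yet; wait and retry
    return False

if __name__ == "__main__":
    print("eval service ready" if wait_for_eval_service() else "eval service unreachable")
```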

WandB Parameters

| Parameter | Quick Test | Full Test | Description |
|---|---|---|---|
| USE_WANDB | enabled | enabled | Enable WandB logging |
| WANDB_PROJECT | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
| WANDB_TAGS | quick-test eval-service | full-experiment eval-service circle-packing | Space-separated tags |

🔧 Customization Examples

Example 1: Change Models

# In 3_test_full.sh
LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
LLM_SELECTION="thompson"  # or "ucb1", "epsilon_greedy"
LLM_TEMPERATURES="0.5 0.7"

Example 2: Change WandB Project

# In 2_test_quick.sh
WANDB_PROJECT="my-research-project"
WANDB_TAGS="my-tag another-tag"

Example 3: Change Agent Trigger Frequency

# In eval_agent/ev2_service_config.yaml
strategy:
  trigger_interval: 10  # Change from 5 to 10
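
To confirm which interval is actually loaded, a quick check (assuming PyYAML is installed and the key layout shown in the snippet above):

```python
import yaml  # requires PyYAML: pip install pyyaml

with open("eval_agent/ev2_service_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Key path taken from the snippet above; adjust if the file's layout differs.
print("trigger_interval =", cfg["strategy"]["trigger_interval"])
```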

Example 4: Run More Generations

# In 3_test_full.sh
NUM_GENERATIONS=100

Example 5: Disable WandB

# In 2_test_quick.sh
USE_WANDB=""  # Comment out or set empty

📊 WandB Integration

What Gets Logged

  1. Metrics:

    • Combined score per generation
    • Best score over time
    • Correct/incorrect programs
    • Auxiliary metrics (if eval service enabled)
  2. System Info:

    • Hyperparameters
    • Model configuration
    • Eval service status
  3. Artifacts:

    • Best program code
    • Evolution database
    • Agent-generated metrics
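
The actual logging calls live in run_experiment.py; as a rough sketch of the pattern (not the project's code), per-generation metrics and a final artifact might be logged like this, where the metric key names are illustrative assumptions:

```python
import wandb

run = wandb.init(
    project="shinkaevolve-dev",           # WANDB_PROJECT
    tags=["quick-test", "eval-service"],  # WANDB_TAGS
    config={"num_generations": 3},        # hyperparameters
)

for gen in range(3):
    # Key names here are illustrative, not the project's actual schema.
    run.log({"combined_score": 0.0, "best_score": 0.0}, step=gen)

# Upload the best program as an artifact.
artifact = wandb.Artifact("best_program", type="code")
artifact.add_file("best/main.py")  # hypothetical path within the results dir
run.log_artifact(artifact)
run.finish()
```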

Viewing Results

After running an experiment:

# Get WandB URL from terminal output
# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT

Comparing Runs

WandB automatically tracks all runs in the same project, allowing easy comparison:

  • Baseline vs. Eval Service
  • Different hyperparameters
  • Different models

🔍 Verification Checklist

After running an experiment, check:

  • Eval Service Running (for eval service experiments)

    curl http://localhost:8765/api/v1/status | jq
    
  • Experiment Completed

    bash scripts/dev/4_check_results.sh
    
  • Best Program Valid

    cat RESULTS_DIR/best/results/correct.json
    # Should show "correct": true
    
  • Auxiliary Metrics Present (for eval service experiments)

    cat RESULTS_DIR/gen_20/results/metrics.json | jq '.auxiliary'
    # Should show metrics after agent triggers
    
  • WandB Run Logged

    • Check WandB dashboard
    • Verify metrics are being logged
  • Agent Documentation Generated (for eval service experiments)

    cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50
    

πŸ› Troubleshooting

Error: "Eval service not running"

Solution:

bash scripts/dev/start_eval_server.sh

Error: "wandb not found"

Solution:

pip install wandb
wandb login

Error: "Port 8765 already in use"

Solution:

lsof -ti:8765 | xargs kill -9

WandB not logging

Solution:

# Re-login to WandB
wandb login

# Check if USE_WANDB is set in bash script
echo $USE_WANDB

📁 Results Structure

examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
├── evolution_db.sqlite                 # Evolution database
├── evolution_run.log                   # Detailed logs
├── experiment_config.yaml              # Configuration backup
├── gen_0/
│   ├── main.py                        # Generated code
│   └── results/
│       ├── metrics.json               # All metrics (primary + auxiliary)
│       └── correct.json               # Validation status
├── gen_1/ ... gen_N/
├── best/                              # Best program (symlink)
│   ├── main.py
│   └── results/
│       └── metrics.json
└── eval_agent_memory/                 # Agent workspace (if eval service used)
    ├── EVAL_AGENTS.md                 # Metric documentation
    ├── auxiliary_metrics.py           # Generated metrics code
    └── service_state.json             # Service state
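
A small sketch for spot-checking a finished run from Python rather than with cat/jq; file names follow the layout above, and "auxiliary" is the key queried in the verification checklist:

```python
import json
import sys
from pathlib import Path

results_dir = Path(sys.argv[1])  # e.g. examples/circle_packing/results/full_50gen_<timestamp>

# Validation status of the best program.
correct = json.loads((results_dir / "best" / "results" / "correct.json").read_text())
print("best program correct:", correct.get("correct"))

# Which generations produced auxiliary metrics (eval service runs only).
for metrics_file in sorted(results_dir.glob("gen_*/results/metrics.json")):
    metrics = json.loads(metrics_file.read_text())
    if "auxiliary" in metrics:
        print(metrics_file.parent.parent.name, "has auxiliary metrics")
```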

💡 Best Practices

  1. Always start with quick test to validate setup
  2. Use WandB tags for easy filtering of experiments
  3. Run ablations to demonstrate eval service impact
  4. Check eval_agent_memory to see what metrics were generated
  5. Compare WandB runs side-by-side for insights
  6. Save important results before cleanup

🎯 Experiment Workflow

  1. Start eval service (Terminal 1)

    bash scripts/dev/start_eval_server.sh
    
  2. Run quick test to validate (Terminal 2)

    bash scripts/dev/2_test_quick.sh
    
  3. Check results

    bash scripts/dev/4_check_results.sh
    
  4. Run full experiment if test passes

    bash scripts/dev/3_test_full.sh
    
  5. Compare with baseline (ablation)

    bash scripts/dev/3b_ablation_no_eval_service.sh
    
  6. Analyze on WandB

    • Compare runs
    • Export plots
    • Share results

📈 Expected Results

Quick Test (3 generations)

  • ✅ Completes in ~5 minutes
  • ✅ WandB run created
  • ✅ Metrics logged per generation
  • ⚠️ Agent likely not triggered (need 10+ gens)

Full Test (50 generations)

  • ✅ Completes in ~1 hour
  • ✅ Agent triggers 5 times (gen 10, 20, 30, 40, 50)
  • ✅ Auxiliary metrics appear in later generations
  • ✅ Metric descriptions in EVAL_AGENTS.md
  • ✅ Complete WandB run with all metrics

Need help? Check the main documentation or run with --help:

python scripts/dev/run_experiment.py --help

Frontier-CS Algorithmic Experiments

Parallel scripts for running Frontier-CS competitive programming problems with evolution.

Scripts

| Script | Description |
|---|---|
| run_frontier_cs_parallel_vanilla_server.sh | Vanilla baseline via eval service (agent never triggers) |
| run_frontier_cs_parallel_with_agent.sh | With eval agent (triggers every 5 generations) |
| run_frontier_cs.sh | Single problem, manual run |

Usage

# Vanilla baseline - all 172 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh

# Vanilla baseline - first 50 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50

# Vanilla baseline - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10

# With eval agent - all problems
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh

# With eval agent - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10

Comparing Results

# Compare two runs (new layout)
python tasks/frontier_cs_entry/compare_experiments.py \
    results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
    results/frontier_cs_algorithmic/agent_g50_20260327_130000

# Sort by score difference
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff

# Export to CSV
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv

Results Directory Structure

results/frontier_cs_algorithmic/
  vanilla_g50_20260327_120000/    # one run
    p0/                           # per-problem results
      evolution_db.sqlite
      gen_0/ gen_1/ ... gen_49/
    p1/
    ...
  agent_g50_20260327_130000/      # another run
    p0/
    p1/
    ...
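
compare_experiments.py is the supported comparison tool; for a quick progress glance while a run is still going, a minimal sketch like this counts completed generations per problem (it only assumes the p*/gen_* directory naming shown above):

```python
import sys
from pathlib import Path

run_dir = Path(sys.argv[1])  # e.g. results/frontier_cs_algorithmic/vanilla_g50_<timestamp>

for problem_dir in sorted(run_dir.glob("p*")):
    n_gens = len(list(problem_dir.glob("gen_*")))
    print(f"{problem_dir.name}: {n_gens} generations completed")
```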

Prerequisites

  • Docker running with go-judge service on port 8081
  • tasks/Frontier-CS/ checked out with problems and solutions