# Development Testing Scripts
Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.
## πŸš€ Quick Start
### Prerequisites
1. **Install dependencies:**
```bash
pip install wandb uv
```
2. **Setup WandB (first time only):**
```bash
wandb login
```
### Option 1: Quick Test (3 generations)
```bash
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh
# Terminal 2: Run quick test
bash scripts/dev/2_test_quick.sh
```
### Option 2: Full Experiment (50 generations)
```bash
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh
# Terminal 2: Run full experiment
bash scripts/dev/3_test_full.sh
```
### Option 3: Ablation Study (No Eval Service)
```bash
# No need for eval service
bash scripts/dev/3b_ablation_no_eval_service.sh
```
---
## πŸ“œ Script Reference
### Core Script: `run_experiment.py`
Universal Python script that runs experiments with configurable parameters.
**Key Features:**
- Single universal script for all experiments
- Command-line argument parsing
- WandB integration
- Eval service integration
- Automatic result directory creation
- Error handling and validation
**Usage:**
```bash
python scripts/dev/run_experiment.py \
    --experiment-name "my_experiment" \
    --num-generations 50 \
    --use-wandb \
    --use-eval-service
```
### Bash Wrappers
Bash scripts that configure hyperparameters and call `run_experiment.py`.
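A new wrapper is just a handful of variables plus the final `python` call. The sketch below is illustrative (the variable names mirror the ones documented in the sections that follow, but check the real wrappers before copying it verbatim):

```shell
#!/usr/bin/env bash
# Illustrative wrapper sketch -- not one of the shipped scripts.
EXPERIMENT_NAME="my_variant"
NUM_GENERATIONS=10
USE_WANDB="--use-wandb"   # set to "" to disable WandB logging

# Build the command as an array so empty flag variables are dropped cleanly.
CMD=(python scripts/dev/run_experiment.py
     --experiment-name "$EXPERIMENT_NAME"
     --num-generations "$NUM_GENERATIONS")
if [ -n "$USE_WANDB" ]; then CMD+=("$USE_WANDB"); fi

# To launch for real, replace the echo below with: "${CMD[@]}"
echo "Running: ${CMD[*]}"
```

Copy an existing wrapper, edit the variables at the top, and keep the invocation block unchanged.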
---
### 1. `start_eval_server.sh` (Recommended)
**Purpose:** Start the Eval Service with command-line configuration
**Configuration Variables:**
```bash
RESULTS_DIR="/tmp/eval_service"
PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
TRIGGER_MODE="periodic"
TRIGGER_INTERVAL=5
PORT=8765
```
**Usage:**
```bash
bash scripts/dev/start_eval_server.sh
```
**Customization:**
Edit the script directly to change parameters. All settings are at the top.
**Why this approach:**
- βœ… All config visible in one file
- βœ… Easy to create variants (copy & edit)
- βœ… No need to switch between files
### 1b. `start_eval_server_config5.sh` (Alternative)
**Purpose:** Start using YAML config file (trigger every 5 gens)
**Uses:** `eval_agent/ev2_service_config.yaml`
**When to use:**
- When you have a standard config you reuse
- When you want to version control configs separately
### 1c. `start_eval_server_config10.sh` (Alternative)
**Purpose:** Start using YAML config file (trigger every 10 gens)
**Note:** The script illustrates how an override would look, but the YAML loader does not yet support environment-variable overrides
---
### 2. `2_test_quick.sh`
**Purpose:** Quick 3-generation test
**Configuration Variables:**
```bash
EXPERIMENT_NAME="quick_test"
NUM_GENERATIONS=3
MAX_PARALLEL_JOBS=2
META_INTERVAL=2
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_EVAL_SERVICE="--use-eval-service"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-dev"
WANDB_TAGS="quick-test eval-service"
```
**Expected time:** ~2-5 minutes
---
### 3. `3_test_full.sh`
**Purpose:** Full 50-generation experiment
**Configuration Variables:**
```bash
EXPERIMENT_NAME="full_50gen"
NUM_GENERATIONS=50
MAX_PARALLEL_JOBS=4
META_INTERVAL=10
LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-experiments"
WANDB_TAGS="full-experiment eval-service circle-packing"
```
**Expected time:** ~30-60 minutes
---
### 3b. `3b_ablation_no_eval_service.sh`
**Purpose:** Baseline experiment without eval service
**Key Difference:** `USE_EVAL_SERVICE` is NOT set
**Use Case:** Ablation study to compare performance with/without eval service
---
### 4. `4_check_results.sh`
**Purpose:** Analyze experiment results
**Shows:**
- Best program score and validation
- Eval agent memory contents
- Auxiliary metrics statistics
- Database statistics
- WandB run links
**Usage:**
```bash
# Check most recent results
bash scripts/dev/4_check_results.sh
# Check specific directory
bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000
```
---
### 5. `5_cleanup.sh`
**Purpose:** Clean up temporary test files
**Usage:**
```bash
bash scripts/dev/5_cleanup.sh
```
---
## πŸŽ›οΈ Hyperparameter Guide
### Common Parameters
| Parameter | Quick Test | Full Test | Description |
|-----------|-----------|-----------|-------------|
| `NUM_GENERATIONS` | 3 | 50 | Total generations to evolve |
| `MAX_PARALLEL_JOBS` | 2 | 4 | Concurrent evaluation jobs |
| `META_INTERVAL` | 2 | 10 | Meta-summarizer frequency |
| `LLM_MODELS` | gemini-2.5-flash/pro | gemini-2.5-flash/pro | LLM models to use |
| `LLM_SELECTION` | ucb1 | ucb1 | Dynamic LLM selection strategy |
| `LLM_TEMPERATURES` | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |
### Eval Service Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `USE_EVAL_SERVICE` | enabled | Enable eval service |
| `EVAL_SERVICE_URL` | http://localhost:8765 | Service URL |
### WandB Parameters
| Parameter | Quick Test | Full Test | Description |
|-----------|-----------|-----------|-------------|
| `USE_WANDB` | enabled | enabled | Enable WandB logging |
| `WANDB_PROJECT` | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
| `WANDB_TAGS` | quick-test | full-experiment eval-service | Space-separated tags |
---
## πŸ”§ Customization Examples
### Example 1: Change Models
```bash
# In 3_test_full.sh
LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
LLM_SELECTION="thompson" # or "ucb1", "epsilon_greedy"
LLM_TEMPERATURES="0.5 0.7"
```
### Example 2: Change WandB Project
```bash
# In 2_test_quick.sh
WANDB_PROJECT="my-research-project"
WANDB_TAGS="my-tag another-tag"
```
### Example 3: Change Agent Trigger Frequency
```yaml
# In eval_agent/ev2_service_config.yaml
strategy:
  trigger_interval: 10  # Change from 5 to 10
```
### Example 4: Run More Generations
```bash
# In 3_test_full.sh
NUM_GENERATIONS=100
```
### Example 5: Disable WandB
```bash
# In 2_test_quick.sh
USE_WANDB=""  # leave empty (or comment the line out) to disable WandB
```
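Setting a flag variable to the empty string works because an unquoted empty variable expands to zero arguments, so the flag simply disappears from the command line. A small demo (illustrative, not taken from the real scripts):

```shell
# count_args reports how many arguments it received.
count_args() { echo "argc=$#"; }

USE_WANDB="--use-wandb"
count_args $USE_WANDB   # prints argc=1 -> flag passed
USE_WANDB=""
count_args $USE_WANDB   # prints argc=0 -> flag omitted entirely
```

This is also why the wrappers expand these variables unquoted: quoting `"$USE_WANDB"` would pass an empty string as a (broken) positional argument.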
---
## πŸ“Š WandB Integration
### What Gets Logged
1. **Metrics:**
- Combined score per generation
- Best score over time
- Correct/incorrect programs
- Auxiliary metrics (if eval service enabled)
2. **System Info:**
- Hyperparameters
- Model configuration
- Eval service status
3. **Artifacts:**
- Best program code
- Evolution database
- Agent-generated metrics
### Viewing Results
After running an experiment:
```bash
# Get WandB URL from terminal output
# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT
```
### Comparing Runs
WandB automatically tracks all runs in the same project, allowing easy comparison:
- Baseline vs. Eval Service
- Different hyperparameters
- Different models
---
## πŸ” Verification Checklist
After running an experiment, check:
- [ ] **Eval Service Running** (for eval service experiments)
```bash
curl http://localhost:8765/api/v1/status | jq
```
- [ ] **Experiment Completed**
```bash
bash scripts/dev/4_check_results.sh
```
- [ ] **Best Program Valid**
```bash
cat "$RESULTS_DIR/best/results/correct.json"   # RESULTS_DIR = your run directory
# Should show "correct": true
```
- [ ] **Auxiliary Metrics Present** (for eval service experiments)
```bash
jq '.auxiliary' "$RESULTS_DIR/gen_20/results/metrics.json"
# Should show metrics after agent triggers
```
- [ ] **WandB Run Logged**
- Check WandB dashboard
- Verify metrics are being logged
- [ ] **Agent Documentation Generated** (for eval service experiments)
```bash
head -50 "$RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md"
```
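The file-based checks above can be rolled into one hypothetical helper (the function name is illustrative; the paths follow the results layout documented in this README, and it avoids `jq` so it runs anywhere):

```shell
# Hypothetical one-shot verifier for a finished run.
check_run() {
    local dir="$1"
    if grep -q '"correct": *true' "$dir/best/results/correct.json" 2>/dev/null; then
        echo "best program: valid"
    else
        echo "best program: invalid or missing"
    fi
    if [ -f "$dir/eval_agent_memory/EVAL_AGENTS.md" ]; then
        echo "agent docs: present"
    else
        echo "agent docs: missing"
    fi
}

check_run "examples/circle_packing/results/my_run"
```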
---
## πŸ› Troubleshooting
### Error: "Eval service not running"
**Solution:**
```bash
bash scripts/dev/start_eval_server.sh
```
### Error: "wandb not found"
**Solution:**
```bash
pip install wandb
wandb login
```
### Error: "Port 8765 already in use"
**Solution:**
```bash
lsof -ti:8765 | xargs kill -9
```
### WandB not logging
**Solution:**
```bash
# Re-login to WandB
wandb login
# Check if USE_WANDB is set in bash script
echo $USE_WANDB
```
---
## πŸ“ Results Structure
```
examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
β”œβ”€β”€ evolution_db.sqlite        # Evolution database
β”œβ”€β”€ evolution_run.log          # Detailed logs
β”œβ”€β”€ experiment_config.yaml     # Configuration backup
β”œβ”€β”€ gen_0/
β”‚   β”œβ”€β”€ main.py                # Generated code
β”‚   └── results/
β”‚       β”œβ”€β”€ metrics.json       # All metrics (primary + auxiliary)
β”‚       └── correct.json       # Validation status
β”œβ”€β”€ gen_1/ ... gen_N/
β”œβ”€β”€ best/                      # Best program (symlink)
β”‚   β”œβ”€β”€ main.py
β”‚   └── results/
β”‚       └── metrics.json
└── eval_agent_memory/         # Agent workspace (if eval service used)
    β”œβ”€β”€ EVAL_AGENTS.md         # Metric documentation
    β”œβ”€β”€ auxiliary_metrics.py   # Generated metrics code
    └── service_state.json     # Service state
```
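To see which generations picked up agent-generated metrics across this layout, a small sketch like the following can help (illustrative helper; the `auxiliary` key is the one checked in the verification checklist, and it uses `python3` for JSON parsing so `jq` is not required):

```shell
# Sketch: report per-generation whether auxiliary metrics are present.
list_aux() {
    local dir="$1" f gen
    for f in "$dir"/gen_*/results/metrics.json; do
        [ -f "$f" ] || continue
        # Recover the gen_N directory name from the metrics path.
        gen=$(basename "$(dirname "$(dirname "$f")")")
        python3 - "$f" "$gen" <<'PY'
import json, sys
aux = json.load(open(sys.argv[1])).get("auxiliary")
print(f"{sys.argv[2]}: {'yes' if aux else 'no'}")
PY
    done
}

list_aux "examples/circle_packing/results/my_run"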
---
## πŸ’‘ Best Practices
1. **Always start with quick test** to validate setup
2. **Use WandB tags** for easy filtering of experiments
3. **Run ablations** to demonstrate eval service impact
4. **Check eval_agent_memory** to see what metrics were generated
5. **Compare WandB runs** side-by-side for insights
6. **Save important results** before cleanup
---
## 🎯 Experiment Workflow
1. **Start eval service** (Terminal 1)
```bash
bash scripts/dev/start_eval_server.sh
```
2. **Run quick test** to validate (Terminal 2)
```bash
bash scripts/dev/2_test_quick.sh
```
3. **Check results**
```bash
bash scripts/dev/4_check_results.sh
```
4. **Run full experiment** if test passes
```bash
bash scripts/dev/3_test_full.sh
```
5. **Compare with baseline** (ablation)
```bash
bash scripts/dev/3b_ablation_no_eval_service.sh
```
6. **Analyze on WandB**
- Compare runs
- Export plots
- Share results
---
## πŸ“ˆ Expected Results
### Quick Test (3 generations)
- βœ… Completes in ~5 minutes
- βœ… WandB run created
- βœ… Metrics logged per generation
- ⚠️ Agent likely not triggered (need 10+ gens)
### Full Test (50 generations)
- βœ… Completes in ~1 hour
- βœ… Agent triggers 5 times (gen 10, 20, 30, 40, 50)
- βœ… Auxiliary metrics appear in later generations
- βœ… Metric descriptions in EVAL_AGENTS.md
- βœ… Complete WandB run with all metrics
---
**Need help?** Check the main documentation or run with `--help`:
```bash
python scripts/dev/run_experiment.py --help
```
---
## Frontier-CS Algorithmic Experiments
Parallel scripts for running Frontier-CS competitive programming problems with evolution.
### Scripts
| Script | Description |
|--------|-------------|
| `run_frontier_cs_parallel_vanilla_server.sh` | Vanilla baseline via eval service (agent never triggers) |
| `run_frontier_cs_parallel_with_agent.sh` | With eval agent (triggers every 5 generations) |
| `run_frontier_cs.sh` | Single problem, manual run |
### Usage
```bash
# Vanilla baseline - all 172 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh
# Vanilla baseline - first 50 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50
# Vanilla baseline - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10
# With eval agent - all problems
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh
# With eval agent - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10
```
### Comparing Results
```bash
# Compare two runs (new layout)
python tasks/frontier_cs_entry/compare_experiments.py \
results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
results/frontier_cs_algorithmic/agent_g50_20260327_130000
# Sort by score difference
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff
# Export to CSV
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv
```
### Results Directory Structure
```
results/frontier_cs_algorithmic/
β”œβ”€β”€ vanilla_g50_20260327_120000/      # one run
β”‚   β”œβ”€β”€ p0/                           # per-problem results
β”‚   β”‚   β”œβ”€β”€ evolution_db.sqlite
β”‚   β”‚   └── gen_0/ gen_1/ ... gen_49/
β”‚   β”œβ”€β”€ p1/
β”‚   └── ...
└── agent_g50_20260327_130000/        # another run
    β”œβ”€β”€ p0/
    β”œβ”€β”€ p1/
    └── ...
```
### Prerequisites
- Docker running with go-judge service on port 8081
- `tasks/Frontier-CS/` checked out with problems and solutions