# Development Testing Scripts

Quick scripts for testing the ShinkaEvolve + Eval Service integration with WandB logging.

## 🚀 Quick Start

### Prerequisites

1. **Install dependencies:**
   ```bash
   pip install wandb uv
   ```

2. **Setup WandB (first time only):**
   ```bash
   wandb login
   ```

### Option 1: Quick Test (3 generations)

```bash
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh

# Terminal 2: Run quick test
bash scripts/dev/2_test_quick.sh
```

### Option 2: Full Experiment (50 generations)

```bash
# Terminal 1: Start eval service
bash scripts/dev/start_eval_server.sh

# Terminal 2: Run full experiment
bash scripts/dev/3_test_full.sh
```

### Option 3: Ablation Study (No Eval Service)

```bash
# No need for eval service
bash scripts/dev/3b_ablation_no_eval_service.sh
```

---

## 📜 Script Reference

### Core Script: `run_experiment.py`

Universal Python script that runs experiments with configurable parameters.

**Key Features:**
- Single universal script for all experiments
- Command-line argument parsing
- WandB integration
- Eval service integration
- Automatic result directory creation
- Error handling and validation

**Usage:**
```bash
python scripts/dev/run_experiment.py \
    --experiment-name "my_experiment" \
    --num-generations 50 \
    --use-wandb \
    --use-eval-service
```
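
For orientation, the command-line surface above could be parsed roughly as follows. This is a minimal sketch, not the actual contents of `run_experiment.py`: only the four flags from the usage example come from this README, while the `build_parser` name, the default for `--num-generations`, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags shown in the usage example."""
    parser = argparse.ArgumentParser(description="Run an evolution experiment (sketch).")
    parser.add_argument("--experiment-name", required=True,
                        help="Name used to label the run")
    parser.add_argument("--num-generations", type=int, default=3,
                        help="Total generations to evolve (default is an assumption)")
    # Boolean switches: leaving the flag out means the feature is disabled.
    parser.add_argument("--use-wandb", action="store_true")
    parser.add_argument("--use-eval-service", action="store_true")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--experiment-name", "my_experiment", "--num-generations", "50", "--use-wandb"]
)
print(args.experiment_name, args.num_generations, args.use_wandb, args.use_eval_service)
# prints: my_experiment 50 True False
```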

### Bash Wrappers

Bash scripts that configure hyperparameters and call `run_experiment.py`.

---

### 1. `start_eval_server.sh` (Recommended)
**Purpose:** Start the Eval Service with command-line configuration

**Configuration Variables:**
```bash
RESULTS_DIR="/tmp/eval_service"
PRIMARY_EVALUATOR="examples/circle_packing/evaluate_ori.py"
TRIGGER_MODE="periodic"
TRIGGER_INTERVAL=5
PORT=8765
```

**Usage:**
```bash
bash scripts/dev/start_eval_server.sh
```

**Customization:**
Edit the script directly to change parameters. All settings are at the top.

**Why this approach:**
- ✅ All config visible in one file
- ✅ Easy to create variants (copy & edit)
- ✅ No need to switch between files

### 1b. `start_eval_server_config5.sh` (Alternative)
**Purpose:** Start the Eval Service from a YAML config file (agent triggers every 5 generations)

**Uses:** `eval_agent/ev2_service_config.yaml`

**When to use:**
- When you have a standard config you reuse
- When you want to version control configs separately

### 1c. `start_eval_server_config10.sh` (Alternative)
**Purpose:** Start the Eval Service from a YAML config file (agent triggers every 10 generations)

**Note:** This script shows how an override would look, but the YAML config does not yet support environment-variable overrides.

---

### 2. `2_test_quick.sh`
**Purpose:** Quick 3-generation test

**Configuration Variables:**
```bash
EXPERIMENT_NAME="quick_test"
NUM_GENERATIONS=3
MAX_PARALLEL_JOBS=2
META_INTERVAL=2

LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"

USE_EVAL_SERVICE="--use-eval-service"
USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-dev"
WANDB_TAGS="quick-test eval-service"
```

**Expected time:** ~2-5 minutes

---

### 3. `3_test_full.sh`
**Purpose:** Full 50-generation experiment

**Configuration Variables:**
```bash
EXPERIMENT_NAME="full_50gen"
NUM_GENERATIONS=50
MAX_PARALLEL_JOBS=4
META_INTERVAL=10

LLM_MODELS="native-gemini-2.5-flash native-gemini-2.5-pro"
LLM_SELECTION="ucb1"
LLM_TEMPERATURES="0.5 0.7 1.0"

USE_WANDB="--use-wandb"
WANDB_PROJECT="shinkaevolve-experiments"
WANDB_TAGS="full-experiment eval-service circle-packing"
```

**Expected time:** ~30-60 minutes

---

### 3b. `3b_ablation_no_eval_service.sh`
**Purpose:** Baseline experiment without eval service

**Key Difference:** `USE_EVAL_SERVICE` is NOT set

**Use Case:** Ablation study to compare performance with/without eval service

---

### 4. `4_check_results.sh`
**Purpose:** Analyze experiment results

**Shows:**
- Best program score and validation
- Eval agent memory contents
- Auxiliary metrics statistics
- Database statistics
- WandB run links

**Usage:**
```bash
# Check most recent results
bash scripts/dev/4_check_results.sh

# Check specific directory
bash scripts/dev/4_check_results.sh examples/circle_packing/results/full_50gen_20240203_120000
```
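
How `4_check_results.sh` selects the most recent results is not documented here. One plausible approach, sketched below, is to pick the directory with the newest `{EXPERIMENT_NAME}_{TIMESTAMP}` suffix. The `latest_results_dir` helper is hypothetical, and the `YYYYMMDD_HHMMSS` timestamp format is inferred from the example path above.

```python
import re
from pathlib import Path
from typing import Optional

# Matches names such as full_50gen_20240203_120000 (name + YYYYMMDD_HHMMSS).
_STAMP = re.compile(r"_(\d{8}_\d{6})$")


def latest_results_dir(root: Path) -> Optional[Path]:
    """Return the run directory under `root` with the newest timestamp suffix."""
    if not root.is_dir():
        return None
    stamped = []
    for child in root.iterdir():
        match = _STAMP.search(child.name)
        if child.is_dir() and match:
            stamped.append((match.group(1), child))
    if not stamped:
        return None
    # Fixed-width timestamps sort correctly as plain strings.
    return max(stamped, key=lambda pair: pair[0])[1]


# Prints None when the directory does not exist or holds no timestamped runs.
print(latest_results_dir(Path("examples/circle_packing/results")))
```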

---

### 5. `5_cleanup.sh`
**Purpose:** Clean up temporary test files

**Usage:**
```bash
bash scripts/dev/5_cleanup.sh
```

---

## 🎛️ Hyperparameter Guide

### Common Parameters

| Parameter | Quick Test | Full Test | Description |
|-----------|-----------|-----------|-------------|
| `NUM_GENERATIONS` | 3 | 50 | Total generations to evolve |
| `MAX_PARALLEL_JOBS` | 2 | 4 | Concurrent evaluation jobs |
| `META_INTERVAL` | 2 | 10 | Meta-summarizer frequency |
| `LLM_MODELS` | native-gemini-2.5-flash native-gemini-2.5-pro | native-gemini-2.5-flash native-gemini-2.5-pro | Space-separated LLM models to use |
| `LLM_SELECTION` | ucb1 | ucb1 | Dynamic LLM selection strategy |
| `LLM_TEMPERATURES` | 0.5 0.7 1.0 | 0.5 0.7 1.0 | LLM sampling temperatures |

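To make the `ucb1` selection strategy concrete, here is a minimal sketch of the textbook UCB1 rule applied to choosing a model. It is not ShinkaEvolve's implementation: the `ucb1_pick` function, the reward bookkeeping, and the assumption that rewards lie in [0, 1] are all illustrative.

```python
import math
from typing import Dict, List


def ucb1_pick(counts: Dict[str, int], rewards: Dict[str, float], models: List[str]) -> str:
    """Pick a model by UCB1: mean reward plus an exploration bonus."""
    # Try every model once before comparing scores.
    for model in models:
        if counts.get(model, 0) == 0:
            return model
    total = sum(counts[m] for m in models)

    def score(model: str) -> float:
        mean = rewards[model] / counts[model]
        # The bonus shrinks as a model is used more often.
        bonus = math.sqrt(2.0 * math.log(total) / counts[model])
        return mean + bonus

    return max(models, key=score)


models = ["native-gemini-2.5-flash", "native-gemini-2.5-pro"]
counts = {"native-gemini-2.5-flash": 10, "native-gemini-2.5-pro": 2}
rewards = {"native-gemini-2.5-flash": 6.0, "native-gemini-2.5-pro": 1.0}
# The rarely used model wins here because its exploration bonus is larger.
print(ucb1_pick(counts, rewards, models))  # prints: native-gemini-2.5-pro
```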
### Eval Service Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `USE_EVAL_SERVICE` | enabled | Enable eval service |
| `EVAL_SERVICE_URL` | http://localhost:8765 | Service URL |

### WandB Parameters

| Parameter | Quick Test | Full Test | Description |
|-----------|-----------|-----------|-------------|
| `USE_WANDB` | enabled | enabled | Enable WandB logging |
| `WANDB_PROJECT` | shinkaevolve-dev | shinkaevolve-experiments | WandB project |
| `WANDB_TAGS` | quick-test eval-service | full-experiment eval-service circle-packing | Space-separated tags |

---

## 🔧 Customization Examples

### Example 1: Change Models

```bash
# In 3_test_full.sh
LLM_MODELS="gpt-4o claude-3-5-sonnet-20241022"
LLM_SELECTION="thompson"  # or "ucb1", "epsilon_greedy"
LLM_TEMPERATURES="0.5 0.7"
```

### Example 2: Change WandB Project

```bash
# In 2_test_quick.sh
WANDB_PROJECT="my-research-project"
WANDB_TAGS="my-tag another-tag"
```

### Example 3: Change Agent Trigger Frequency

```yaml
# In eval_agent/ev2_service_config.yaml
strategy:
  trigger_interval: 10  # Change from 5 to 10
```

### Example 4: Run More Generations

```bash
# In 3_test_full.sh
NUM_GENERATIONS=100
```

### Example 5: Disable WandB

```bash
# In 2_test_quick.sh
USE_WANDB=""  # Comment out or set empty
```

---

## 📊 WandB Integration

### What Gets Logged

1. **Metrics:**
   - Combined score per generation
   - Best score over time
   - Correct/incorrect programs
   - Auxiliary metrics (if eval service enabled)

2. **System Info:**
   - Hyperparameters
   - Model configuration
   - Eval service status

3. **Artifacts:**
   - Best program code
   - Evolution database
   - Agent-generated metrics

### Viewing Results

After running an experiment:
```bash
# Get WandB URL from terminal output
# Or visit: https://wandb.ai/YOUR_ENTITY/YOUR_PROJECT
```

### Comparing Runs

WandB automatically tracks all runs in the same project, allowing easy comparison:
- Baseline vs. Eval Service
- Different hyperparameters
- Different models

---

## 🔍 Verification Checklist

After running an experiment, check:

- [ ] **Eval Service Running** (for eval service experiments)
  ```bash
  curl http://localhost:8765/api/v1/status | jq
  ```

- [ ] **Experiment Completed**
  ```bash
  bash scripts/dev/4_check_results.sh
  ```

- [ ] **Best Program Valid**
  ```bash
  # Replace RESULTS_DIR with the path to the run's results directory
  cat RESULTS_DIR/best/results/correct.json
  # Should show "correct": true
  ```

- [ ] **Auxiliary Metrics Present** (for eval service experiments)
  ```bash
  jq '.auxiliary' RESULTS_DIR/gen_20/results/metrics.json
  # Should show metrics after agent triggers
  ```

- [ ] **WandB Run Logged**
  - Check WandB dashboard
  - Verify metrics are being logged

- [ ] **Agent Documentation Generated** (for eval service experiments)
  ```bash
  cat RESULTS_DIR/eval_agent_memory/EVAL_AGENTS.md | head -50
  ```
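
The first check in the list can also be done from Python, which is convenient inside a test harness. This is a minimal sketch using only the standard library. The `/api/v1/status` path and port come from the `curl` command above, while the `eval_service_status` helper and the assumption that the endpoint returns JSON are illustrative.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def eval_service_status(base_url: str = "http://localhost:8765",
                        timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed status JSON, or None if the service is unreachable."""
    url = f"{base_url}/api/v1/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


status = eval_service_status()
print("eval service reachable" if status is not None else "eval service not reachable")
```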

---

## 🐛 Troubleshooting

### Error: "Eval service not running"
**Solution:**
```bash
bash scripts/dev/start_eval_server.sh
```

### Error: "wandb not found"
**Solution:**
```bash
pip install wandb
wandb login
```

### Error: "Port 8765 already in use"
**Solution:**
```bash
lsof -ti:8765 | xargs kill -9
```
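
Before killing a process, it can help to confirm that something is actually listening on the port. The sketch below is one way to do that from Python. The `port_in_use` helper is hypothetical and only tests whether a TCP connection to the local port succeeds.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


print("port 8765 in use:", port_in_use(8765))
```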

### WandB not logging
**Solution:**
```bash
# Re-login to WandB
wandb login

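# Check that USE_WANDB is set in the wrapper script you ran
# (it is a script variable, so `echo $USE_WANDB` in your own shell will not show it)
grep -n "USE_WANDB" scripts/dev/2_test_quick.sh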
```

---

## 📁 Results Structure

```
examples/circle_packing/results/{EXPERIMENT_NAME}_{TIMESTAMP}/
├── evolution_db.sqlite                 # Evolution database
├── evolution_run.log                   # Detailed logs
├── experiment_config.yaml              # Configuration backup
├── gen_0/
│   ├── main.py                        # Generated code
│   └── results/
│       ├── metrics.json               # All metrics (primary + auxiliary)
│       └── correct.json               # Validation status
├── gen_1/ ... gen_N/
├── best/                              # Best program (symlink)
│   ├── main.py
│   └── results/
│       └── metrics.json
└── eval_agent_memory/                 # Agent workspace (if eval service used)
    ├── EVAL_AGENTS.md                 # Metric documentation
    ├── auxiliary_metrics.py           # Generated metrics code
    └── service_state.json             # Service state
```
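
Given this layout, a short script can summarize which generations produced a valid program. The sketch below assumes that each `correct.json` holds a JSON object with a `"correct"` field, as the verification checklist suggests. The `correctness_by_generation` helper is hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def correctness_by_generation(run_dir: Path) -> Dict[int, Optional[bool]]:
    """Map generation index to the "correct" value in gen_N/results/correct.json."""
    summary: Dict[int, Optional[bool]] = {}
    for gen_dir in run_dir.glob("gen_*"):
        suffix = gen_dir.name[len("gen_"):]
        if not (gen_dir.is_dir() and suffix.isdigit()):
            continue
        correct_file = gen_dir / "results" / "correct.json"
        try:
            payload = json.loads(correct_file.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Missing or unreadable file: record the generation as unknown.
            payload = None
        value = payload.get("correct") if isinstance(payload, dict) else None
        summary[int(suffix)] = value
    # Return generations in numeric order (gen_2 before gen_10).
    return dict(sorted(summary.items()))
```

Calling `correctness_by_generation(Path("examples/circle_packing/results/<run>"))` returns a dictionary such as `{0: True, 1: False}`, with `None` for generations whose validation file is missing.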

---

## 💡 Best Practices

1. **Always start with the quick test** to validate setup
2. **Use WandB tags** for easy filtering of experiments
3. **Run ablations** to demonstrate eval service impact
4. **Check eval_agent_memory** to see what metrics were generated
5. **Compare WandB runs** side-by-side for insights
6. **Save important results** before cleanup

---

## 🎯 Experiment Workflow

1. **Start eval service** (Terminal 1)
   ```bash
   bash scripts/dev/start_eval_server.sh
   ```

2. **Run quick test** to validate (Terminal 2)
   ```bash
   bash scripts/dev/2_test_quick.sh
   ```

3. **Check results**
   ```bash
   bash scripts/dev/4_check_results.sh
   ```

4. **Run full experiment** if test passes
   ```bash
   bash scripts/dev/3_test_full.sh
   ```

5. **Compare with baseline** (ablation)
   ```bash
   bash scripts/dev/3b_ablation_no_eval_service.sh
   ```

6. **Analyze on WandB**
   - Compare runs
   - Export plots
   - Share results

---

## 📈 Expected Results

### Quick Test (3 generations)
- ✅ Completes in ~2-5 minutes
- ✅ WandB run created
- ✅ Metrics logged per generation
- ⚠️ Agent likely not triggered (3 generations is below the trigger interval of 5 set in `start_eval_server.sh`)

### Full Test (50 generations)
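- ✅ Completes in ~30-60 minutes
- ✅ Agent triggers at every multiple of the trigger interval: 5 times (gen 10, 20, 30, 40, 50) with an interval of 10, or 10 times with the `TRIGGER_INTERVAL=5` set in `start_eval_server.sh`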
- ✅ Auxiliary metrics appear in later generations
- ✅ Metric descriptions in EVAL_AGENTS.md
- ✅ Complete WandB run with all metrics

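The trigger counts above follow from simple arithmetic on the periodic trigger interval. The sketch below assumes that generations are counted from 1 and that the agent fires whenever the generation count is a multiple of the interval. That matches the "gen 10, 20, 30, 40, 50" pattern for an interval of 10, but the service's actual trigger logic may differ. The `trigger_generations` helper is hypothetical.

```python
from typing import List


def trigger_generations(num_generations: int, interval: int) -> List[int]:
    """Generations (counted from 1) at which a periodic trigger would fire."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return [g for g in range(1, num_generations + 1) if g % interval == 0]


print(trigger_generations(3, 5))         # prints: []
print(trigger_generations(50, 10))       # prints: [10, 20, 30, 40, 50]
print(len(trigger_generations(50, 5)))   # prints: 10
```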
---

**Need help?** Check the main documentation or run with `--help`:
```bash
python scripts/dev/run_experiment.py --help
```

---

## Frontier-CS Algorithmic Experiments

Parallel scripts for running Frontier-CS competitive programming problems with evolution.

### Scripts

| Script | Description |
|--------|-------------|
| `run_frontier_cs_parallel_vanilla_server.sh` | Vanilla baseline via eval service (agent never triggers) |
| `run_frontier_cs_parallel_with_agent.sh` | With eval agent (triggers every 5 generations) |
| `run_frontier_cs.sh` | Single problem, manual run |

### Usage

```bash
# Vanilla baseline - all 172 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh

# Vanilla baseline - first 50 problems, 20 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50

# Vanilla baseline - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_vanilla_server.sh 50 10

# With eval agent - all problems
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh

# With eval agent - first 50 problems, 10 parallel
bash scripts/dev/run_frontier_cs_parallel_with_agent.sh 50 10
```

### Comparing Results

```bash
# Compare two runs (new layout)
python tasks/frontier_cs_entry/compare_experiments.py \
    results/frontier_cs_algorithmic/vanilla_g50_20260327_120000 \
    results/frontier_cs_algorithmic/agent_g50_20260327_130000

# Sort by score difference
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --sort diff

# Export to CSV
python tasks/frontier_cs_entry/compare_experiments.py dir_a dir_b --csv results/comparison.csv
```

### Results Directory Structure

```
results/frontier_cs_algorithmic/
  vanilla_g50_20260327_120000/    # one run
    p0/                           # per-problem results
      evolution_db.sqlite
      gen_0/ gen_1/ ... gen_49/
    p1/
    ...
  agent_g50_20260327_130000/      # another run
    p0/
    p1/
    ...
```
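
Because each run stores one `pN/` directory per problem, two runs can only be compared on the problems that both contain. The sketch below lists those shared problem indices. It is not how `compare_experiments.py` works internally, and the `problem_ids` and `common_problem_ids` helpers are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Set

# Per-problem directories are named p0, p1, ... in the layout above.
_PROBLEM_DIR = re.compile(r"^p(\d+)$")


def problem_ids(run_dir: Path) -> Set[int]:
    """Collect the numeric ids of all pN/ directories in one run."""
    ids: Set[int] = set()
    if not run_dir.is_dir():
        return ids
    for child in run_dir.iterdir():
        match = _PROBLEM_DIR.match(child.name)
        if child.is_dir() and match:
            ids.add(int(match.group(1)))
    return ids


def common_problem_ids(run_a: Path, run_b: Path) -> List[int]:
    """Problem ids present in both runs, in numeric order."""
    return sorted(problem_ids(run_a) & problem_ids(run_b))
```

For the two example runs above, `common_problem_ids(Path("results/frontier_cs_algorithmic/vanilla_g50_20260327_120000"), Path("results/frontier_cs_algorithmic/agent_g50_20260327_130000"))` would return the indices that appear under both directories.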

### Prerequisites

- Docker running with go-judge service on port 8081
- `tasks/Frontier-CS/` checked out with problems and solutions