# 🧪 Auxiliary Metrics Ablation Study Guide
## Experimental Design: 2x2 Factorial Experiment
### Full Experiment Matrix
| Group | Vision | Auxiliary | Script | Purpose |
|--------|--------|-----------|----------|------|
| **Baseline** | ❌ | ❌ | `run_circle_packing_WITHOUT_vision.py` | Baseline |
| **Aux Only** | ❌ | ✅ | `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` | **Key comparison** |
| **Vision Only** | ✅ | ❌ | `run_circle_packing_WITH_vision.py` | Effect of vision |
| **Both** | ✅ | ✅ | (to be created) | Best combination |
---
## 🎯 Key Comparison: Baseline vs Aux Only
This is the **most important comparison** because it is a **clean ablation**:
```
Baseline: NO vision + NO auxiliary
Aux Only: NO vision + WITH auxiliary
Only difference: auxiliary metrics
```
**If Aux Only > Baseline, that is evidence that the auxiliary metrics help!**
---
## 📊 Experiment Configuration Comparison
### Shared Settings (to ensure a fair comparison)
```python
# Identical in both experiments:
num_generations = 200
max_parallel_jobs = 4
num_islands = 2
archive_size = 40
llm_models = ["native-gemini-2.5-flash", "native-gemini-2.5-pro"]
temperatures = [0.5, 0.7, 1.0]
# ... all other hyperparameters
```
### Differences (the only variable)
#### Baseline (WITHOUT auxiliary)
```python
job_config = LocalJobConfig(
    eval_program_path="examples/circle_packing/evaluate.py"  # Ground truth only
)
# What the LLM sees:
#   Combined score: 2.456
#   centers_str: (0.123, 0.456), ...
```
#### Aux Only (WITH auxiliary)
```python
job_config = LocalJobConfig(
    eval_program_path="examples/circle_packing/evaluate_with_auxiliary.py"  # + Auxiliary
)
# What the LLM sees:
#   Combined score: 2.456
#   aux_spatial_uniformity: 0.752
#   aux_edge_utilization: 0.681
#   aux_density_variance: 0.694
#   aux_packing_efficiency: 0.734
#   aux_gap_analysis: 0.812
#   aux_geometric_quality: 0.778
#   💡 Recommendations:
#   1. Only 3/4 corners utilized. Place larger circles at unused corners.
#   2. Detected 18.8% unused space. Consider increasing radii in sparse regions.
```
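One way to guard the single-variable design is to diff the two run configurations before launching anything. The sketch below assumes the settings can be collected into plain dicts. `differing_keys` is a hypothetical helper, not part of ShinkaEvolve, and the dict layout is illustrative only.

```python
def differing_keys(config_a: dict, config_b: dict) -> set:
    """Return the keys whose values differ, or that exist in only one config."""
    missing = object()  # sentinel so a key absent on one side counts as a difference
    all_keys = set(config_a) | set(config_b)
    return {k for k in all_keys if config_a.get(k, missing) != config_b.get(k, missing)}


# Shared hyperparameters, copied from the section above
shared = {
    "num_generations": 200,
    "max_parallel_jobs": 4,
    "num_islands": 2,
    "archive_size": 40,
    "llm_models": ["native-gemini-2.5-flash", "native-gemini-2.5-pro"],
    "temperatures": [0.5, 0.7, 1.0],
}
baseline_cfg = {**shared, "eval_program_path": "examples/circle_packing/evaluate.py"}
aux_cfg = {**shared, "eval_program_path": "examples/circle_packing/evaluate_with_auxiliary.py"}

diff = differing_keys(baseline_cfg, aux_cfg)
print(diff)  # should contain only the evaluator path
```

If the printed set contains anything besides `eval_program_path`, the comparison is no longer a clean ablation.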
---
## 🚀 Running the Experiments
### Step 1: Run the Baseline (if you haven't already)
```bash
cd /home/tengxiao/pj/ShinkaEvolve
source .venv/bin/activate
# Run the baseline
python my/run_circle_packing_WITHOUT_vision.py
```
**Expected time**: a few hours to a few days, depending on your settings
### Step 2: Run Aux Only
```bash
# Run the auxiliary metrics version
python my/run_circle_packing_WITHOUT_vision_WITH_auxiliary.py
```
**Expected time**: same as the baseline (the auxiliary computation is fast)
### Step 3: Compare Results
```bash
# Inspect the results of both experiments
ls -lh examples/circle_packing/results/
```
---
## 📈 Evaluation Metrics
### Primary Metrics
1. **Final best score**
```bash
# Baseline
cat examples/circle_packing/results/results_circle_packing_WITHOUT_vision_*/best/results/metrics.json | grep combined_score
# Aux Only
cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_aux_*/best/results/metrics.json | grep combined_score
```
2. **Convergence speed**
- Track the best score of each generation
- Plot the learning curves
- See which run reaches a high score sooner
3. **Final ranking**
```python
# Query the best program from each database
from shinka.database import ProgramDatabase
db_baseline = ProgramDatabase(config=..., db_path="baseline.sqlite")
db_aux = ProgramDatabase(config=..., db_path="aux.sqlite")
best_baseline = db_baseline.get_top_programs(n=1)[0]
best_aux = db_aux.get_top_programs(n=1)[0]
print(f"Baseline best: {best_baseline.combined_score:.4f}")
print(f"Aux best: {best_aux.combined_score:.4f}")
print(f"Improvement: {(best_aux.combined_score - best_baseline.combined_score):.4f}")
```
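Convergence speed can be made concrete as "the first generation at which the best score reaches a chosen target". The sketch below assumes each curve is a list of per-generation best scores indexed from generation 0. `generations_to_reach` is a hypothetical helper and the scores are made-up illustrations, not real results.

```python
from typing import Optional, Sequence


def generations_to_reach(best_per_gen: Sequence[float], target: float) -> Optional[int]:
    """Return the first generation whose best score is >= target, or None if never reached."""
    for gen, score in enumerate(best_per_gen):
        if score >= target:
            return gen
    return None


# Made-up illustrative curves, not measured data
baseline_curve = [2.10, 2.20, 2.30, 2.38, 2.42, 2.45]
aux_curve = [2.10, 2.28, 2.40, 2.46, 2.50, 2.55]

target = 2.40
print(generations_to_reach(baseline_curve, target))  # 4
print(generations_to_reach(aux_curve, target))       # 2
```

Fixing the target before looking at the results avoids picking a threshold that flatters one run.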
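### Secondary Metrics
1. **Diversity**
- Diversity of the programs in the archive
- Whether a wider range of strategies was explored
2. **Stability**
- Variance of the scores
- Whether progress is steadier
3. **Correlation of the auxiliary metrics** (Aux Only)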
```python
# Analyze how the auxiliary metrics correlate with the primary score
import pandas as pd
import matplotlib.pyplot as plt
# Load the metrics from every generation
# Draw scatter plots
# See which auxiliary metrics are most predictive
```
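The correlation step outlined above could look like the following dependency-free sketch. It assumes each evaluated program has been flattened into a dict holding `combined_score` plus its `aux_*` values. That record layout, `pearson`, and `rank_aux_metrics` are assumptions made for illustration, and the numbers are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; reported as 0.0 here
    return cov / math.sqrt(var_x * var_y)


def rank_aux_metrics(records: List[Dict[str, float]],
                     target: str = "combined_score") -> List[Tuple[str, float]]:
    """Rank every aux_* key by the absolute correlation with the target score."""
    aux_keys = sorted(k for k in records[0] if k.startswith("aux_"))
    scores = [r[target] for r in records]
    ranked = [(k, pearson([r[k] for r in records], scores)) for k in aux_keys]
    return sorted(ranked, key=lambda kv: abs(kv[1]), reverse=True)


# Invented records for illustration only
records = [
    {"combined_score": 1.0, "aux_a": 0.1, "aux_b": 0.9},
    {"combined_score": 2.0, "aux_a": 0.2, "aux_b": 0.5},
    {"combined_score": 3.0, "aux_a": 0.3, "aux_b": 0.7},
]
for name, r in rank_aux_metrics(records):
    print(f"{name}: r = {r:+.3f}")
```

A high correlation only shows that a metric tracks the primary score. It does not show that the LLM actually used it.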
---
## 📊 Expected Results
### If the Auxiliary Metrics Work
**Expected observations**:
```
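Baseline: best score = 2.45
Aux Only: best score = 2.55 ✅ ~4% improvement
Convergence curves:
  Baseline: slower, plateaus earlier
  Aux Only: faster, keeps improving
LLM behavior:
  Baseline: undirected exploration
  Aux Only: targeted improvements (e.g. "improve edge_utilization")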
```
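### If the Effect Is Not Clear
**Possible causes**:
1. The auxiliary metrics do not correlate with the primary score
2. The LLM is not making effective use of the auxiliary information
3. The metric weights or the feedback format need tuning
**Next steps**:
- Analyze which auxiliary metrics are most useful
- Reword the text feedback
- Consider a stronger auxiliary signal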
---
## 🔍 Detailed Analysis Scripts
### Comparing the Best Solutions
```python
import json
from pathlib import Path

# Load the best result from each experiment
baseline_metrics = json.loads(Path("results_baseline/best/results/metrics.json").read_text())
aux_metrics = json.loads(Path("results_aux/best/results/metrics.json").read_text())

print("=" * 60)
print("COMPARISON: Baseline vs Aux Only")
print("=" * 60)
print("\nPrimary Score:")
print(f"  Baseline: {baseline_metrics['combined_score']:.4f}")
print(f"  Aux Only: {aux_metrics['combined_score']:.4f}")
print(f"  Δ: {aux_metrics['combined_score'] - baseline_metrics['combined_score']:.4f}")

# Auxiliary values are only present in the Aux Only run
if 'public' in aux_metrics:
    print("\nAuxiliary Metrics (Aux Only):")
    for key, value in aux_metrics['public'].items():
        if key.startswith('aux_'):
            print(f"  {key}: {value:.3f}" if isinstance(value, float) else f"  {key}: {value}")
```
### Plotting Learning Curves
```python
import matplotlib.pyplot as plt
import sqlite3
def get_best_scores_per_gen(db_path):
    """Return (generations, best score in each generation) for correct programs."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT generation, MAX(combined_score) as best_score
        FROM programs
        WHERE correct = 1
        GROUP BY generation
        ORDER BY generation
    """)
    data = cursor.fetchall()
    conn.close()
    return [row[0] for row in data], [row[1] for row in data]

# Load the data
gens_baseline, scores_baseline = get_best_scores_per_gen("baseline.sqlite")
gens_aux, scores_aux = get_best_scores_per_gen("aux.sqlite")

# Plot
plt.figure(figsize=(12, 6))
plt.plot(gens_baseline, scores_baseline, label="Baseline (No Aux)", marker='o', alpha=0.7)
plt.plot(gens_aux, scores_aux, label="Aux Only", marker='s', alpha=0.7)
plt.xlabel("Generation")
plt.ylabel("Best Combined Score")
plt.title("Learning Curves: Baseline vs Auxiliary Metrics")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("learning_curves_comparison.png", dpi=150)
print("Saved: learning_curves_comparison.png")
```
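The query above returns the best score within each generation, which can go down from one generation to the next. If a monotone "best so far" curve is preferred, a running maximum can be applied before plotting. This is a minimal sketch, and `running_best` is a hypothetical helper name.

```python
from itertools import accumulate
from typing import List, Sequence


def running_best(scores: Sequence[float]) -> List[float]:
    """Convert per-generation best scores into a best-so-far curve."""
    return list(accumulate(scores, max))


# Made-up per-generation scores for illustration
per_generation = [2.1, 2.0, 2.3, 2.2]
print(running_best(per_generation))  # [2.1, 2.1, 2.3, 2.3]
```

Passing `running_best(scores_baseline)` and `running_best(scores_aux)` to `plt.plot` instead of the raw lists would give curves that never decrease.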
---
## 🎯 Success Criteria
### Minimum Success Criteria
- [ ] Aux Only best score > Baseline best score
- [ ] Statistical significance (p < 0.05, if the runs are repeated several times)
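With only a handful of repeated runs per condition, a permutation test is one reasonable way to obtain the p-value, since it makes no normality assumption. The sketch below is one possible approach rather than a prescribed procedure. `permutation_p_value` is a hypothetical helper and the scores are invented.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(baseline: Sequence[float], treatment: Sequence[float],
                        n_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided Monte Carlo permutation test for mean(treatment) > mean(baseline)."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(baseline)
    pooled = list(baseline) + list(treatment)
    n_treat = len(treatment)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of the runs
        diff = mean(pooled[:n_treat]) - mean(pooled[n_treat:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_resamples + 1)


# Invented best scores from five repeated runs per condition
baseline_runs = [2.40, 2.42, 2.44, 2.45, 2.43]
aux_runs = [2.52, 2.55, 2.50, 2.54, 2.53]
p = permutation_p_value(baseline_runs, aux_runs)
print(f"p = {p:.4f}")
```

With a single run per condition no p-value can be computed, so this criterion only applies once the experiment has been repeated.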
### Ideal Success Criteria
- [ ] Aux Only improvement > 5%
- [ ] Convergence speed improvement > 20%
- [ ] The auxiliary metrics correlate clearly with the primary score
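The percentage thresholds above can be checked with a couple of one-line helpers. The guide does not define "convergence speed improvement", so the sketch assumes it means the relative reduction in generations needed to reach a fixed target score. Both function names are hypothetical.

```python
def relative_gain_pct(baseline: float, candidate: float) -> float:
    """Score improvement of candidate over baseline, in percent of the baseline."""
    return 100.0 * (candidate - baseline) / baseline


def speedup_pct(baseline_gens: int, candidate_gens: int) -> float:
    """Assumed definition: percent fewer generations needed to hit the same target."""
    return 100.0 * (baseline_gens - candidate_gens) / baseline_gens


# The illustrative 2.45 -> 2.55 scores from the expected-results section
print(f"{relative_gain_pct(2.45, 2.55):.2f}%")  # 4.08%
# Hypothetical generation counts
print(f"{speedup_pct(50, 40):.1f}%")            # 20.0%
```

Note that the illustrative 2.45 to 2.55 improvement is about 4%, which meets the minimum criterion but not the 5% ideal.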
### Additional Insights
- [ ] Identify the most useful auxiliary metrics
- [ ] Understand how the LLM uses the auxiliary information
- [ ] Validate the effect of programmatic gap detection
---
## 📝 Experiment Log Template
```markdown
# Experiment Log
## Baseline (WITHOUT vision, WITHOUT aux)
- Start: YYYY-MM-DD HH:MM
- End: YYYY-MM-DD HH:MM
- Best Score: X.XXXX
- Notes: ...
## Aux Only (WITHOUT vision, WITH aux)
- Start: YYYY-MM-DD HH:MM
- End: YYYY-MM-DD HH:MM
- Best Score: X.XXXX
- Improvement over Baseline: +X.XXXX (+X.X%)
- Notes: ...
## Key Observations
1. ...
2. ...
## Auxiliary Metrics Analysis
- Most useful metrics: ...
- Correlations: ...
- LLM behavior changes: ...
## Conclusions
- Effect of auxiliary metrics: [effective / ineffective / partially effective]
- Next steps: ...
```
---
## 🔮 Follow-Up Experiments (if Aux Works)
### Phase 2: Complete the 2x2 Matrix
```bash
# 1. WITH vision + WITHOUT aux (already exists)
python my/run_circle_packing_WITH_vision.py
# 2. WITH vision + WITH aux (to be created)
# Create this variant to test the combined effect of vision + auxiliary
```
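### Phase 3: Parameter Tuning
- Adjust the auxiliary metric weights
- Improve the text feedback format
- Try different metric combinations
### Phase 4: LLM-Generated Metrics
- Let the LLM propose new auxiliary metrics
- Automatically filter for the useful ones
- Co-evolution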
---
## 💡 Pro Tips
### 1. Validate with a Short Run First
```python
# Set num_generations = 20 for a quick smoke test
num_generations = 20 # Instead of 200
```
**Purpose**: quickly verify that the system works end to end
### 2. Monitor Progress
```bash
# Watch the latest generation's score in real time
watch -n 60 'tail -20 examples/circle_packing/results/results_*/evolution_run.log | grep "best program"'
```
### 3. Mid-Run Check
```bash
# Check the trend after 50 generations
python -c "
from shinka.database import ProgramDatabase, DatabaseConfig
db = ProgramDatabase(config=DatabaseConfig(...), db_path='...')
db.print_summary()
"
```
### 4. Save Checkpoints
```bash
# Back up the database periodically
cp evolution_db.sqlite evolution_db_backup_gen50.sqlite
```
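Copying a SQLite file with `cp` while a run is still writing to it can capture a half-written state. As an alternative, Python's standard `sqlite3` module exposes SQLite's online backup API through `Connection.backup`. The sketch below demonstrates it on a throwaway database. The `backup_sqlite` helper and the table contents are illustrative assumptions, with file names borrowed from the command above.

```python
import os
import sqlite3
import tempfile


def backup_sqlite(src_path: str, dst_path: str) -> None:
    """Copy a SQLite database with the online backup API (safe while src is in use)."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


# Demo on a temporary database so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "evolution_db.sqlite")
    dst_path = os.path.join(tmp, "evolution_db_backup_gen50.sqlite")

    conn = sqlite3.connect(src_path)
    conn.execute("CREATE TABLE programs (generation INTEGER, combined_score REAL)")
    conn.execute("INSERT INTO programs VALUES (50, 2.45)")  # made-up row
    conn.commit()
    conn.close()

    backup_sqlite(src_path, dst_path)

    check = sqlite3.connect(dst_path)
    restored_rows = check.execute("SELECT generation, combined_score FROM programs").fetchall()
    check.close()

print(restored_rows)
```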
---
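## ✅ Checklist
### Before Starting
- [ ] Baseline script exists
- [ ] Aux script was created successfully
- [ ] Auxiliary eval system passes its tests
- [ ] Enough disk space (~1GB per run)
- [ ] Enough time available (several hours or more)
### While Running
- [ ] Baseline started
- [ ] Aux Only started (in parallel or sequentially)
- [ ] Logs confirm the runs are healthy
- [ ] `auxiliary_analysis.json` is generated correctly (Aux Only)
### After Completion
- [ ] Both experiments finished successfully
- [ ] Best scores collected
- [ ] Learning curves plotted
- [ ] Auxiliary metric correlations analyzed
- [ ] Experiment log written
- [ ] Conclusions drawn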
---
## 📚 Related Files
- `run_circle_packing_WITHOUT_vision.py` - Baseline
- `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` - Aux Only
- `examples/circle_packing/auxiliary_eval.py` - Auxiliary metrics implementation
- `examples/circle_packing/evaluate_with_auxiliary.py` - Integrated evaluator
- `AUXILIARY_EVAL_README.md` - Full documentation
---
**Good luck with your ablation study! 🚀**
This is a very clean experimental design, and it should show clearly whether auxiliary metrics add value.