# 🧪 Auxiliary Metrics Ablation Study Guide
## Experimental Design: 2x2 Factorial Experiment
### Full Experiment Matrix
| Condition | Vision | Auxiliary | Script | Purpose |
|-----------|--------|-----------|--------|---------|
| **Baseline** | ❌ | ❌ | `run_circle_packing_WITHOUT_vision.py` | Reference point |
| **Aux Only** | ❌ | ✅ | `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` | **Key comparison** |
| **Vision Only** | ✅ | ❌ | `run_circle_packing_WITH_vision.py` | Effect of vision |
| **Both** | ✅ | ✅ | (to be created) | Expected best combination |
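
Once all four cells of the matrix have a score, the 2x2 design lets you separate the effect of each factor from their interaction. The sketch below shows the standard arithmetic, assuming one summary score per cell; the function name `factorial_effects` and the numbers passed to it are placeholders for illustration, not measured results.

```python
def factorial_effects(baseline, aux_only, vision_only, both):
    """Main effects and interaction for a 2x2 design with one score per cell."""
    # Main effect of auxiliary metrics: gain from switching aux on,
    # averaged over the vision-off and vision-on rows.
    aux_effect = ((aux_only - baseline) + (both - vision_only)) / 2
    # Main effect of vision, averaged over aux off and aux on.
    vision_effect = ((vision_only - baseline) + (both - aux_only)) / 2
    # Interaction: how much the aux gain changes once vision is also enabled.
    interaction = (both - vision_only) - (aux_only - baseline)
    return {"aux": aux_effect, "vision": vision_effect, "interaction": interaction}


# Placeholder scores only; replace with the best score of each condition.
effects = factorial_effects(baseline=2.0, aux_only=2.5, vision_only=2.25, both=3.0)
print(effects)
```

An interaction near zero would suggest the two factors contribute independently; a clearly positive value would suggest they reinforce each other.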
---
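## 🎯 Key Comparison: Baseline vs Aux Only
This is the **most important comparison** because it is a **clean ablation**: exactly one factor changes.
```
Baseline: NO vision + NO auxiliary
Aux Only: NO vision + WITH auxiliary
Only difference: auxiliary metrics
```
**If Aux Only > Baseline, that is evidence that the auxiliary metrics help.** A single run per condition is only suggestive; repeated runs make the claim stronger.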
---
## 📊 Configuration Comparison
### Shared Settings (to keep the comparison fair)
```python
# Identical in both experiments:
num_generations = 200
max_parallel_jobs = 4
num_islands = 2
archive_size = 40
llm_models = ["native-gemini-2.5-flash", "native-gemini-2.5-pro"]
temperatures = [0.5, 0.7, 1.0]
# ... all other hyperparameters
```
### Differences (the single variable)
#### Baseline (WITHOUT auxiliary)
```python
job_config = LocalJobConfig(
    eval_program_path="examples/circle_packing/evaluate.py"  # Ground truth only
)
# What the LLM sees:
#   Combined score: 2.456
#   centers_str: (0.123, 0.456), ...
```
#### Aux Only (WITH auxiliary)
```python
job_config = LocalJobConfig(
    eval_program_path="examples/circle_packing/evaluate_with_auxiliary.py"  # + Auxiliary
)
# What the LLM sees:
#   Combined score: 2.456
#   aux_spatial_uniformity: 0.752
#   aux_edge_utilization: 0.681
#   aux_density_variance: 0.694
#   aux_packing_efficiency: 0.734
#   aux_gap_analysis: 0.812
#   aux_geometric_quality: 0.778
#   💡 Recommendations:
#   1. Only 3/4 corners utilized. Place larger circles at unused corners.
#   2. Detected 18.8% unused space. Consider increasing radii in sparse regions.
```
---
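## 🚀 Running the Experiments
### Step 1: Run the Baseline (if not already done)
```bash
cd /home/tengxiao/pj/ShinkaEvolve
source .venv/bin/activate
# Run the baseline
python my/run_circle_packing_WITHOUT_vision.py
```
**Expected time**: anywhere from a few hours to a few days, depending on your settings
### Step 2: Run Aux Only
```bash
# Run the auxiliary-metrics version
python my/run_circle_packing_WITHOUT_vision_WITH_auxiliary.py
```
**Expected time**: about the same as the baseline (the auxiliary metrics are cheap to compute)
### Step 3: Compare the Results
```bash
# List the output of both experiments
ls -lh examples/circle_packing/results/
```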
---
## 📈 Evaluation Metrics
### Primary Metrics
1. **Final best score**
```bash
# Baseline
cat examples/circle_packing/results/results_circle_packing_WITHOUT_vision_*/best/results/metrics.json | grep combined_score
# Aux Only
cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_aux_*/best/results/metrics.json | grep combined_score
```
2. **Convergence speed**
- Check the best score at each generation
- Plot the learning curves
- See which condition reaches a high score sooner
3. **Final ranking**
```python
# Query each database for its best program
from shinka.database import ProgramDatabase

db_baseline = ProgramDatabase(config=..., db_path="baseline.sqlite")
db_aux = ProgramDatabase(config=..., db_path="aux.sqlite")
best_baseline = db_baseline.get_top_programs(n=1)[0]
best_aux = db_aux.get_top_programs(n=1)[0]
print(f"Baseline best: {best_baseline.combined_score:.4f}")
print(f"Aux best: {best_aux.combined_score:.4f}")
# Signed difference: positive means Aux Only is ahead
print(f"Improvement: {(best_aux.combined_score - best_baseline.combined_score):+.4f}")
```
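
Convergence speed, listed above as a primary metric, is easier to compare once it is reduced to a number. One possible definition, sketched here as an assumption rather than the project's official metric, is the first generation at which the best score reaches a chosen threshold. The helper names and the example curves are hypothetical.

```python
def generations_to_reach(best_per_gen, threshold):
    """Index of the first generation whose best score is >= threshold, else None."""
    for gen, score in enumerate(best_per_gen):
        if score >= threshold:
            return gen
    return None


def convergence_speedup(baseline_curve, aux_curve, threshold):
    """Fraction of generations saved by the aux run (0.5 means half as many)."""
    g_base = generations_to_reach(baseline_curve, threshold)
    g_aux = generations_to_reach(aux_curve, threshold)
    if g_base is None or g_aux is None or g_base == 0:
        return None  # threshold never reached, or nothing to compare against
    return (g_base - g_aux) / g_base


# Placeholder curves, one best score per generation (not measured data).
baseline_curve = [1.0, 1.5, 2.0, 2.25, 2.5]
aux_curve = [1.0, 2.0, 2.5, 2.5, 2.5]
speedup = convergence_speedup(baseline_curve, aux_curve, threshold=2.5)
print(speedup)
```

Pick the threshold before looking at the results, since the speedup can change sign depending on where it is set.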
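### Secondary Metrics
1. **Diversity**
- Diversity of the programs in the archive
- Whether more distinct strategies were explored
2. **Stability**
- Variance of the scores
- Whether progress is steadier
3. **Correlation of the auxiliary metrics** (Aux Only)
```python
# Analyze how the auxiliary metrics correlate with the primary score
import pandas as pd
import matplotlib.pyplot as plt
# Load the metrics from every generation
# Draw scatter plots
# See which auxiliary metrics are most predictive
```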
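
The correlation analysis outlined above can start from something as small as the following. It is a minimal sketch that uses only the standard library instead of pandas, and it assumes each evaluated program is available as a dict holding `combined_score` plus its `aux_*` values; how those records are loaded from the results directory is left open, and the sample records are placeholders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(vx * vy)


def rank_aux_metrics(records, target="combined_score"):
    """Sort aux_* keys by absolute correlation with the target, strongest first."""
    keys = sorted(k for k in records[0] if k.startswith("aux_"))
    target_vals = [r[target] for r in records]
    corr = {k: pearson([r[k] for r in records], target_vals) for k in keys}
    return sorted(corr.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Placeholder records, not real evaluation output.
records = [
    {"combined_score": 1.0, "aux_edge_utilization": 1.0, "aux_gap_analysis": 3.0},
    {"combined_score": 2.0, "aux_edge_utilization": 2.0, "aux_gap_analysis": 1.0},
    {"combined_score": 3.0, "aux_edge_utilization": 3.0, "aux_gap_analysis": 2.0},
]
ranked = rank_aux_metrics(records)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```

A high correlation only shows that a metric tracks the score; it does not show that the LLM acted on it.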
---
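## 📊 Expected Results
### If the Auxiliary Metrics Help
**Expected observations** (illustrative numbers, not measurements)
```
Baseline: best score = 2.45
Aux Only: best score = 2.55 ✅ ~4% improvement
Convergence curves:
  Baseline: slower, plateaus earlier
  Aux Only: faster, keeps improving
LLM behavior:
  Baseline: undirected exploration
  Aux Only: targeted improvements (e.g. "improve edge_utilization")
```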
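
For reference, the "~4%" figure above is a relative improvement over the baseline score. The helper below is a hypothetical name and simply spells out the arithmetic on the same illustrative numbers.

```python
def relative_improvement(baseline, candidate):
    """Improvement of candidate over baseline, as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (candidate - baseline) / baseline * 100.0


pct = relative_improvement(2.45, 2.55)
print(f"{pct:.2f}%")  # prints 4.08%
```

Note that an improvement of this size would satisfy the minimum success criterion but fall short of the 5% target listed under the ideal criteria.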
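### If the Effect Is Unclear
**Possible causes**
1. The auxiliary metrics do not correlate with the primary score
2. The LLM does not make effective use of the auxiliary information
3. The metric weights or the feedback format need tuning
**Next steps**
- Analyze which auxiliary metrics are most useful
- Rework the wording of the text feedback
- Consider a stronger auxiliary signal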
---
## 🔍 Detailed Analysis Scripts
### Comparing the Best Solutions
```python
import json
from pathlib import Path


def load_metrics(path):
    """Read one metrics.json file and return it as a dict."""
    with Path(path).open() as f:
        return json.load(f)


# Load the best result from each experiment
baseline_metrics = load_metrics("results_baseline/best/results/metrics.json")
aux_metrics = load_metrics("results_aux/best/results/metrics.json")

print("=" * 60)
print("COMPARISON: Baseline vs Aux Only")
print("=" * 60)
print("\nPrimary Score:")
print(f"  Baseline: {baseline_metrics['combined_score']:.4f}")
print(f"  Aux Only: {aux_metrics['combined_score']:.4f}")
print(f"  Δ: {aux_metrics['combined_score'] - baseline_metrics['combined_score']:+.4f}")

# Auxiliary values are only present in the Aux Only run
if 'public' in aux_metrics:
    print("\nAuxiliary Metrics (Aux Only):")
    for key, value in aux_metrics['public'].items():
        if key.startswith('aux_'):
            print(f"  {key}: {value:.3f}" if isinstance(value, float) else f"  {key}: {value}")
```
### Plotting the Learning Curves
```python
import matplotlib.pyplot as plt
import sqlite3


def get_best_scores_per_gen(db_path):
    """Return (generations, best score within each generation) for correct programs."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.cursor()
        cursor.execute("""
            SELECT generation, MAX(combined_score) as best_score
            FROM programs
            WHERE correct = 1
            GROUP BY generation
            ORDER BY generation
        """)
        data = cursor.fetchall()
    finally:
        conn.close()  # release the database even if the query fails
    return [row[0] for row in data], [row[1] for row in data]


# Fetch the data
gens_baseline, scores_baseline = get_best_scores_per_gen("baseline.sqlite")
gens_aux, scores_aux = get_best_scores_per_gen("aux.sqlite")

# Plot both curves on the same axes
plt.figure(figsize=(12, 6))
plt.plot(gens_baseline, scores_baseline, label="Baseline (No Aux)", marker='o', alpha=0.7)
plt.plot(gens_aux, scores_aux, label="Aux Only", marker='s', alpha=0.7)
plt.xlabel("Generation")
plt.ylabel("Best Combined Score")
plt.title("Learning Curves: Baseline vs Auxiliary Metrics")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("learning_curves_comparison.png", dpi=150)
print("Saved: learning_curves_comparison.png")
```
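
The query above returns the best score found *within* each generation, so the plotted line can dip when a later generation happens to be weaker. If you would rather plot the best score found so far, a running maximum is enough. This is an optional sketch; `best_so_far` is a hypothetical helper name.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum: the best score seen up to and including each generation."""
    return list(accumulate(scores, max))


# Example with placeholder per-generation scores.
curve = best_so_far([1.0, 0.5, 2.0, 1.5])
print(curve)  # [1.0, 1.0, 2.0, 2.0]
```

Pass the result to `plt.plot` in place of the raw per-generation scores to get a monotone curve.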
---
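## 🎯 Success Criteria
### Minimum Success Criteria
- [ ] Aux Only best score > Baseline best score
- [ ] Statistically significant (p < 0.05, if the experiment is repeated multiple times)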
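
The significance criterion only makes sense with several independent runs per condition. One way to test it without distributional assumptions is a permutation test on the final best scores, sketched below. The choice of test, the function name, and the example scores are all assumptions made for illustration; the guide itself does not prescribe a specific test.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = random.Random(seed)
    observed = abs(_mean(b) - _mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random reassignment of scores to the two groups
        diff = abs(_mean(pooled[len(a):]) - _mean(pooled[:len(a)]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Placeholder final scores from five repeats per condition (not real data).
baseline_runs = [2.41, 2.44, 2.45, 2.43, 2.46]
aux_runs = [2.52, 2.55, 2.50, 2.56, 2.53]
p_value = permutation_test(baseline_runs, aux_runs)
print(f"p = {p_value:.4f}")
```

With only a handful of repeats the smallest achievable p-value is limited by the number of distinct group assignments, so plan the number of runs accordingly.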
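### Ideal Success Criteria
- [ ] Aux Only improves the best score by > 5%
- [ ] Convergence is > 20% faster
- [ ] Auxiliary metrics correlate clearly with the primary score
### Additional Insights
- [ ] Identify the most useful auxiliary metrics
- [ ] Understand how the LLM uses the auxiliary information
- [ ] Validate the effect of programmatic gap detection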
---
## 📝 Experiment Log Template
```markdown
# Experiment Log
## Baseline (WITHOUT vision, WITHOUT aux)
- Start: YYYY-MM-DD HH:MM
- End: YYYY-MM-DD HH:MM
- Best Score: X.XXXX
- Notes: ...
## Aux Only (WITHOUT vision, WITH aux)
- Start: YYYY-MM-DD HH:MM
- End: YYYY-MM-DD HH:MM
- Best Score: X.XXXX
- Improvement over Baseline: +X.XXXX (+X.X%)
- Notes: ...
## Key Observations
1. ...
2. ...
## Auxiliary Metrics Analysis
- Most useful metrics: ...
- Correlations: ...
- LLM behavior changes: ...
## Conclusions
- Effect of auxiliary metrics: [effective / not effective / partially effective]
- Next steps: ...
```
---
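## 🔮 Follow-up Experiments (if Aux Helps)
### Phase 2: Complete the 2x2 Matrix
```bash
# 1. WITH vision + WITHOUT aux (already exists)
python my/run_circle_packing_WITH_vision.py
# 2. WITH vision + WITH aux (to be created)
# Create this variant to test the combined effect of vision + auxiliary
```
### Phase 3: Parameter Tuning
- Adjust the weights of the auxiliary metrics
- Refine the text feedback format
- Try different combinations of metrics
### Phase 4: LLM-Generated Metrics
- Let the LLM propose new auxiliary metrics
- Automatically filter for the useful ones
- Co-evolution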
---
## 💡 Pro Tips
### 1. Validate with a short run first
```python
# Set num_generations = 20 for a quick test
num_generations = 20 # Instead of 200
```
**Purpose**: quickly confirm that the whole pipeline works
### 2. Monitor progress
```bash
# Watch the latest generation's score, refreshed every 60 seconds
watch -n 60 'tail -20 examples/circle_packing/results/results_*/evolution_run.log | grep "best program"'
```
### 3. Mid-run check
```bash
# Check the trend after 50 generations
python -c "
from shinka.database import ProgramDatabase, DatabaseConfig
db = ProgramDatabase(config=DatabaseConfig(...), db_path='...')
db.print_summary()
"
```
### 4. Save checkpoints
```bash
# Back up the database periodically
cp evolution_db.sqlite evolution_db_backup_gen50.sqlite
```
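
A plain `cp` can capture a half-written file if the run is writing to the database at that moment. As an alternative, Python's `sqlite3` module exposes SQLite's online backup API, which copies a consistent snapshot. The sketch below assumes the evolution database is an ordinary SQLite file; `backup_sqlite` is a hypothetical helper, not part of ShinkaEvolve.

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src_path, dst_path):
    """Snapshot a SQLite database using the online backup API."""
    if not Path(src_path).is_file():
        # sqlite3.connect would silently create an empty file otherwise
        raise FileNotFoundError(src_path)
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # copies every page as one consistent snapshot
    finally:
        dst.close()
        src.close()


# Example call, mirroring the cp command above:
# backup_sqlite("evolution_db.sqlite", "evolution_db_backup_gen50.sqlite")
```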
---
## ✅ Checklist
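### Before Starting
- [ ] Baseline script exists
- [ ] Aux script was created successfully
- [ ] Auxiliary eval system passes its tests
- [ ] Enough disk space is available (~1GB per run)
- [ ] Enough time is available (possibly several hours)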
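
The disk-space item can be checked from Python before launching. This is a small sketch using `shutil.disk_usage`; the 2 GB figure assumes two runs at roughly 1 GB each, as estimated in the checklist, and `has_free_space` is a hypothetical helper name.

```python
import shutil


def has_free_space(path, required_gb):
    """True if the filesystem containing `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Two runs at ~1 GB each; point `path` at the results directory's filesystem.
print(has_free_space(".", required_gb=2))
```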
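### While Running
- [ ] Baseline started
- [ ] Aux Only started (in parallel or sequentially)
- [ ] Logs confirm the runs are healthy
- [ ] `auxiliary_analysis.json` is generated correctly (Aux Only)
### After Completion
- [ ] Both experiments finished successfully
- [ ] Best scores collected
- [ ] Learning curves plotted
- [ ] Auxiliary metric correlations analyzed
- [ ] Experiment log written
- [ ] Conclusions drawn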
---
## 📚 Related Files
- `run_circle_packing_WITHOUT_vision.py` - Baseline
- `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` - Aux Only
- `examples/circle_packing/auxiliary_eval.py` - Auxiliary metrics implementation
- `examples/circle_packing/evaluate_with_auxiliary.py` - Integrated evaluator
- `AUXILIARY_EVAL_README.md` - Full documentation
---
**Good luck with your ablation study! 🚀**
This is a clean experimental design, so it should show clearly whether auxiliary metrics add value.