| # 🧪 Auxiliary Metrics Ablation Study Guide |
|
|
| ## Experimental Design: 2x2 Factorial Experiment |
|
|
| ### Full Experiment Matrix |
|
|
| | Group | Vision | Auxiliary | Script | Purpose | |
| |-------|--------|-----------|--------|---------| |
| | **Baseline** | ❌ | ❌ | `run_circle_packing_WITHOUT_vision.py` | Reference point | |
| | **Aux Only** | ❌ | ✅ | `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` | **Key comparison** | |
| | **Vision Only** | ✅ | ❌ | `run_circle_packing_WITH_vision.py` | Effect of vision | |
| | **Both** | ✅ | ✅ | (to be created) | Best combination (expected) | |
|
|
| --- |
|
|
| ## 🎯 Key Comparison: Baseline vs Aux Only |
|
|
| This is the **most important comparison** because it is a **clean ablation**: |
|
|
| ``` |
| Baseline: NO vision + NO auxiliary |
| Aux Only: NO vision + WITH auxiliary |
| |
| Only difference: auxiliary metrics |
| ``` |
|
|
| **If Aux Only > Baseline (ideally across repeated runs), that is evidence that the auxiliary metrics help!** |
|
|
| --- |
|
|
| ## 📊 Configuration Comparison |
|
|
| ### Shared Settings (to keep the comparison fair) |
|
|
| ```python |
| # Identical in both experiments: |
| num_generations = 200 |
| max_parallel_jobs = 4 |
| num_islands = 2 |
| archive_size = 40 |
| llm_models = ["native-gemini-2.5-flash", "native-gemini-2.5-pro"] |
| temperatures = [0.5, 0.7, 1.0] |
| # ... all other hyperparameters |
| ``` |
|
|
| ### Differing Settings (the only variable) |
|
|
| #### Baseline (WITHOUT auxiliary) |
| ```python |
| job_config = LocalJobConfig( |
| eval_program_path="examples/circle_packing/evaluate.py" # Ground truth only |
| ) |
| |
| # What the LLM sees: |
| #   Combined score: 2.456 |
| #   centers_str: (0.123, 0.456), ... |
| ``` |
|
|
| #### Aux Only (WITH auxiliary) |
| ```python |
| job_config = LocalJobConfig( |
| eval_program_path="examples/circle_packing/evaluate_with_auxiliary.py" # + Auxiliary |
| ) |
| |
| # What the LLM sees: |
| #   Combined score: 2.456 |
| #   aux_spatial_uniformity: 0.752 |
| #   aux_edge_utilization: 0.681 |
| #   aux_density_variance: 0.694 |
| #   aux_packing_efficiency: 0.734 |
| #   aux_gap_analysis: 0.812 |
| #   aux_geometric_quality: 0.778 |
| # |
| #   💡 Recommendations: |
| #   1. Only 3/4 corners utilized. Place larger circles at unused corners. |
| #   2. Detected 18.8% unused space. Consider increasing radii in sparse regions. |
| ``` |
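Because the ablation is only clean if the evaluator path is the single difference, it can help to check that mechanically before launching. The sketch below assumes each run's settings can be collected into a plain dictionary; `diff_configs` and `MISSING` are hypothetical helpers, not part of ShinkaEvolve.

```python
MISSING = "<missing>"  # placeholder for a key that exists on only one side


def diff_configs(baseline: dict, treatment: dict) -> dict:
    """Return {key: (baseline_value, treatment_value)} for every differing key."""
    keys = sorted(baseline.keys() | treatment.keys())
    return {
        key: (baseline.get(key, MISSING), treatment.get(key, MISSING))
        for key in keys
        if baseline.get(key, MISSING) != treatment.get(key, MISSING)
    }


# Settings copied from the shared block above (dict layout is an assumption).
shared = {
    "num_generations": 200,
    "max_parallel_jobs": 4,
    "num_islands": 2,
    "archive_size": 40,
}
baseline_cfg = {**shared, "eval_program_path": "examples/circle_packing/evaluate.py"}
aux_cfg = {
    **shared,
    "eval_program_path": "examples/circle_packing/evaluate_with_auxiliary.py",
}

changed = diff_configs(baseline_cfg, aux_cfg)
# The ablation is clean only if the evaluator path is the sole difference.
assert set(changed) == {"eval_program_path"}, f"unexpected differences: {changed}"
print(changed)
```

If the assertion fails, the message lists every setting that drifted between the two scripts.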
|
|
| --- |
|
|
| ## 🚀 Running the Experiments |
|
|
| ### Step 1: Run the Baseline (if you have not already) |
|
|
| ```bash |
| cd /home/tengxiao/pj/ShinkaEvolve |
| source .venv/bin/activate |
| |
| # Run the baseline |
| python my/run_circle_packing_WITHOUT_vision.py |
| ``` |
|
|
| **Expected duration**: anywhere from a few hours to a few days, depending on your settings |
|
|
| ### Step 2: Run Aux Only |
|
|
| ```bash |
| # Run the auxiliary-metrics version |
| python my/run_circle_packing_WITHOUT_vision_WITH_auxiliary.py |
| ``` |
|
|
| **Expected duration**: about the same as the baseline (the auxiliary computation is fast) |
|
|
| ### Step 3: Compare the Results |
|
|
| ```bash |
| # List the results of both experiments |
| ls -lh examples/circle_packing/results/ |
| ``` |
|
|
| --- |
|
|
| ## 📈 Evaluation Metrics |
|
|
| ### Primary Metrics |
|
|
| 1. **Final best score** |
| ```bash |
| # Baseline |
| cat examples/circle_packing/results/results_circle_packing_WITHOUT_vision_*/best/results/metrics.json | grep combined_score |
| |
| # Aux Only |
| cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_aux_*/best/results/metrics.json | grep combined_score |
| ``` |
|
|
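| 2. **Convergence speed** |
| - Look at the best score of each generation |
| - Plot the learning curves |
| - See which run reaches a high score sooner |
|
|
| 3. **Final ranking** |
| ```python |
| # Query the best program from each database |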
| from shinka.database import ProgramDatabase |
| |
| db_baseline = ProgramDatabase(config=..., db_path="baseline.sqlite") |
| db_aux = ProgramDatabase(config=..., db_path="aux.sqlite") |
| |
| best_baseline = db_baseline.get_top_programs(n=1)[0] |
| best_aux = db_aux.get_top_programs(n=1)[0] |
| |
| print(f"Baseline best: {best_baseline.combined_score:.4f}") |
| print(f"Aux best: {best_aux.combined_score:.4f}") |
| print(f"Improvement: {(best_aux.combined_score - best_baseline.combined_score):.4f}") |
| ``` |
|
|
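| ### Secondary Metrics |
|
|
| 1. **Diversity** |
| - Diversity of the programs in the archive |
| - Whether more distinct strategies were explored |
|
|
| 2. **Stability** |
| - Variance of the scores |
| - Whether progress is steadier |
|
|
| 3. **Correlation of the auxiliary metrics with the primary score** (Aux Only only) |
| ```python |
| # Analyze how the auxiliary metrics correlate with the primary score |
| import pandas as pd |
| import matplotlib.pyplot as plt |
| |
| # Load the metrics of every generation |
| # Draw scatter plots |
| # See which auxiliary metrics are the most predictive |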
| ``` |
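The correlation step above is only outlined in comments, so here is a minimal dependency-free sketch of it. It assumes the per-program metrics have already been flattened into a list of dictionaries holding `combined_score` and the `aux_*` values; that record layout and the helper names `pearson` and `aux_correlations` are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return cov / (norm_x * norm_y)


def aux_correlations(records, score_key="combined_score", prefix="aux_"):
    """Correlate every aux_* metric with the primary score, strongest first."""
    scores = [r[score_key] for r in records]
    aux_keys = sorted(k for k in records[0] if k.startswith(prefix))
    corr = {k: pearson([r[k] for r in records], scores) for k in aux_keys}
    # Rank by absolute correlation; undefined (NaN) values go last.
    rank = lambda kv: -1.0 if math.isnan(kv[1]) else abs(kv[1])
    return dict(sorted(corr.items(), key=rank, reverse=True))


# Hypothetical records, for illustration only (not real experiment output).
records = [
    {"combined_score": 2.30, "aux_edge_utilization": 0.55, "aux_gap_analysis": 0.70},
    {"combined_score": 2.40, "aux_edge_utilization": 0.62, "aux_gap_analysis": 0.69},
    {"combined_score": 2.45, "aux_edge_utilization": 0.68, "aux_gap_analysis": 0.75},
]
for name, value in aux_correlations(records).items():
    print(f"{name}: r = {value:+.3f}")
```

A high absolute correlation only says the metric tracks the score in this run; it does not by itself show that the LLM used the signal.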
|
|
| --- |
|
|
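| ## 📊 Expected Results |
|
|
| ### If the Auxiliary Metrics Are Effective |
|
|
| **Expected observations** (illustrative numbers): |
| ``` |
| Baseline: best score = 2.45 |
| Aux Only: best score = 2.55 ✅ ~4% improvement |
| |
| Convergence curves: |
| Baseline: slower, plateaus earlier |
| Aux Only: faster, keeps improving |
| |
| LLM behavior: |
| Baseline: undirected exploration |
| Aux Only: targeted improvements (e.g. "improve edge_utilization") |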
| ``` |
|
|
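| ### If the Effect Is Not Clear |
|
|
| **Possible causes**: |
| 1. The auxiliary metrics do not correlate with the primary score |
| 2. The LLM is not making effective use of the auxiliary information |
| 3. The metric weights or the feedback format need adjusting |
|
|
| **Next steps**: |
| - Analyze which auxiliary metrics are most useful |
| - Reword the text feedback |
| - Consider a stronger auxiliary signal |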
|
|
| --- |
|
|
| ## 🔍 Detailed Analysis Scripts |
|
|
| ### Comparing the Best Solutions |
|
|
| ```python |
| import json |
| from pathlib import Path |
| |
| # Load the best result of each experiment (Path.read_text closes the file for us) |
| baseline_metrics = json.loads(Path("results_baseline/best/results/metrics.json").read_text()) |
| aux_metrics = json.loads(Path("results_aux/best/results/metrics.json").read_text()) |
| |
| print("=" * 60) |
| print("COMPARISON: Baseline vs Aux Only") |
| print("=" * 60) |
| |
| print(f"\nPrimary Score:") |
| print(f" Baseline: {baseline_metrics['combined_score']:.4f}") |
| print(f" Aux Only: {aux_metrics['combined_score']:.4f}") |
| print(f" Δ: {aux_metrics['combined_score'] - baseline_metrics['combined_score']:.4f}") |
| |
| if 'public' in aux_metrics: |
| print(f"\nAuxiliary Metrics (Aux Only):") |
| for key, value in aux_metrics['public'].items(): |
| if key.startswith('aux_'): |
| print(f" {key}: {value:.3f}" if isinstance(value, float) else f" {key}: {value}") |
| ``` |
|
|
| ### Plotting the Learning Curves |
|
|
| ```python |
| import matplotlib.pyplot as plt |
| import sqlite3 |
| |
| def get_best_scores_per_gen(db_path): |
| conn = sqlite3.connect(db_path) |
| cursor = conn.cursor() |
| |
| cursor.execute(""" |
| SELECT generation, MAX(combined_score) as best_score |
| FROM programs |
| WHERE correct = 1 |
| GROUP BY generation |
| ORDER BY generation |
| """) |
| |
| data = cursor.fetchall() |
| conn.close() |
| |
| return [row[0] for row in data], [row[1] for row in data] |
| |
| # Fetch the data |
| gens_baseline, scores_baseline = get_best_scores_per_gen("baseline.sqlite") |
| gens_aux, scores_aux = get_best_scores_per_gen("aux.sqlite") |
| |
| # Plot |
| plt.figure(figsize=(12, 6)) |
| plt.plot(gens_baseline, scores_baseline, label="Baseline (No Aux)", marker='o', alpha=0.7) |
| plt.plot(gens_aux, scores_aux, label="Aux Only", marker='s', alpha=0.7) |
| plt.xlabel("Generation") |
| plt.ylabel("Best Combined Score") |
| plt.title("Learning Curves: Baseline vs Auxiliary Metrics") |
| plt.legend() |
| plt.grid(True, alpha=0.3) |
| plt.savefig("learning_curves_comparison.png", dpi=150) |
| print("Saved: learning_curves_comparison.png") |
| ``` |
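The query above returns the best score within each generation, which can go down from one generation to the next. For judging convergence speed, a best-so-far curve is often easier to read. The sketch below assumes the `(gens, scores)` lists returned by `get_best_scores_per_gen`; the helper names are hypothetical.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum: the best score seen up to and including each generation."""
    return list(accumulate(scores, max))


def generations_to_reach(gens, scores, threshold):
    """First generation whose best-so-far score reaches threshold, else None."""
    for gen, best in zip(gens, best_so_far(scores)):
        if best >= threshold:
            return gen
    return None


# Hypothetical per-generation maxima, for illustration only.
gens = [0, 1, 2, 3, 4]
scores = [2.10, 2.05, 2.30, 2.25, 2.40]
print(best_so_far(scores))
print(generations_to_reach(gens, scores, threshold=2.30))
```

Applying `generations_to_reach` to both runs with the same threshold gives one comparable number per run for the convergence-speed comparison.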
|
|
| --- |
|
|
| ## 🎯 Success Criteria |
|
|
| ### Minimum Success Criteria |
|
|
|
| - [ ] Aux Only best score > Baseline best score |
| - [ ] Statistical significance (p < 0.05, if the runs are repeated several times) |
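One way to obtain such a p-value from a handful of repeated runs is an exact permutation test on the final best scores. This is a sketch under the assumption that each arm was repeated a few times with different seeds; it is a swapped-in generic technique, not something the codebase provides, and `permutation_p_value` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(baseline, treatment):
    """One-sided exact permutation test for mean(treatment) > mean(baseline).

    Enumerates every way to relabel the pooled scores, so it is only
    practical for a small number of runs per arm.
    """
    pooled = list(baseline) + list(treatment)
    n_total, n_treat = len(pooled), len(treatment)
    if n_treat == 0 or n_treat == n_total:
        raise ValueError("both arms need at least one run")
    observed = mean(treatment) - mean(baseline)
    total = sum(pooled)
    hits = count = 0
    for idx in combinations(range(n_total), n_treat):
        treat_sum = sum(pooled[i] for i in idx)
        diff = treat_sum / n_treat - (total - treat_sum) / (n_total - n_treat)
        # Small tolerance so float rounding does not drop exact ties.
        hits += diff >= observed - 1e-12
        count += 1
    return hits / count


# Hypothetical final best scores from repeated runs, for illustration only.
baseline_runs = [2.44, 2.45, 2.43, 2.46]
aux_runs = [2.54, 2.55, 2.52, 2.56]
print(f"p = {permutation_p_value(baseline_runs, aux_runs):.4f}")
```

With three runs per arm the smallest achievable p-value is 1/20 = 0.05, so at least four runs per arm are needed before p < 0.05 is even possible with this test.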
|
|
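| ### Ideal Success Criteria |
|
|
| - [ ] Aux Only improvement > 5% |
| - [ ] Convergence speed improvement > 20% |
| - [ ] The auxiliary metrics show a clear correlation with the primary score |
|
|
|
| ### Additional Insights |
|
|
|
| - [ ] Identify the most useful auxiliary metrics |
| - [ ] Understand how the LLM uses the auxiliary information |
| - [ ] Validate the effect of programmatic gap detection |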
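The two numeric thresholds above can be checked with a few lines of arithmetic. The sketch assumes "improvement" means relative gain in best score and "convergence speed improvement" means the relative reduction in generations needed to reach a shared target score; both definitions, and the function names, are assumptions rather than something the guide fixes.

```python
def relative_improvement(baseline_best, aux_best):
    """Relative gain of the Aux Only best score over the Baseline best score."""
    if baseline_best == 0:
        raise ValueError("baseline score must be non-zero")
    return (aux_best - baseline_best) / abs(baseline_best)


def convergence_speedup(baseline_gens, aux_gens):
    """Relative reduction in generations needed to reach the same target score."""
    if baseline_gens <= 0:
        raise ValueError("baseline generation count must be positive")
    return (baseline_gens - aux_gens) / baseline_gens


def meets_ideal_criteria(baseline_best, aux_best, baseline_gens, aux_gens,
                         min_improvement=0.05, min_speedup=0.20):
    """True if both the >5% score gain and >20% speedup thresholds are met."""
    return (relative_improvement(baseline_best, aux_best) > min_improvement
            and convergence_speedup(baseline_gens, aux_gens) > min_speedup)


# Illustrative scores from the "expected results" example; generation counts are made up.
print(f"{relative_improvement(2.45, 2.55):.1%}")
print(meets_ideal_criteria(2.45, 2.55, baseline_gens=150, aux_gens=100))
```

Note that the illustrative 2.45 to 2.55 gain is about 4%, which would satisfy the minimum criterion but not the 5% ideal threshold.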
|
|
| --- |
|
|
| ## 📝 Experiment Log Template |
|
|
| ```markdown |
| # Experiment Log |
| |
| ## Baseline (WITHOUT vision, WITHOUT aux) |
| - Start: YYYY-MM-DD HH:MM |
| - End: YYYY-MM-DD HH:MM |
| - Best Score: X.XXXX |
| - Notes: ... |
| |
| ## Aux Only (WITHOUT vision, WITH aux) |
| - Start: YYYY-MM-DD HH:MM |
| - End: YYYY-MM-DD HH:MM |
| - Best Score: X.XXXX |
| - Improvement over Baseline: +X.XXXX (+X.X%) |
| - Notes: ... |
| |
| ## Key Observations |
| 1. ... |
| 2. ... |
| |
| ## Auxiliary Metrics Analysis |
| - Most useful metrics: ... |
| - Correlations: ... |
| - LLM behavior changes: ... |
| |
| ## Conclusions |
| - Effect of auxiliary metrics: [effective / ineffective / partially effective] |
| - Next steps: ... |
| ``` |
|
|
| --- |
|
|
| ## 🔮 Follow-up Experiments (if Aux Is Effective) |
|
|
| ### Phase 2: Complete the 2x2 Matrix |
|
|
| ```bash |
| # 1. WITH vision + WITHOUT aux (already exists) |
| python my/run_circle_packing_WITH_vision.py |
| |
| # 2. WITH vision + WITH aux (to be created) |
| # Create this variant to test the combined effect of vision + auxiliary |
| ``` |
|
|
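| ### Phase 3: Parameter Tuning |
|
|
| - Adjust the weights of the auxiliary metrics |
| - Improve the text feedback format |
| - Try different metric combinations |
|
|
|
| ### Phase 4: LLM-Generated Metrics |
|
|
|
| - Let the LLM propose new auxiliary metrics |
| - Automatically filter for the useful ones |
| - Co-evolution |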
|
|
| --- |
|
|
| ## 💡 Pro Tips |
|
|
| ### 1. Validate with a Short Run First |
|
|
| ```python |
| # Set num_generations = 20 for a quick test |
| num_generations = 20 # Instead of 200 |
| ``` |
|
|
|
| **Purpose**: quickly confirm that the whole pipeline works |
|
|
| ### 2. Monitor Progress |
|
|
| ```bash |
| # Watch the score of the latest generation (refreshes every 60 s) |
| watch -n 60 'tail -20 examples/circle_packing/results/results_*/evolution_run.log | grep "best program"' |
| ``` |
|
|
| ### 3. Mid-run Check |
|
|
| ```bash |
| # Check the trend after 50 generations |
| python -c " |
| from shinka.database import ProgramDatabase, DatabaseConfig |
| db = ProgramDatabase(config=DatabaseConfig(...), db_path='...') |
| db.print_summary() |
| " |
| ``` |
|
|
| ### 4. Save Checkpoints |
|
|
| ```bash |
| # Back up the database periodically |
| cp evolution_db.sqlite evolution_db_backup_gen50.sqlite |
| ``` |
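If the run is still writing to the database, a plain `cp` may capture a half-written file. As an alternative sketch, Python's standard `sqlite3` module offers an online backup (`Connection.backup`, Python 3.7+) that copies a consistent snapshot. The function name `backup_sqlite` is hypothetical, and whether the live run tolerates concurrent readers is an assumption to verify on a short test run first.

```python
import sqlite3


def backup_sqlite(src_path: str, dst_path: str) -> None:
    """Copy a consistent snapshot of src_path into dst_path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # sqlite3's online backup API copies all pages of the source database.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

For example, `backup_sqlite("evolution_db.sqlite", "evolution_db_backup_gen50.sqlite")` would produce the same checkpoint file as the `cp` command above.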
|
|
| --- |
|
|
| ## ✅ Checklist |
|
|
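| ### Before Starting |
| - [ ] Confirm the baseline script exists |
| - [ ] Confirm the aux script was created successfully |
| - [ ] Confirm the auxiliary eval system passes its tests |
| - [ ] Confirm there is enough disk space (~1GB per run) |
| - [ ] Confirm there is enough time (possibly hours to days) |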
|
|
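| ### While Running |
| - [ ] Baseline started |
| - [ ] Aux Only started (in parallel or sequentially) |
| - [ ] Monitor the logs to confirm the runs are healthy |
| - [ ] Check that auxiliary_analysis.json is generated correctly (Aux Only) |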
| |
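| ### After Completion |
| - [ ] Both experiments finished successfully |
| - [ ] Collect the best scores |
| - [ ] Plot the learning curves |
| - [ ] Analyze the auxiliary metric correlations |
| - [ ] Fill in the experiment log |
| - [ ] Draw conclusions |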
| |
| --- |
| |
| ## 📚 Related Files |
| |
| - `run_circle_packing_WITHOUT_vision.py` - Baseline |
| - `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` - Aux Only |
| - `examples/circle_packing/auxiliary_eval.py` - Auxiliary metrics implementation |
| - `examples/circle_packing/evaluate_with_auxiliary.py` - Integrated evaluator |
| - `AUXILIARY_EVAL_README.md` - Full documentation |
|
|
| --- |
|
|
| **Good luck with your ablation study! 🚀** |
|
|
| This is a very clean experimental design, and it should show clearly whether the auxiliary metrics add value. |
|
|