# 🧪 Auxiliary Metrics Ablation Study Guide

## Experimental Design: 2x2 Factorial Experiment

### Full Experiment Matrix

| Group | Vision | Auxiliary | Script | Purpose |
|-------|--------|-----------|--------|---------|
| **Baseline** | ❌ | ❌ | `run_circle_packing_WITHOUT_vision.py` | Baseline |
| **Aux Only** | ❌ | ✅ | `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` | **Key comparison** |
| **Vision Only** | ✅ | ❌ | `run_circle_packing_WITH_vision.py` | Effect of vision |
| **Both** | ✅ | ✅ | (to be created) | Best combination |

---

## 🎯 Key Comparison: Baseline vs Aux Only

This is the **most important comparison** because it is a **clean ablation**:

```
Baseline:  NO vision + NO auxiliary
Aux Only:  NO vision + WITH auxiliary

Only difference: auxiliary metrics
```

**If Aux Only scores higher than Baseline, the gain can be attributed to the auxiliary metrics.**

---

## 📊 Experiment Configuration Comparison

### What Is Identical (to keep the comparison fair)

```python
# Identical in both experiments:
num_generations = 200
max_parallel_jobs = 4
num_islands = 2
archive_size = 40
llm_models = ["native-gemini-2.5-flash", "native-gemini-2.5-pro"]
temperatures = [0.5, 0.7, 1.0]
# ... all other hyperparameters
```

### What Differs (the single variable)

#### Baseline (WITHOUT auxiliary)

```python
job_config = LocalJobConfig(
    eval_program_path="examples/circle_packing/evaluate.py"  # Ground truth only
)

# The LLM sees:
#   Combined score: 2.456
#   centers_str: (0.123, 0.456), ...
```

#### Aux Only (WITH auxiliary)

```python
job_config = LocalJobConfig(
    eval_program_path="examples/circle_packing/evaluate_with_auxiliary.py"  # + Auxiliary
)

# The LLM sees:
#   Combined score: 2.456
#   aux_spatial_uniformity: 0.752
#   aux_edge_utilization: 0.681
#   aux_density_variance: 0.694
#   aux_packing_efficiency: 0.734
#   aux_gap_analysis: 0.812
#   aux_geometric_quality: 0.778
#
#   💡 Recommendations:
#   1. Only 3/4 corners utilized. Place larger circles at unused corners.
#   2. Detected 18.8% unused space. Consider increasing radii in sparse regions.
```

---

## 🚀 Running the Experiments

### Step 1: Run Baseline (if you have not already)

```bash
cd /home/tengxiao/pj/ShinkaEvolve  # adjust to your own checkout path
source .venv/bin/activate

# Run the baseline
python my/run_circle_packing_WITHOUT_vision.py
```

**Expected time**: depending on your settings, a few hours to a few days.

### Step 2: Run Aux Only

```bash
# Run the auxiliary metrics version
python my/run_circle_packing_WITHOUT_vision_WITH_auxiliary.py
```

**Expected time**: the same as the baseline (the auxiliary computation is fast).

### Step 3: Compare Results

```bash
# List the results of both experiments
ls -lh examples/circle_packing/results/
```

---

## 📈 Evaluation Metrics

### Primary Metrics

1. **Final best score**

   ```bash
   # Baseline
   grep combined_score examples/circle_packing/results/results_circle_packing_WITHOUT_vision_*/best/results/metrics.json

   # Aux Only
   grep combined_score examples/circle_packing/results/results_circle_packing_NO_vision_WITH_aux_*/best/results/metrics.json
   ```

2. **Convergence speed**
   - Look at the best score of each generation
   - Plot the learning curves
   - See which run reaches a high score sooner

3. **Final ranking**

   ```python
   # Query the best program from each database
   from shinka.database import ProgramDatabase

   db_baseline = ProgramDatabase(config=..., db_path="baseline.sqlite")
   db_aux = ProgramDatabase(config=..., db_path="aux.sqlite")

   best_baseline = db_baseline.get_top_programs(n=1)[0]
   best_aux = db_aux.get_top_programs(n=1)[0]

   print(f"Baseline best: {best_baseline.combined_score:.4f}")
   print(f"Aux best: {best_aux.combined_score:.4f}")
   print(f"Improvement: {(best_aux.combined_score - best_baseline.combined_score):.4f}")
   ```

### Secondary Metrics

1. **Diversity**
   - Diversity of the programs in the archive
   - Whether more distinct strategies were explored

2. **Stability**
   - Variance of the scores
   - Whether progress is steadier

3. **Correlation of the auxiliary metrics** (Aux Only)

   ```python
   # Analyze how the auxiliary metrics correlate with the primary score
   import pandas as pd
   import matplotlib.pyplot as plt

   # Load the metrics of every generation
   # Draw scatter plots
   # See which auxiliary metrics are most predictive
   ```

---

## 📊 Expected Results

### If Auxiliary Metrics Are Effective

**Expected observations**:

```
Baseline: best score = 2.45
Aux Only: best score = 2.55 ✅ ~4% improvement

Convergence curves:
  Baseline: slower, plateaus earlier
  Aux Only: faster, keeps improving

LLM behavior:
  Baseline: random exploration, no clear direction
  Aux Only: targeted improvements (e.g. "improve edge_utilization")
```

### If the Effect Is Not Clear

**Possible causes**:

1. The auxiliary metrics do not correlate with the primary score
2. The LLM is not using the auxiliary information effectively
3. The metric weights or the feedback format need adjusting

**Next steps**:

- Analyze which auxiliary metrics are most useful
- Reword the text feedback
- Consider a stronger auxiliary signal

---

## 🔍 Detailed Analysis Scripts

### Compare the Best Solutions

```python
import json
from pathlib import Path

# Load the best result of each experiment
baseline_metrics = json.loads(Path("results_baseline/best/results/metrics.json").read_text())
aux_metrics = json.loads(Path("results_aux/best/results/metrics.json").read_text())

print("=" * 60)
print("COMPARISON: Baseline vs Aux Only")
print("=" * 60)

print("\nPrimary Score:")
print(f"  Baseline: {baseline_metrics['combined_score']:.4f}")
print(f"  Aux Only: {aux_metrics['combined_score']:.4f}")
print(f"  Δ: {aux_metrics['combined_score'] - baseline_metrics['combined_score']:+.4f}")

# Auxiliary metrics only exist in the Aux Only run
if 'public' in aux_metrics:
    print("\nAuxiliary Metrics (Aux Only):")
    for key, value in aux_metrics['public'].items():
        if key.startswith('aux_'):
            print(f"  {key}: {value:.3f}" if isinstance(value, float) else f"  {key}: {value}")
```

### Plot the Learning Curves

```python
import sqlite3

import matplotlib.pyplot as plt


def get_best_scores_per_gen(db_path):
    """Return (generations, best score within each generation) for correct programs."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT generation, MAX(combined_score) AS best_score
        FROM programs
        WHERE correct = 1
        GROUP BY generation
        ORDER BY generation
    """)
    data = cursor.fetchall()
    conn.close()
    return [row[0] for row in data], [row[1] for row in data]


# Fetch the data (per-generation best, not best-so-far)
gens_baseline, scores_baseline = get_best_scores_per_gen("baseline.sqlite")
gens_aux, scores_aux = get_best_scores_per_gen("aux.sqlite")

# Plot
plt.figure(figsize=(12, 6))
plt.plot(gens_baseline, scores_baseline, label="Baseline (No Aux)", marker='o', alpha=0.7)
plt.plot(gens_aux, scores_aux, label="Aux Only", marker='s', alpha=0.7)
plt.xlabel("Generation")
plt.ylabel("Best Combined Score")
plt.title("Learning Curves: Baseline vs Auxiliary Metrics")
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig("learning_curves_comparison.png", dpi=150)
print("Saved: learning_curves_comparison.png")
```

---

## 🎯 Success Criteria

### Minimum Success Criteria

- [ ] Aux Only best score > Baseline best score
- [ ] Statistical significance (p < 0.05, if the runs are repeated several times)

### Ideal Success Criteria

- [ ] Aux Only improves on Baseline by > 5%
- [ ] Convergence speed improves by > 20%
- [ ] The auxiliary metrics correlate clearly with the primary score

### Additional Insights

- [ ] Identify the most useful auxiliary metrics
- [ ] Find out how the LLM uses the auxiliary information
- [ ] Verify the effect of programmatic gap detection

---

## 📝 Experiment Log Template

```markdown
# Experiment Log

## Baseline (WITHOUT vision, WITHOUT aux)
- Start: YYYY-MM-DD HH:MM
- End: YYYY-MM-DD HH:MM
- Best Score: X.XXXX
- Notes: ...

## Aux Only (WITHOUT vision, WITH aux)
- Start: YYYY-MM-DD HH:MM
- End: YYYY-MM-DD HH:MM
- Best Score: X.XXXX
- Improvement over Baseline: +X.XXXX (+X.X%)
- Notes: ...

## Key Observations
1. ...
2. ...

## Auxiliary Metrics Analysis
- Most useful metrics: ...
- Correlations: ...
- LLM behavior changes: ...

## Conclusions
- Auxiliary metrics effect: [effective / not effective / partially effective]
- Next steps: ...
```

---

## 🔮 Follow-up Experiments (if Aux is effective)

### Phase 2: Complete the 2x2 Matrix

```bash
# 1. WITH vision + WITHOUT aux (already exists)
python my/run_circle_packing_WITH_vision.py

# 2. WITH vision + WITH aux (to be created)
# Create this version to test the combined effect of vision + auxiliary
```

### Phase 3: Parameter Tuning

- Adjust the auxiliary metric weights
- Optimize the text feedback format
- Try different metric combinations

### Phase 4: LLM-Generated Metrics

- Let the LLM propose new auxiliary metrics
- Automatically filter for the useful metrics
- Co-evolution

---

## 💡 Pro Tips

### 1. Validate with a Short Run First

```python
# Set num_generations = 20 for a quick test
num_generations = 20  # Instead of 200
```

**Purpose**: quickly confirm that the system works end to end.

### 2. Monitor Progress

```bash
# Show the score of the latest generation, refreshed every 60 seconds
watch -n 60 'tail -20 examples/circle_packing/results/results_*/evolution_run.log | grep "best program"'
```

### 3. Mid-Run Check

```bash
# Check the trend after 50 generations
python -c "
from shinka.database import ProgramDatabase, DatabaseConfig
db = ProgramDatabase(config=DatabaseConfig(...), db_path='...')
db.print_summary()
"
```

### 4. Save Checkpoints

```bash
# Back up the database regularly
cp evolution_db.sqlite evolution_db_backup_gen50.sqlite
```

---

## ✅ Checklist

### Before Starting

- [ ] Confirm the baseline script exists
- [ ] Confirm the aux script was created successfully
- [ ] Confirm the auxiliary eval system passes its tests
- [ ] Confirm there is enough disk space (~1GB per run)
- [ ] Confirm there is enough time (possibly several hours)

### While Running

- [ ] Baseline started
- [ ] Aux Only started (in parallel or sequentially)
- [ ] Logs monitored to confirm the runs are healthy
- [ ] `auxiliary_analysis.json` is generated correctly (Aux Only)

### After Completion

- [ ] Both experiments finished successfully
- [ ] Best scores collected
- [ ] Learning curves plotted
- [ ] Auxiliary metric correlations analyzed
- [ ] Experiment log written
- [ ] Conclusions drawn

---

## 📚 Related Files

- `run_circle_packing_WITHOUT_vision.py` - Baseline
- `run_circle_packing_WITHOUT_vision_WITH_auxiliary.py` - Aux Only
- `examples/circle_packing/auxiliary_eval.py` - Auxiliary metrics implementation
- `examples/circle_packing/evaluate_with_auxiliary.py` - Integrated evaluator
- `AUXILIARY_EVAL_README.md` - Full documentation

---

**Good luck with your ablation study! 🚀**

This is a clean experimental design, and it should show clearly whether auxiliary metrics add value.
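
## 🧮 Scoring the Success Criteria

The success criteria name three quantities (score improvement, convergence speed, and statistical significance) without fixing how to compute them. The sketch below is one possible reading, not part of the ShinkaEvolve codebase. Every function name here is hypothetical. It assumes "convergence speed" means the number of generations needed to reach a shared target score, and it swaps in a simple one-sided permutation test for the significance check, which only makes sense once each configuration has been run several times with different seeds.

```python
import random
from statistics import mean


def relative_improvement(baseline_best, aux_best):
    """Percent gain of Aux Only over Baseline in best combined_score."""
    return 100.0 * (aux_best - baseline_best) / baseline_best


def generations_to_reach(gens, best_scores, target):
    """First generation whose best-so-far score reaches `target`, else None."""
    best_so_far = float("-inf")
    for gen, score in zip(gens, best_scores):
        best_so_far = max(best_so_far, score)
        if best_so_far >= target:
            return gen
    return None


def convergence_speedup(gens_base, scores_base, gens_aux, scores_aux, target):
    """Percent fewer generations Aux Only needs to reach `target`.

    Assumed definition. Returns None when either run never reaches the
    target, or when Baseline reaches it at generation 0.
    """
    g_base = generations_to_reach(gens_base, scores_base, target)
    g_aux = generations_to_reach(gens_aux, scores_aux, target)
    if g_base is None or g_aux is None or g_base == 0:
        return None
    return 100.0 * (g_base - g_aux) / g_base


def permutation_p_value(baseline_runs, aux_runs, n_resamples=10_000, seed=0):
    """One-sided permutation test on the difference of mean best scores.

    Each input holds one best score per repeated run. The p-value estimates
    how often randomly relabelled runs show a gap at least as large as the
    observed one.
    """
    rng = random.Random(seed)
    observed = mean(aux_runs) - mean(baseline_runs)
    pooled = list(baseline_runs) + list(aux_runs)
    n_aux = len(aux_runs)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if mean(pooled[:n_aux]) - mean(pooled[n_aux:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Illustrative scores from the "Expected Results" section, not measurements
    print(f"Improvement: {relative_improvement(2.45, 2.55):.1f}%")
```

The `gens` and `scores` lists returned by `get_best_scores_per_gen` in the learning-curve script can be passed straight into `convergence_speedup`. With only one run per configuration the permutation test has nothing to resample, so the significance criterion stays unchecked until repeats exist.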