| # End-to-End Analysis of How the Eval Service Dynamically Generates Metrics |
|
|
| ## 🔍 Overall Architecture |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │ ShinkaEvolve Evolution Loop                                     │ |
| │ 1. Run the program (gen_X/main.py)                              │ |
| │ 2. Evaluate (evaluate.py) → metrics.json                        │ |
| │ 3. Notify Eval Service → "new generation complete"              │ |
| └─────────────────────────────────────────────────────────────────┘ |
| ↓ HTTP POST |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │ Eval Service (ev2_service_standalone.py)                        │ |
| │ 1. Receive notification (generation, score, results_dir)        │ |
| │ 2. Decide: trigger the agent?                                   │ |
| │ 3. YES → launch IntegratedEV2Agent                              │ |
| └─────────────────────────────────────────────────────────────────┘ |
| ↓ if triggered |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │ IntegratedEV2Agent (OpenHands Agent + LLM)                      │ |
| │ 1. Analyze evolution history (reads gen_*/results/metrics.json) │ |
| │ 2. Identify aspects the primary metric does not cover           │ |
| │ 3. Design auxiliary metrics (Python functions)                  │ |
| │ 4. Generate code: auxiliary_metrics.py                          │ |
| │ 5. Save analysis: EVAL_AGENTS.md                                │ |
| │ Workspace: <results_dir>/eval_agent_memory/                     │ |
| └─────────────────────────────────────────────────────────────────┘ |
| ↓ generates files |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │ Output files (after the fix, under the experiment root)         │ |
| │ • eval_agent_memory/auxiliary_metrics.py  ← LLM-generated code  │ |
| │ • eval_agent_memory/EVAL_AGENTS.md        ← agent analysis log  │ |
| │ • eval_agent_memory/service_state.json    ← service state       │ |
| └─────────────────────────────────────────────────────────────────┘ |
| ↓ |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │ ❌ Now: ShinkaEvolve does not auto-use these dynamic metrics    │ |
| │ ✅ Existing: predefined auxiliary_eval.py can be used manually  │ |
| └─────────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| --- |
|
|
| ## 📊 Part 1: How the Eval Service Generates New Metrics |
|
|
| ### 1.1 Trigger Mechanism |
|
|
| **Location**: `eval_agent/ev2_service_standalone.py` |
|
|
| ```python |
| # ServiceState decides when to trigger the agent |
| def should_trigger_agent(self, generation: int, primary_score: float): |
|     # Example trigger conditions: |
|     #   - once every N generations |
|     #   - the score plateaus (stops improving) |
|     #   - manual trigger |
|     pass |
| ``` |
|
|
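| **Observed data** (from your experiment): |
| - 50 generations in total |
| - The agent was triggered about 7 times (gen_9, 20, 30, 31, 41, 42, 43) |
| - The trigger intervals are irregular, which suggests the trigger depends on score changes or other logic rather than a fixed period |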
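
The stub above leaves the trigger logic open. As a minimal sketch, assuming the service combines a periodic trigger with a plateau check, the decision could look like the following. `TriggerPolicy`, `every_n`, `plateau_window`, and `plateau_eps` are hypothetical names chosen for illustration; they are not taken from `ev2_service_standalone.py`, and the real `ServiceState` may use different rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriggerPolicy:
    """Hypothetical trigger policy; not the actual ServiceState logic."""
    every_n: int = 10            # assumed periodic interval
    plateau_window: int = 5      # assumed number of recent scores to inspect
    plateau_eps: float = 1e-3    # assumed minimum gain that counts as progress
    history: List[float] = field(default_factory=list)

    def should_trigger_agent(self, generation: int, primary_score: float) -> bool:
        self.history.append(primary_score)
        # Periodic trigger: fire every `every_n` generations (skipping generation 0).
        if generation > 0 and generation % self.every_n == 0:
            return True
        # Plateau trigger: the best recent score barely beats the best earlier score.
        if len(self.history) > self.plateau_window:
            recent_best = max(self.history[-self.plateau_window:])
            earlier_best = max(self.history[:-self.plateau_window])
            return recent_best - earlier_best < self.plateau_eps
        return False
```

A policy of this shape would produce irregular trigger generations similar to those observed, since plateau triggers can fire between the periodic ones.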
| |
| ### 1.2 Agent Workflow |
| 
| **Core file**: `evolution_evaluation_agent()` in `eval_agent/ev2.py` |
|
|
| **Steps**: |
|
|
| #### Step 1: Agent initialization |
| ```python |
| # Create the workspace |
| agent_workspace = Path(results_dir) / "eval_agent_memory" |
| 
| # Create the OpenHands Agent (backed by an LLM) |
| llm = LLM(model="vertex_ai/gemini-2.5-flash") |
| agent = Agent( |
|     llm=llm, |
|     tools=[TerminalTool, FileEditorTool, TaskTrackerTool], |
|     system_prompt_filename="ev2_prompt.j2",  # ← the key prompt |
| ) |
| ``` |
|
|
| #### Step 2: Build the task message |
| ```python |
| task_message = f""" |
| === Generation {current_gen} Evaluation === |
| |
| 📁 File Locations: |
| - Results directory: {results_dir} |
| - Current generation: {results_dir}/gen_{current_gen} |
| - All generations: gen_0/ through gen_{current_gen}/ |
| |
| 📊 Available Data: |
| - Evolution database: evolution_db_*.sqlite |
| - Each generation has: main.py and results/metrics.json |
| |
| ⚠️ PRIMARY EVALUATOR (FIXED - DO NOT MODIFY): |
| - Path: {primary_evaluator_path} |
| - You MUST NOT modify this evaluator |
| - You can READ it to understand what is being optimized |
| - Your job is to create AUXILIARY metrics that complement it |
| |
| 🎯 Your Specific Tasks: |
| 1. Analyze evolution progress up to generation {current_gen} |
| 2. Review performance trends from recent generations |
| 3. Identify what aspects are NOT being measured by primary metric |
| 4. Design 2-3 auxiliary metrics that would provide useful insights |
| 5. Implement these metrics as Python functions in your workspace |
| 6. Test metrics on current generation data |
| 7. Document findings and metric designs in EVAL_AGENTS.md |
| """ |
| ``` |
|
|
| #### Step 3: Agent execution |
| The agent uses its tools to: |
| - **TerminalTool**: run Python code and test the metrics |
| - **FileEditorTool**: create and edit `auxiliary_metrics.py` |
| - **TaskTrackerTool**: track task progress |
|
|
| The agent reads: |
| ```bash |
| # Historical data |
| gen_0/results/metrics.json |
| gen_1/results/metrics.json |
| ... |
| gen_{current_gen}/results/metrics.json |
| 
| # The primary evaluator (to understand the optimization target) |
| examples/circle_packing/evaluate.py |
| 
| # The current best code (to understand the current strategy) |
| gen_X/main.py # the current best generation |
| ``` |
|
|
| ### 1.3 Example of a Generated Metrics File |
|
|
| **File**: `gen_9/results/eval_agent_memory/auxiliary_metrics.py` |
|
|
| ```python |
| import numpy as np |
| |
| def calculate_radius_std_dev(radii: np.ndarray) -> float: |
|     """ |
|     Calculates the standard deviation of circle radii. |
|     A lower value indicates more uniform circle sizes. |
|     """ |
|     if len(radii) == 0: |
|         return 0.0 |
|     return np.std(radii) |
| 
| def calculate_nearest_neighbor_metrics(centers: np.ndarray) -> dict: |
|     """ |
|     Calculates the average and standard deviation of nearest neighbor |
|     distances for circle centers. |
|     """ |
|     if len(centers) < 2: |
|         return {"avg_nn_distance": 0.0, "std_nn_distance": 0.0} |
| 
|     n = centers.shape[0] |
|     min_distances = [] |
| 
|     for i in range(n): |
|         distances = [] |
|         for j in range(n): |
|             if i != j: |
|                 dist = np.sqrt(np.sum((centers[i] - centers[j]) ** 2)) |
|                 distances.append(dist) |
|         if distances: |
|             min_distances.append(min(distances)) |
| 
|     return { |
|         "avg_nn_distance": float(np.mean(min_distances)), |
|         "std_nn_distance": float(np.std(min_distances)), |
|     } |
| 
| def evaluate_auxiliary_metrics(centers: np.ndarray, radii: np.ndarray) -> dict: |
|     """ |
|     Combines all auxiliary metric calculations. |
|     """ |
|     radius_std_dev = calculate_radius_std_dev(radii) |
|     nn_metrics = calculate_nearest_neighbor_metrics(centers) |
| 
|     return { |
|         "auxiliary_radius_std_dev": radius_std_dev, |
|         "auxiliary_avg_nn_distance": nn_metrics["avg_nn_distance"], |
|         "auxiliary_std_nn_distance": nn_metrics["std_nn_distance"], |
|     } |
| ``` |
|
|
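| **Key points**: |
| - ✅ The agent **designed and implemented** 3 new metrics on its own |
| - ✅ These metrics measure aspects that the primary metric (sum of radii) does **not cover**: |
|   - Radius distribution (uniformity) |
|   - Spatial arrangement (nearest neighbor) |
|   - Evenness of the layout (spatial distribution) |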
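
The generated nearest-neighbor metric uses a double Python loop. As an illustration only (this is not what the agent produced), the same quantity can be computed with NumPy broadcasting. The function name `nearest_neighbor_stats` is hypothetical, and the sketch assumes `centers` is an `(n, 2)` array of circle centers.

```python
import numpy as np


def nearest_neighbor_stats(centers: np.ndarray) -> dict:
    """Vectorized mean/std of nearest-neighbor distances for an (n, 2) array."""
    centers = np.asarray(centers, dtype=float)
    if centers.shape[0] < 2:
        return {"avg_nn_distance": 0.0, "std_nn_distance": 0.0}
    # Pairwise differences via broadcasting: shape (n, n, 2).
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Mask the zero self-distances so they are never picked as the minimum.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.min(axis=1)
    return {
        "avg_nn_distance": float(nearest.mean()),
        "std_nn_distance": float(nearest.std()),
    }
```

For the 26 circles in this experiment the speed difference is negligible; the vectorized form mainly makes the definition easier to read and check.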
|
|
| ### 1.4 Analysis Log File |
|
|
| **File**: `gen_9/results/eval_agent_memory/EVAL_AGENTS.md` |
|
|
| ```markdown |
| # Evaluation Agent Memory |
| |
| ## Generation 9 Auxiliary Metrics |
| |
| ### Designed Auxiliary Metrics: |
| |
| 1. **`auxiliary_radius_std_dev` (Radius Standard Deviation)** |
| - **Rationale:** The primary metric only considers the total sum of radii. |
| This metric provides insight into the *distribution* of those radii. |
| - **Expected Behavior:** A lower std dev suggests more uniform circles. |
| |
| 2. **`auxiliary_avg_nn_distance` (Average Nearest Neighbor Distance)** |
| - **Rationale:** Provides insight into spatial arrangement and density |
| beyond just total radius. |
| |
| ### Results for Generation 9: |
| - `combined_score`: 1.9814039364070457 |
| - `auxiliary_radius_std_dev`: 0.030866 |
| - `auxiliary_avg_nn_distance`: 0.145581 |
| - `auxiliary_std_nn_distance`: 0.054509 |
| |
| ### Diagnostics: |
| - The low `auxiliary_radius_std_dev` (0.030866) suggests uniform radii. |
| - The `auxiliary_avg_nn_distance` (0.145581) gives a sense of circle proximity. |
| |
| ### Recommendations: |
| - **Trend Analysis:** Track these auxiliary metrics over generations |
| - **Correlation with Primary Score:** Investigate correlations |
| - **Visualize Packings:** Visualize extreme values |
| ``` |
|
|
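| **Key points**: |
| - 📝 The agent records its **design rationale, expected behavior, actual results, and diagnostics** |
| - 📝 This file is the agent's **persistent memory**, which later generations can consult |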
|
|
| --- |
|
|
| ## 🔧 Part 2: How ShinkaEvolve Uses These Metrics |
|
|
| ### 2.1 Current status: **dynamically generated metrics are not used** ❌ |
|
|
| **Core issue**: a code search shows: |
| ```bash |
| # Search under shinka/ for "auxiliary" or "aux_" |
| grep -r "auxiliary\|aux_" shinka/ |
| # Result: no matches ❌ |
| ``` |
|
|
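| **Reasons**: |
| 1. ShinkaEvolve's evaluation wrapper (`shinka/core/wrap_eval.py`) only calls the standard `aggregate_metrics_fn` |
| 2. There is no mechanism that automatically imports and calls `eval_agent_memory/auxiliary_metrics.py` |
| 3. The dynamically generated metrics are used **only for the agent's analysis** and do not affect the evolution process |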
|
|
| ### 2.2 The Existing Auxiliary Metrics System (Manual) ✅ |
|
|
| **Although the dynamic metrics are not used, a manual auxiliary evaluation system already exists**: |
|
|
| #### File layout: |
| ``` |
| examples/circle_packing/ |
| ├── evaluate.py # Ground truth (PRIMARY METRIC) |
| ├── auxiliary_eval.py # Predefined auxiliary metrics |
| ├── evaluate_with_auxiliary.py # Wrapper evaluator |
| └── AUXILIARY_EVAL_README.md |
| ``` |
|
|
| #### The manual auxiliary metrics system: |
|
|
| **`auxiliary_eval.py`** contains 7 predefined metrics: |
| |
| ```python |
| class AuxiliaryEvaluator: |
|     def evaluate(self, centers, radii, primary_score): |
|         # 1. Spatial Uniformity (Voronoi analysis) |
|         # 2. Edge Utilization (boundary usage) |
|         # 3. Density Variance (grid-based density) |
|         # 4. Packing Efficiency (area ratio) |
|         # 5. Radius Distribution (entropy) |
|         # 6. Gap Analysis (uncovered areas) |
|         # 7. Geometric Quality (Delaunay triangulation) |
|         pass |
| ``` |
| |
| **Usage**: |
| |
| ```bash |
| # Option 1: enable it in the experiment configuration (if ShinkaEvolve supports this) |
| python run.py --evaluator evaluate_with_auxiliary.py |
| 
| # Option 2: analyze existing results manually |
| python evaluate_with_auxiliary.py \ |
|     --program_path gen_42/main.py \ |
|     --results_dir gen_42/results |
| ``` |
| |
| #### Storage format for auxiliary metrics: |
| |
| ```json |
| // gen_X/results/metrics.json |
| { |
| "combined_score": 2.34, // ← PRIMARY (ground truth) |
| "public": { |
| "num_circles": 26, |
| // Auxiliary metrics (if enabled): |
| "aux_spatial_uniformity": 0.85, |
| "aux_edge_utilization": 0.72, |
| "aux_density_variance": 0.91, |
| "aux_packing_efficiency": 0.78, |
| "aux_radius_distribution": 0.65, |
| "aux_gap_coverage": 0.88, |
| "aux_geometric_quality": 0.79 |
| }, |
| "private": {...} |
| } |
| ``` |
| |
| ### 2.3 How Metrics Are Accessed |
| 
| **How ShinkaEvolve reads metrics**: |
| |
| ```python |
| # shinka/core/runner.py |
| def _process_completed_job(self, job: RunningJob): |
|     # 1. Read the evaluation results |
|     metrics_file = f"{job.results_dir}/metrics.json" |
|     with open(metrics_file) as f: |
|         metrics = json.load(f) |
| 
|     # 2. Extract the primary score |
|     combined_score = metrics["combined_score"] |
| 
|     # 3. Store it in the database |
|     db_program = DBProgram( |
|         id=job.job_id, |
|         generation=job.generation, |
|         combined_score=combined_score,  # ← PRIMARY |
|         public_metrics=metrics.get("public", {}),  # ← includes auxiliary metrics |
|         private_metrics=metrics.get("private", {}), |
|         # ... |
|     ) |
|     self.db.add(db_program) |
| ``` |
| |
| **Key points**: |
| - ✅ ShinkaEvolve **does store** every auxiliary metric found in `public_metrics` |
| - ✅ These metrics are written to the SQLite database |
| - ❌ However, **evolution decisions** are based only on `combined_score` (the primary metric) |
| - ❓ The LLM agent **may see** auxiliary metrics (via `public_metrics`) when it generates new code |
|
|
| --- |
|
|
| ## 🔗 Part 3: Complete Data Flow |
|
|
| ### 3.1 Full Flow for a Single Generation |
|
|
| ``` |
| 1. ShinkaEvolve generates code |
|    └─> gen_42/main.py |
| 
| 2. Run evaluation (evaluate.py or evaluate_with_auxiliary.py) |
|    ├─> Run main.py::run_packing() |
|    ├─> Validate constraints (no overlaps, inside the boundary) |
|    ├─> Compute primary score = sum(radii) |
|    ├─> [Optional] Compute auxiliary metrics |
|    └─> Save gen_42/results/metrics.json |
|        { |
|          "combined_score": 2.34,   ← PRIMARY (drives evolution) |
|          "public": { |
|            "num_circles": 26, |
|            "aux_*": ...            ← AUXILIARY (informational) |
|          } |
|        } |
| 
| 3. ShinkaEvolve reads the results |
|    ├─> Read metrics.json |
|    ├─> Extract combined_score → decide whether this is a "better solution" |
|    ├─> Save to the database (including public_metrics) |
|    └─> **Evolution decisions are based only on combined_score** |
| 
| 4. [In parallel] Notify the Eval Service |
|    └─> HTTP POST /api/v1/notify/generation_complete |
|        { |
|          "generation": 42, |
|          "primary_score": 2.34, |
|          "results_dir": "<experiment_root>" |
|        } |
| 
| 5. [Asynchronous] The Eval Service decides |
|    ├─> Check: trigger the agent? |
|    └─> YES → launch IntegratedEV2Agent |
|        ├─> Analyze the history from gen_0 to gen_42 |
|        ├─> Design new auxiliary metrics |
|        ├─> Generate auxiliary_metrics.py |
|        ├─> Save EVAL_AGENTS.md |
|        └─> [Currently] ShinkaEvolve does not use these files automatically |
| ``` |
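
Step 4 of the flow above can be sketched with the standard library alone. The endpoint path and the three payload fields come from this document; the base URL (host and port) and the helper name `build_notification` are assumptions made for illustration. The function only builds the request, so it can be inspected without a running service.

```python
import json
import urllib.request


def build_notification(base_url: str, generation: int,
                       primary_score: float, results_dir: str) -> urllib.request.Request:
    """Build (but do not send) the generation-complete notification."""
    payload = {
        "generation": generation,
        "primary_score": primary_score,
        "results_dir": results_dir,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/notify/generation_complete",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=5)` and catch `urllib.error.URLError`, so that an unavailable Eval Service does not stall the evolution loop.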
|
|
| ### 3.2 The Current Gap |
|
|
| ``` |
| ┌─────────────────────────────────────────────┐ |
| │ Metrics generated by the Eval Agent         │ |
| │ (auxiliary_metrics.py)                      │ |
| │ • Adapt to the current evolution stage      │ |
| │ • Novel metrics designed by the LLM         │ |
| │ • Stored in eval_agent_memory/              │ |
| └─────────────────────────────────────────────┘ |
| ❌ no bridge |
| ┌─────────────────────────────────────────────┐ |
| │ ShinkaEvolve Evolution Loop                 │ |
| │ • Uses only metrics returned by evaluator   │ |
| │ • Decisions based on combined_score         │ |
| │ • Does not import dynamic code              │ |
| └─────────────────────────────────────────────┘ |
| ``` |
|
|
| --- |
|
|
| ## 💡 Part 4: Possible Integration Options |
|
|
| ### Option A: Dynamically Import Agent-Generated Metrics |
|
|
| ```python |
| # Add to evaluate_with_auxiliary.py: |
| import importlib.util |
| from pathlib import Path |
| 
| def load_dynamic_metrics(results_dir: str): |
|     """Load dynamically generated metrics from eval agent.""" |
|     aux_metrics_path = Path(results_dir) / "eval_agent_memory" / "auxiliary_metrics.py" |
| 
|     if not aux_metrics_path.exists(): |
|         return None |
| 
|     # Import the generated module from its file path |
|     spec = importlib.util.spec_from_file_location("dynamic_aux", aux_metrics_path) |
|     module = importlib.util.module_from_spec(spec) |
|     spec.loader.exec_module(module) |
| 
|     # Assumes the module exposes the standard entry point |
|     if hasattr(module, 'evaluate_auxiliary_metrics'): |
|         return module.evaluate_auxiliary_metrics |
| 
|     return None |
| 
| # Call during evaluation: |
| dynamic_eval_fn = load_dynamic_metrics(results_dir) |
| if dynamic_eval_fn: |
|     dynamic_metrics = dynamic_eval_fn(centers, radii) |
|     metrics["public"].update(dynamic_metrics) |
| ``` |
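
Because the imported module is LLM-generated, calling it directly inside the evaluator means a bug in the generated code could fail the whole evaluation. A minimal defensive wrapper, sketched under the assumption that the entry point takes `(centers, radii)` and returns a dict, might look like this. The name `safe_dynamic_metrics` and the `dyn_` key prefix are hypothetical. Note that this only guards against exceptions and malformed return values; it does not sandbox the generated code.

```python
import math
from typing import Callable, Dict, Optional


def safe_dynamic_metrics(eval_fn: Optional[Callable[..., dict]], centers, radii,
                         prefix: str = "dyn_") -> Dict[str, float]:
    """Call an agent-generated metric function without letting it break evaluation."""
    if eval_fn is None:
        return {}
    try:
        raw = eval_fn(centers, radii)
    except Exception:
        return {}  # generated code is untrusted; its failure must not fail the evaluation
    if not isinstance(raw, dict):
        return {}
    cleaned: Dict[str, float] = {}
    for name, value in raw.items():
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # drop non-numeric entries
        if math.isfinite(number):
            # Prefix keys so dynamic metrics cannot overwrite existing ones.
            cleaned[prefix + str(name)] = number
    return cleaned
```

With this wrapper, the call site would become `metrics["public"].update(safe_dynamic_metrics(dynamic_eval_fn, centers, radii))`.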
|
|
| ### Option B: The Agent Updates the Evaluator Configuration Directly |
|
|
| ```python |
| # The agent generates auxiliary_config.json |
| { |
|     "enabled_metrics": [ |
|         "spatial_uniformity", |
|         "radius_std_dev",         # ← important metric newly identified by the agent |
|         "nearest_neighbor_dist"   # ← important metric newly identified by the agent |
|     ], |
|     "metric_weights": { |
|         "spatial_uniformity": 0.3, |
|         "radius_std_dev": 0.4, |
|         "nearest_neighbor_dist": 0.3 |
|     } |
| } |
| 
| # evaluate_with_auxiliary.py reads this configuration |
| config = AuxiliaryEvalConfig.from_json("eval_agent_memory/auxiliary_config.json") |
| ``` |
|
|
| ### Option C: The Agent as a Meta-Evaluator |
|
|
| ```python |
| # The agent periodically generates an evaluation report |
| # eval_agent_memory/evaluation_report.json |
| { |
|     "generation": 42, |
|     "primary_score": 2.34, |
|     "stage_diagnosis": "plateau",  # the agent's diagnosis |
|     "recommended_focus": [ |
|         "Improve corner utilization", |
|         "Reduce radius variance", |
|         "Explore hexagonal patterns" |
|     ], |
|     "auxiliary_scores": { |
|         "uniformity": 0.85, |
|         "efficiency": 0.78 |
|     } |
| } |
| 
| # ShinkaEvolve's mutation agent reads this report |
| # and adjusts its mutation strategy |
| ``` |
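
One simple way the mutation side could consume such a report is to render it as a short text block appended to the mutation prompt. The sketch below assumes the report fields shown above; `report_to_prompt_hint` is a hypothetical helper, and how ShinkaEvolve actually assembles its mutation prompts is not covered in this document.

```python
from typing import Any, Dict


def report_to_prompt_hint(report: Dict[str, Any], max_items: int = 3) -> str:
    """Render an agent evaluation report as a short text block for a mutation prompt."""
    lines = [
        f"Evaluation report for generation {report.get('generation', '?')}:",
        f"- Stage diagnosis: {report.get('stage_diagnosis', 'unknown')}",
    ]
    # Cap the number of suggestions so the hint stays short.
    for focus in report.get("recommended_focus", [])[:max_items]:
        lines.append(f"- Suggested focus: {focus}")
    for name, score in sorted(report.get("auxiliary_scores", {}).items()):
        lines.append(f"- {name}: {score:.2f}")
    return "\n".join(lines)
```

Keeping the hint as plain text means the evolution loop needs no schema changes; it only has to read one extra file.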
|
|
| --- |
|
|
| ## 📋 Part 5: Summary |
|
|
| ### Current Implementation Status |
|
|
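| | Component | Status | Notes | |
| |------|------|------| |
| | **Eval Service** | ✅ Implemented | Receives notifications and triggers the agent | |
| | **Agent-generated metrics** | ✅ Implemented | Dynamically creates auxiliary_metrics.py | |
| | **Agent analysis log** | ✅ Implemented | EVAL_AGENTS.md as persistent memory | |
| | **Manual auxiliary system** | ✅ Implemented | auxiliary_eval.py (7 predefined metrics) | |
| | **ShinkaEvolve using dynamic metrics** | ❌ Not implemented | No automatic import mechanism | |
| | **Path issue** | ✅ Fixed | eval_agent_memory is now in the correct location | |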
| |
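| ### Key Findings |
| 
| 1. **Two auxiliary metrics systems**: |
|    - **Dynamic system** (generated by the eval agent): not used by evolution, analysis only |
|    - **Predefined system** (auxiliary_eval.py): can be enabled manually |
|
|
| 2. **Evolution decisions**: |
|    - Based entirely on `combined_score` (the primary metric) |
|    - Auxiliary metrics are stored in the database only as **observational signals** |
|
|
| 3. **Value of the agent**: |
|    - Currently used mainly for **offline analysis** and **generating insights** |
|    - The generated code requires **human review and integration** |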
|
|
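| ### Recommended Next Steps |
|
|
| 1. **Short term** (use dynamic metrics automatically): |
|    - Modify `evaluate_with_auxiliary.py` to support dynamic import |
|    - Enable auxiliary evaluation in the experiment configuration |
|
|
| 2. **Medium term** (closed-loop integration): |
|    - Agent-generated insights → mutation prompts |
|    - Agent diagnosis → adaptive strategy adjustment |
|
|
| 3. **Long term** (fully autonomous evaluation): |
|    - The agent designs and tests new metrics automatically |
|    - Metrics are incorporated into evolution decisions automatically |
|    - Multi-objective optimization (primary + weighted auxiliary) |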
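
The last long-term item, "primary + weighted auxiliary", can be illustrated with a simple weighted scalarization. This is a sketch of one possible design, not something the current system does: `blended_score` and `aux_share` are hypothetical names, and the sketch assumes the auxiliary metrics have already been normalized to a scale comparable with the primary score, which the raw metrics in this document are not guaranteed to be.

```python
from typing import Dict


def blended_score(primary: float, auxiliary: Dict[str, float],
                  weights: Dict[str, float], aux_share: float = 0.1) -> float:
    """Blend the primary score with a weighted average of auxiliary metrics."""
    if not 0.0 <= aux_share <= 1.0:
        raise ValueError("aux_share must be in [0, 1]")
    # Use only metrics that are both weighted (positively) and actually present.
    used = {k: w for k, w in weights.items() if k in auxiliary and w > 0}
    total = sum(used.values())
    if total == 0:
        return primary  # nothing to blend: fall back to the primary score
    aux_avg = sum(auxiliary[k] * w for k, w in used.items()) / total
    return (1.0 - aux_share) * primary + aux_share * aux_avg
```

Keeping `aux_share` small preserves the primary metric as the dominant signal, which matters here because the primary evaluator is treated as fixed ground truth.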
|
|