File size: 25,951 Bytes
1556404
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
# Eval Service 重新设计方案(通用 + 职责清晰)

## 🎯 核心问题

### 当前设计的问题
1. **❌ 过于针对 circle packing**: 假设了 centers/radii 特定数据格式
2. **❌ 职责不清**: ShinkaEvolve 负责评估,Eval Service 只是"旁观者"
3. **❌ 重复工作**: 如果要运行 auxiliary metrics,需要重新加载程序输出

### 理想设计
1. **✅ 任务无关**: 适用于任何优化问题(TSP, circuit design, code optimization, etc.)
2. **✅ 职责清晰**: Eval Service **完全负责**评估(primary + auxiliary)
3. **✅ 一次性完成**: 一次运行完成所有评估,避免重复

---

## 🏗️ 重新设计的架构

### 新的职责划分

```
┌─────────────────────────────────────────────────────────┐
│              ShinkaEvolve (Evolution Loop)              │
│  职责:                                                  │
│  • 生成新代码 (mutation/crossover)                       │
│  • 管理演化流程 (parent selection, population, etc.)    │
│  • 存储结果到 database                                   │
│  • 决策下一代的演化策略                                  │
│                                                          │
│  不负责:❌ 运行和评估程序                               │
└─────────────────────────────────────────────────────────┘
                    ↓ 提交代码
┌─────────────────────────────────────────────────────────┐
│            Eval Service (Evaluation Service)            │
│  职责:                                                  │
│  • 接收待评估的程序代码                                  │
│  • 运行 PRIMARY evaluator (ground truth)                │
│  • 运行 AUXILIARY evaluators (if available)             │
│  • 生成完整的 metrics.json                               │
│  • [可选] 触发 agent 更新 auxiliary evaluators          │
│                                                          │
│  输出:完整的 evaluation results                         │
└─────────────────────────────────────────────────────────┘
                    ↓ 返回结果
┌─────────────────────────────────────────────────────────┐
│              ShinkaEvolve (继续演化)                     │
│  • 读取 metrics.json                                     │
│  • 提取 combined_score → 演化决策                       │
│  • [可选] 读取 auxiliary metrics → 用于 LLM prompt      │
└─────────────────────────────────────────────────────────┘
```

---

## 📊 通用接口设计

### 核心原则:Evaluator Contract(评估器契约)

**任何任务的 evaluator 必须遵循标准接口**:

```python
# Standard Evaluator Interface (任务无关)

def evaluate(
    program_path: str,
    results_dir: str,
    **kwargs
) -> EvaluationResult:
    """
    标准评估接口
    
    Args:
        program_path: 待评估程序的路径
        results_dir: 结果保存目录
        **kwargs: 任务特定的额外参数
    
    Returns:
        EvaluationResult: 包含 primary metrics, auxiliary metrics, metadata
    """
    pass


@dataclass
class EvaluationResult:
    """Standard evaluation result (任务无关)"""
    
    # PRIMARY (必须,决定演化)
    combined_score: float          # Ground truth score
    correct: bool                  # 是否通过验证
    error: Optional[str]           # 如果失败,错误信息
    
    # PUBLIC METRICS (可选,可见给 LLM)
    public_metrics: Dict[str, Any]
    
    # PRIVATE METRICS (可选,不可见给 LLM)
    private_metrics: Dict[str, Any]
    
    # AUXILIARY METRICS (可选,由 eval agent 动态生成)
    auxiliary_metrics: Dict[str, Any]
    auxiliary_definitions: Dict[str, MetricDefinition]
    
    # PROGRAM OUTPUT (可选,用于后续分析/可视化)
    program_output: Any            # 任务特定的输出
    
    # METADATA
    execution_time: float
    num_runs: int
    timestamp: str


@dataclass
class MetricDefinition:
    """Metric definition (任务无关)"""
    name: str
    description: str
    interpretation: str  # "higher_better", "lower_better", "neutral"
    unit: str
    formula: Optional[str]
    source: str  # "primary", "auxiliary_static", "auxiliary_dynamic"
```

---

## 🔧 Eval Service 重新实现

### 新的 API 设计

#### 1. Evaluation Request (同步/异步)

```python
# POST /api/v1/evaluate
{
  "program_path": "gen_42/main.py",          # 相对于 experiment root
  "results_dir": "gen_42/results",           # 相对于 experiment root
  "generation": 42,
  "experiment_root": "/path/to/results_...", # 实验根目录
  "evaluation_config": {
    "primary_evaluator": "examples/circle_packing/evaluate.py",
    "num_runs": 1,
    "timeout": 300,  # seconds
    "extra_args": {}
  },
  "auxiliary_config": {
    "enabled": true,
    "use_dynamic": true,  # 使用 agent 生成的 metrics
    "use_static": false,  # 使用预定义的 metrics
    "timeout": 10         # per metric
  }
}

# Response (如果异步)
{
  "status": "accepted",
  "job_id": "eval_job_42",
  "estimated_time": 15  # seconds
}

# 或者 (如果同步)
{
  "status": "completed",
  "evaluation_result": {...}  # EvaluationResult
}
```

#### 2. Query Evaluation Result

```python
# GET /api/v1/evaluate/{job_id}
{
  "status": "completed",  # "pending", "running", "completed", "failed"
  "evaluation_result": {
    "combined_score": 2.34,
    "correct": true,
    "error": null,
    "public_metrics": {
      "num_circles": 26,
      "aux_radius_std_dev": 0.031,
      "aux_spatial_uniformity": 0.82
    },
    "private_metrics": {...},
    "auxiliary_metrics": {...},
    "auxiliary_definitions": {...},
    "program_output": {...},  # task-specific
    "execution_time": 12.5,
    "timestamp": "2026-02-03T04:00:00Z"
  }
}
```

---

## 🔄 完整流程(新架构)

### Step 1: ShinkaEvolve 提交评估请求

```python
# shinka/core/runner.py

def _submit_new_job(self):
    current_gen = self.next_generation_to_submit
    
    exec_fname = f"{self.results_dir}/gen_{current_gen}/main.py"
    results_dir = f"{self.results_dir}/gen_{current_gen}/results"
    
    # 创建目录
    Path(results_dir).mkdir(parents=True, exist_ok=True)
    
    # ✅ 新方案:提交到 Eval Service (如果配置)
    if self.eval_service_url:
        job_id = self._submit_to_eval_service(
            exec_fname=exec_fname,
            results_dir=results_dir,
            generation=current_gen
        )
    else:
        # 兼容旧方案:使用 local scheduler
        job_id = self.scheduler.submit_async(exec_fname, results_dir)
    
    # 跟踪 job
    running_job = RunningJob(
        job_id=job_id,
        exec_fname=exec_fname,
        results_dir=results_dir,
        generation=current_gen,
        ...
    )
    self.running_jobs.append(running_job)


def _submit_to_eval_service(
    self,
    exec_fname: str,
    results_dir: str,
    generation: int
) -> str:
    """Submit evaluation request to Eval Service"""
    
    import requests
    
    payload = {
        "program_path": exec_fname,
        "results_dir": results_dir,
        "generation": generation,
        "experiment_root": str(self.results_dir),
        "evaluation_config": {
            "primary_evaluator": self.job_config.eval_program_path,
            "num_runs": 1,
            "timeout": 300,
            "extra_args": self.job_config.extra_cmd_args or {}
        },
        "auxiliary_config": {
            "enabled": True,
            "use_dynamic": True,
            "timeout": 10
        }
    }
    
    response = requests.post(
        f"{self.eval_service_url}/api/v1/evaluate",
        json=payload,
        timeout=5.0  # 快速提交
    )
    
    if response.status_code == 200:
        data = response.json()
        return data["job_id"]  # 返回 eval service 的 job_id
    else:
        raise RuntimeError(f"Eval service submission failed: {response.status_code}")


def _check_completed_jobs(self) -> List[RunningJob]:
    """Check for completed jobs"""
    completed = []
    still_running = []
    
    for job in self.running_jobs:
        if self.eval_service_url:
            # ✅ 新方案:查询 eval service
            is_complete, result = self._query_eval_service(job.job_id)
            if is_complete:
                # 结果已经保存在 job.results_dir/metrics.json
                completed.append(job)
            else:
                still_running.append(job)
        else:
            # 兼容旧方案:使用 local scheduler
            is_running = self.scheduler.check_job_status(job)
            if not is_running:
                completed.append(job)
            else:
                still_running.append(job)
    
    self.running_jobs = still_running
    return completed


def _query_eval_service(self, job_id: str) -> Tuple[bool, Optional[Dict]]:
    """Query evaluation result from Eval Service"""
    
    import requests
    
    try:
        response = requests.get(
            f"{self.eval_service_url}/api/v1/evaluate/{job_id}",
            timeout=2.0
        )
        
        if response.status_code == 200:
            data = response.json()
            if data["status"] == "completed":
                return True, data["evaluation_result"]
            elif data["status"] == "failed":
                return True, None  # Failed but done
            else:
                return False, None  # Still running
        else:
            return False, None
    
    except Exception as e:
        logger.warning(f"Failed to query eval service: {e}")
        return False, None
```

### Step 2: Eval Service 处理评估请求

```python
# eval_agent/ev2_service_standalone.py

@app.post("/api/v1/evaluate")
async def evaluate_program(
    request: EvaluationRequest,
    background_tasks: BackgroundTasks
):
    """
    Main evaluation endpoint
    
    Handles both primary and auxiliary evaluation in one place.
    """
    # 生成 job_id
    job_id = f"eval_{request.generation}_{int(time.time())}"
    
    # 创建 job tracking
    eval_jobs[job_id] = {
        "status": "pending",
        "request": request,
        "started_at": time.time(),
        "result": None
    }
    
    # 启动后台评估任务
    background_tasks.add_task(
        run_full_evaluation,
        job_id=job_id,
        request=request
    )
    
    # 立即返回 (异步)
    return {
        "status": "accepted",
        "job_id": job_id,
        "estimated_time": 15
    }


async def run_full_evaluation(job_id: str, request: EvaluationRequest):
    """
    Complete evaluation pipeline (primary + auxiliary)
    
    This runs in the background.
    """
    logger.info(f"=" * 60)
    logger.info(f"🔄 Starting Evaluation: {job_id}")
    logger.info(f"=" * 60)
    
    try:
        eval_jobs[job_id]["status"] = "running"
        
        # ===== Step 1: Run Primary Evaluator =====
        logger.info("📊 Step 1: Running primary evaluator...")
        
        primary_result = await run_primary_evaluator(request)
        
        if not primary_result["success"]:
            eval_jobs[job_id]["status"] = "failed"
            eval_jobs[job_id]["error"] = primary_result["error"]
            return
        
        logger.info(f"✅ Primary evaluation completed: score={primary_result['combined_score']:.4f}")
        
        # ===== Step 2: Load Auxiliary Evaluators (if enabled) =====
        auxiliary_results = {}
        auxiliary_definitions = {}
        
        if request.auxiliary_config.enabled:
            logger.info("🔧 Step 2: Running auxiliary evaluators...")
            
            # 2a. Dynamic metrics (agent-generated)
            if request.auxiliary_config.use_dynamic:
                dynamic_results = await run_dynamic_auxiliary(
                    experiment_root=request.experiment_root,
                    program_output=primary_result.get("program_output"),
                    timeout=request.auxiliary_config.timeout
                )
                if dynamic_results:
                    auxiliary_results.update(dynamic_results["metrics"])
                    auxiliary_definitions.update(dynamic_results["definitions"])
                    logger.info(f"✅ Dynamic auxiliary: {len(dynamic_results['metrics'])} metrics")
            
            # 2b. Static metrics (pre-defined)
            if request.auxiliary_config.use_static:
                static_results = await run_static_auxiliary(
                    experiment_root=request.experiment_root,
                    program_output=primary_result.get("program_output")
                )
                if static_results:
                    auxiliary_results.update(static_results["metrics"])
                    auxiliary_definitions.update(static_results["definitions"])
                    logger.info(f"✅ Static auxiliary: {len(static_results['metrics'])} metrics")
        
        # ===== Step 3: Merge and Save Results =====
        logger.info("💾 Step 3: Saving complete evaluation results...")
        
        complete_result = {
            "combined_score": primary_result["combined_score"],
            "correct": primary_result["correct"],
            "error": primary_result.get("error"),
            "public_metrics": {
                **primary_result.get("public_metrics", {}),
                **{f"aux_{k}": v for k, v in auxiliary_results.items()}
            },
            "private_metrics": primary_result.get("private_metrics", {}),
            "auxiliary_metric_definitions": auxiliary_definitions,
            "auxiliary_metadata": {
                "executed": len(auxiliary_results) > 0,
                "num_metrics_computed": len(auxiliary_results),
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            },
            "execution_time_mean": primary_result.get("execution_time", 0),
            "num_valid_runs": primary_result.get("num_valid_runs", 0),
            "all_validation_errors": primary_result.get("all_validation_errors", [])
        }
        
        # 保存到 metrics.json
        metrics_file = Path(request.experiment_root) / request.results_dir / "metrics.json"
        with open(metrics_file, 'w') as f:
            json.dump(complete_result, f, indent=2)
        
        # 保存 correct.json
        correct_file = Path(request.experiment_root) / request.results_dir / "correct.json"
        with open(correct_file, 'w') as f:
            json.dump({
                "correct": primary_result["correct"],
                "error": primary_result.get("error")
            }, f, indent=2)
        
        logger.info(f"✅ Results saved to: {metrics_file}")
        
        # ===== Step 4: Update Job Status =====
        eval_jobs[job_id]["status"] = "completed"
        eval_jobs[job_id]["result"] = complete_result
        
        logger.info(f"=" * 60)
        logger.info(f"✅ Evaluation Completed: {job_id}")
        logger.info(f"=" * 60)
    
    except Exception as e:
        logger.error(f"❌ Evaluation failed: {e}", exc_info=True)
        eval_jobs[job_id]["status"] = "failed"
        eval_jobs[job_id]["error"] = str(e)


async def run_primary_evaluator(request: EvaluationRequest) -> Dict[str, Any]:
    """
    Run the primary evaluator (task-specific)
    
    This calls the existing evaluate.py or equivalent.
    """
    import subprocess
    
    # 构建命令
    program_path = Path(request.experiment_root) / request.program_path
    results_dir = Path(request.experiment_root) / request.results_dir
    evaluator_path = request.evaluation_config.primary_evaluator
    
    cmd = [
        "python",
        evaluator_path,
        "--program_path", str(program_path),
        "--results_dir", str(results_dir)
    ]
    
    # 添加额外参数
    for key, value in request.evaluation_config.extra_args.items():
        if isinstance(value, bool):
            if value:
                cmd.append(f"--{key}")
        else:
            cmd.extend([f"--{key}", str(value)])
    
    # 运行评估器
    try:
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE
        )
        
        stdout, stderr = await asyncio.wait_for(
            process.communicate(),
            timeout=request.evaluation_config.timeout
        )
        
        if process.returncode != 0:
            return {
                "success": False,
                "error": f"Evaluator failed: {stderr.decode()}"
            }
        
        # 读取结果
        metrics_file = results_dir / "metrics.json"
        if not metrics_file.exists():
            return {
                "success": False,
                "error": "metrics.json not found"
            }
        
        with open(metrics_file) as f:
            metrics = json.load(f)
        
        # 尝试加载程序输出(用于 auxiliary evaluation)
        program_output = load_program_output(results_dir)
        
        return {
            "success": True,
            "combined_score": metrics["combined_score"],
            "correct": True,  # 如果执行到这里
            "public_metrics": metrics.get("public", {}),
            "private_metrics": metrics.get("private", {}),
            "execution_time": metrics.get("execution_time_mean", 0),
            "num_valid_runs": metrics.get("num_valid_runs", 0),
            "all_validation_errors": metrics.get("all_validation_errors", []),
            "program_output": program_output
        }
    
    except asyncio.TimeoutError:
        return {
            "success": False,
            "error": f"Evaluation timeout after {request.evaluation_config.timeout}s"
        }
    except Exception as e:
        return {
            "success": False,
            "error": f"Evaluation error: {str(e)}"
        }


async def run_dynamic_auxiliary(
    experiment_root: str,
    program_output: Any,
    timeout: int
) -> Optional[Dict[str, Any]]:
    """
    Run dynamically generated auxiliary metrics (任务无关)
    
    Key: This works with ANY task by using the generic program_output
    """
    metrics_file = Path(experiment_root) / "eval_agent_memory" / "auxiliary_metrics.py"
    
    if not metrics_file.exists():
        return None
    
    try:
        # 动态导入
        spec = importlib.util.spec_from_file_location("dynamic_aux", metrics_file)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        
        # 查找标准接口
        if not hasattr(module, 'evaluate_auxiliary_metrics'):
            logger.warning("No 'evaluate_auxiliary_metrics' function found")
            return None
        
        eval_fn = module.evaluate_auxiliary_metrics
        
        # ✅ 关键:任务无关的调用
        # program_output 可以是任何格式(dict, tuple, object, etc.)
        # 由 auxiliary_metrics.py 自己知道如何处理
        
        aux_results = await asyncio.wait_for(
            asyncio.to_thread(eval_fn, program_output),
            timeout=timeout
        )
        
        # 提取定义
        definitions = {}
        if hasattr(module, 'METRIC_DEFINITIONS'):
            definitions = module.METRIC_DEFINITIONS
        
        return {
            "metrics": aux_results,
            "definitions": definitions
        }
    
    except asyncio.TimeoutError:
        logger.warning(f"Auxiliary metrics timeout after {timeout}s")
        return None
    except Exception as e:
        logger.error(f"Auxiliary metrics failed: {e}", exc_info=True)
        return None


def load_program_output(results_dir: Path) -> Any:
    """
    Load program output (任务无关)
    
    尝试多种格式,返回最通用的表示
    """
    # 1. 尝试 extra.npz (numpy arrays)
    npz_file = results_dir / "extra.npz"
    if npz_file.exists():
        try:
            data = np.load(npz_file)
            return {k: data[k] for k in data.files}
        except:
            pass
    
    # 2. 尝试 extra.pkl (pickle)
    pkl_file = results_dir / "extra.pkl"
    if pkl_file.exists():
        try:
            import pickle
            with open(pkl_file, 'rb') as f:
                return pickle.load(f)
        except:
            pass
    
    # 3. 尝试 extra.json (json)
    json_file = results_dir / "extra.json"
    if json_file.exists():
        try:
            with open(json_file) as f:
                return json.load(f)
        except:
            pass
    
    # 4. 从 metrics.json 提取
    metrics_file = results_dir / "metrics.json"
    if metrics_file.exists():
        try:
            with open(metrics_file) as f:
                metrics = json.load(f)
            # 返回 public + private 作为 output
            return {
                "public": metrics.get("public", {}),
                "private": metrics.get("private", {})
            }
        except:
            pass
    
    return None
```

---

## 🎨 Agent Prompt 的通用化

### 新的 Prompt 要求

```jinja2
You are an evaluation expert for optimization problems.

<TASK_CONTEXT>
Task: {{ task_name }}
Objective: {{ objective }}
Primary Metric: {{ primary_metric_name }} ({{ primary_metric_interpretation }})
</TASK_CONTEXT>

<YOUR_ROLE>
Design AUXILIARY metrics that complement the primary metric.
These metrics should:
1. Measure aspects NOT captured by primary metric
2. Work with the program output format
3. Be computationally efficient
</YOUR_ROLE>

<PROGRAM_OUTPUT_FORMAT>
The program output is available as a Python object with structure:
{{ program_output_structure }}

Example:
{{ program_output_example }}
</PROGRAM_OUTPUT_FORMAT>

<REQUIRED_INTERFACE>
You MUST create auxiliary_metrics.py with:

```python
def evaluate_auxiliary_metrics(program_output) -> dict:
    """
    Evaluate auxiliary metrics.
    
    Args:
        program_output: Program output (format depends on task)
            For circle packing: {"centers": ndarray, "radii": ndarray}
            For TSP: {"tour": list, "distance": float}
            For code: {"code": str, "ast": dict}
    
    Returns:
        dict: Auxiliary metrics
    """
    # Extract what you need from program_output
    # Task-specific logic here
    
    return {
        "metric_name_1": value1,
        "metric_name_2": value2,
    }


METRIC_DEFINITIONS = {
    "metric_name_1": {
        "name": "Human Readable Name",
        "description": "What this metric measures",
        "interpretation": "higher_better" | "lower_better" | "neutral",
        "unit": "...",
        "formula": "..."
    },
    ...
}
```
</REQUIRED_INTERFACE>

<EXAMPLES>
{% for example in auxiliary_examples %}
{{ example }}
{% endfor %}
</EXAMPLES>
```

---

## 📋 迁移路径

### Phase 1: 保持向后兼容 (1周)

**目标**: 支持新旧两种模式

```python
# Config option
class EvolutionConfig:
    eval_service_url: Optional[str] = None
    use_eval_service_for_evaluation: bool = False  # ← 新选项
```

- `use_eval_service_for_evaluation = False`: 使用旧方案(scheduler → evaluate.py)
- `use_eval_service_for_evaluation = True`: 使用新方案(eval service 负责)

### Phase 2: 实现新的 Eval Service API (2周)

- [ ] 实现 `/api/v1/evaluate` endpoint
- [ ] 实现 `run_full_evaluation` pipeline
- [ ] 实现通用的 `load_program_output`
- [ ] 测试多种任务格式

### Phase 3: 更新 Agent Prompt (1周)

- [ ] 通用化 prompt 模板
- [ ] 添加 task-specific examples
- [ ] 测试 agent 生成的代码

### Phase 4: 全面切换 (1周)

- [ ] 默认启用新模式
- [ ] 废弃旧 scheduler 调用方式
- [ ] 文档更新

---

## 🎯 成功标准

### 通用性验证

测试 3 个不同类型的任务:

1. **Circle Packing** (几何优化)
   - program_output: `{"centers": ndarray, "radii": ndarray}`
   - auxiliary metrics: spatial uniformity, radius distribution

2. **TSP** (组合优化)
   - program_output: `{"tour": list, "distance": float}`
   - auxiliary metrics: tour smoothness, cluster quality

3. **Code Optimization** (程序综合)
   - program_output: `{"code": str, "ast": dict, "runtime": float}`
   - auxiliary metrics: code complexity, readability, efficiency

如果所有3个任务都能用同样的架构,则验证通过 ✅

### 职责清晰性验证

- [ ] ShinkaEvolve 不直接调用 evaluate.py
- [ ] Eval Service 完全负责评估
- [ ] Primary 和 auxiliary 在同一次调用中完成
- [ ] 结果格式统一

---

## 📄 总结

### 关键改进

1. **✅ 任务无关**: 
   - 使用通用的 `program_output` 格式
   - Agent 适应不同任务的输出结构

2. **✅ 职责清晰**:
   - ShinkaEvolve: 演化逻辑
   - Eval Service: 完整评估(primary + auxiliary)

3. **✅ 一次完成**:
   - 一次运行得到所有 metrics
   - 避免重复加载程序输出

### 下一步

1. **立即开始**: 实现新的 `/api/v1/evaluate` API
2. **保持兼容**: 通过 config flag 支持新旧模式
3. **逐步迁移**: Phase 1-4 按顺序实施