# Eval Service Unified API Documentation

## 📋 Overview

`ev2_service_standalone.py` now exposes a **unified interface** that selects its operating mode automatically:

1. **Evaluation mode (new)**: provide `code_path` + `evaluator_module` → the Service runs the evaluation
2. **Notification mode (legacy)**: provide only `primary_score` → backward compatible with older code
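The automatic mode selection can be sketched as a small helper (hypothetical; `detect_mode` is illustrative and not part of the actual service code — only the request field names come from the API below):

```python
# Hypothetical sketch of how the service distinguishes its two modes.
# Field names match the API payloads; the function itself is illustrative.
from typing import Any, Dict

def detect_mode(payload: Dict[str, Any]) -> str:
    """Return "evaluation" when the request carries enough information
    for the service to run the evaluator itself, else "notification"."""
    if payload.get("code_path") and payload.get("evaluator_module"):
        return "evaluation"
    if "primary_score" in payload:
        return "notification"
    raise ValueError("payload matches neither mode")
```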
|
|
## 🚀 Starting the Service

```bash
python eval_agent/ev2_service_standalone.py \
    --results-dir /path/to/experiment \
    --primary-evaluator examples/circle_packing/evaluate.py \
    --trigger-mode periodic \
    --trigger-interval 10 \
    --port 8765
```

Or use a configuration file:

```bash
python eval_agent/ev2_service_standalone.py --config config.yaml
```
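A plausible shape for `config.yaml`, mirroring the CLI flags above (the key names are assumptions; check the script's argument parser for the authoritative ones):

```yaml
# Hypothetical config.yaml; key names mirror the CLI flags and are assumptions.
results_dir: /path/to/experiment
primary_evaluator: examples/circle_packing/evaluate.py
trigger_mode: periodic
trigger_interval: 10
port: 8765
```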
|
|
## 📡 API Endpoints

### 1. Generation Complete (unified entry point)

`POST /api/v1/notify/generation_complete`

#### Mode 1: Evaluation Mode (NEW)

**Request**:
```json
{
  "generation": 10,
  "code_path": "/path/to/gen_10/main.py",
  "results_dir": "/path/to/gen_10/results",
  "evaluator_module": "examples.circle_packing.evaluate_ori",
  "evaluator_function": "main",
  "evaluator_kwargs": {}
}
```
|
|
**Response** (returns immediately, < 100 ms):
```json
{
  "status": "accepted",
  "generation": 10,
  "job_id": "eval_10_1738512345",
  "estimated_time": 15.0,
  "agent_triggered": false,
  "trigger_reason": "Will be determined after evaluation",
  "processing_time_ms": 50.2
}
```

**Background execution**:
1. Run the primary evaluator
2. Run any auxiliary evaluators (if present)
3. Save `metrics.json`
4. Decide whether to trigger the Agent
5. If triggered: run the EV2 Agent analysis
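The five background steps can be sketched as one orchestration function (illustrative only; every callable is an injected stub, and the real service wires in its own evaluator runner, metrics writer, and trigger policy):

```python
# Illustrative sketch of the background pipeline described above.
# All callables are injected stubs, not the service's real components.
from typing import Any, Callable, Dict, List

def run_background_job(
    code_path: str,
    run_primary: Callable[[str], Dict[str, Any]],
    run_auxiliary: List[Callable[[str], Dict[str, Any]]],
    save_metrics: Callable[[Dict[str, Any]], None],
    should_trigger_agent: Callable[[Dict[str, Any]], bool],
    run_agent: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    primary = run_primary(code_path)             # step 1: primary evaluator
    auxiliary: Dict[str, Any] = {}
    for aux in run_auxiliary:                    # step 2: auxiliary evaluators
        auxiliary.update(aux(code_path))
    metrics = {"primary": primary, "auxiliary": auxiliary}
    save_metrics(metrics)                        # step 3: persist metrics.json
    if should_trigger_agent(metrics):            # step 4: trigger decision
        run_agent(metrics)                       # step 5: EV2 Agent analysis
    return metrics
```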
|
|
#### Mode 2: Notification Mode (LEGACY, backward compatible)

**Request**:
```json
{
  "generation": 10,
  "results_dir": "/path/to/gen_10/results",
  "primary_score": 0.85
}
```

**Response** (synchronous):
```json
{
  "status": "completed",
  "generation": 10,
  "job_id": null,
  "agent_triggered": false,
  "trigger_reason": "Not yet (last trigger at gen 0)",
  "processing_time_ms": 5.1
}
```
|
|
### 2. Query Generation Status

`GET /api/v1/generation/{generation}/status`

**Response**:
```json
{
  "generation": 10,
  "job_id": "eval_10_1738512345",
  "status": "running",  // "pending" | "running" | "completed" | "failed"
  "created_at": 1738512345.0,
  "elapsed_time": 5.2
}
```

If the evaluation has completed:
```json
{
  "generation": 10,
  "job_id": "eval_10_1738512345",
  "status": "completed",
  "created_at": 1738512345.0,
  "completed_at": 1738512360.0,
  "elapsed_time": 15.0,
  "result": {
    "combined_score": 0.85,
    "primary": {...},
    "auxiliary": {...},
    "timestamp": 1738512360.0
  }
}
```
|
|
### 3. Query Job Status

`GET /api/v1/evaluate/{job_id}`

**Response**: same as above.

### 4. Service Status

`GET /api/v1/status`

**Response**:
```json
{
  "status": "running",
  "uptime_seconds": 12345.6,
  "version": "2.0.0-standalone",
  "experiment": {
    "name": "circle-packing",
    "results_dir": "/path/to/experiment",
    "primary_evaluator": "examples/circle_packing/evaluate.py"
  },
  "statistics": {
    "total_notifications": 20,
    "total_agent_runs": 2,
    "generations_tracked": 20,
    "last_agent_trigger_gen": 10
  },
  "config": {
    "trigger_mode": "periodic",
    "trigger_interval": 10,
    "agent_enabled": true,
    "agent_initialized": true
  }
}
```
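For monitoring scripts, the payload above can be condensed into a one-line summary. The helper below is hypothetical (not part of the service); only the field names come from the sample response:

```python
# Hypothetical summary helper for the /api/v1/status payload shown above.
from typing import Any, Dict

def summarize_status(payload: Dict[str, Any]) -> str:
    stats = payload["statistics"]
    cfg = payload["config"]
    return (
        f"{payload['experiment']['name']}: "
        f"{stats['generations_tracked']} generations tracked, "
        f"{stats['total_agent_runs']} agent runs "
        f"(trigger every {cfg['trigger_interval']} gens)"
    )
```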
|
|
## 📊 Usage Examples

### Python Client

```python
import requests
import time

SERVICE_URL = "http://localhost:8765"

# Evaluation mode
def submit_evaluation(generation, code_path):
    response = requests.post(
        f"{SERVICE_URL}/api/v1/notify/generation_complete",
        json={
            "generation": generation,
            "code_path": code_path,
            "results_dir": f"/path/to/gen_{generation}/results",
            "evaluator_module": "examples.circle_packing.evaluate_ori",
            "evaluator_function": "main"
        },
        timeout=5.0
    )

    data = response.json()
    job_id = data['job_id']

    print(f"Submitted: job_id={job_id}")

    # Poll for status
    while True:
        status_response = requests.get(
            f"{SERVICE_URL}/api/v1/generation/{generation}/status"
        )

        status_data = status_response.json()

        if status_data['status'] == 'completed':
            result = status_data['result']
            return result['combined_score']
        elif status_data['status'] == 'failed':
            raise RuntimeError(f"Evaluation failed: {status_data.get('error')}")

        time.sleep(2)
```
|
|
### ShinkaEvolve Integration

Modify `shinka/core/runner.py`:

```python
class EvolutionConfig:
    eval_service_url: Optional[str] = None
    use_eval_service: bool = False
    evaluator_module: Optional[str] = None

class EvolutionRunner:
    def _submit_new_job(self):
        # ... generate code ...

        if self.eval_service_url and self.evo_config.use_eval_service:
            # Use the Eval Service
            job_id = self._submit_to_eval_service(
                generation=current_gen,
                code_path=str(exec_fname),
                results_dir=str(results_dir)
            )
        else:
            # Legacy path
            job_id = self.scheduler.submit_async(exec_fname, results_dir)

        running_job = RunningJob(
            job_id=job_id,
            use_eval_service=self.evo_config.use_eval_service,
            ...
        )

    def _submit_to_eval_service(self, generation, code_path, results_dir):
        response = requests.post(
            f"{self.eval_service_url}/api/v1/notify/generation_complete",
            json={
                "generation": generation,
                "code_path": code_path,
                "results_dir": results_dir,
                "evaluator_module": self.evo_config.evaluator_module
            },
            timeout=5.0
        )

        return response.json()['job_id']

    def _check_completed_jobs(self):
        completed = []
        for job in self.running_jobs:
            if job.use_eval_service:
                # Query the eval service
                response = requests.get(
                    f"{self.eval_service_url}/api/v1/generation/{job.generation}/status"
                )

                if response.json()['status'] == 'completed':
                    completed.append(job)
```
|
|
## 🧪 Testing

Run the test script:

```bash
# 1. Start the service
python eval_agent/ev2_service_standalone.py \
    --results-dir /tmp/test \
    --primary-evaluator examples/circle_packing/evaluate.py

# 2. Run the tests
python test_eval_service_unified.py
```

Test coverage:
- ✅ Service health check
- ✅ Notification mode (backward compatibility)
- ✅ Evaluation mode (async)
- ✅ Status queries (by generation and by job_id)
| |
## 🔧 Workflow

### Full Evaluation-Mode Flow

```
1. ShinkaEvolve generates code
   ↓
2. POST /api/v1/notify/generation_complete
   {
     generation: 10,
     code_path: "gen_10/main.py",
     evaluator_module: "examples.circle_packing.evaluate"
   }
   ↓
3. Immediate response (< 100 ms)
   {
     status: "accepted",
     job_id: "eval_10_..."
   }
   ↓
4. Eval Service runs in the background:
   - runs the primary evaluator → combined_score
   - runs auxiliary evaluators → {diversity, ...}
   - saves metrics.json
   - decides whether to trigger the Agent
   - if triggered: runs the EV2 Agent analysis
   ↓
5. ShinkaEvolve polls:
   GET /api/v1/generation/10/status
   → status: "running"
   → status: "running"
   → status: "completed", result: {...}
   ↓
6. ShinkaEvolve reads combined_score and moves on to the next generation
```
| |
## ⚙️ Configuration

### Evaluator Contract

Every task's evaluator must satisfy this contract:

```python
def evaluate(code_path: str, **kwargs) -> Dict[str, Any]:
    """
    Evaluator contract.

    Args:
        code_path: path to the generated code
        **kwargs: extra arguments

    Returns:
        {
            "combined_score": float,    # required
            "metrics": Dict[str, Any],  # optional
            "metadata": Dict[str, Any]  # optional
        }
    """
    # Run the code (run_code is a task-specific placeholder)
    result = run_code(code_path)

    # Compute the score (compute_score is a task-specific placeholder)
    score = compute_score(result)

    return {
        "combined_score": score,
        "metrics": {"coverage": 0.8},
        "metadata": {"num_items": 100}
    }
```
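As a concrete illustration, here is a minimal self-contained evaluator that satisfies the contract (a toy example not tied to any real task: it scores a Python file simply by whether it compiles):

```python
# Toy evaluator following the contract above: it "scores" a Python file
# by checking that it compiles, plus a trivial line-count metric.
from typing import Any, Dict

def evaluate(code_path: str, **kwargs) -> Dict[str, Any]:
    with open(code_path) as f:
        source = f.read()
    try:
        compile(source, code_path, "exec")
        score = 1.0
    except SyntaxError:
        score = 0.0
    return {
        "combined_score": score,
        "metrics": {"num_lines": len(source.splitlines())},
        "metadata": {"code_path": code_path},
    }
```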
|
|
### Auxiliary Metrics

Agent-generated auxiliary metrics live at:
```
experiment_root/
└── eval_agent_memory/
    └── auxiliary_metrics.py   # generated by the Agent
```

Convention: every function whose name starts with `evaluate_` is called automatically:

```python
def evaluate_diversity(code_path: str, primary_result: Dict) -> Dict[str, Any]:
    """Diversity metric."""
    return {"diversity_score": 0.7}

def evaluate_robustness(code_path: str, primary_result: Dict) -> Dict[str, Any]:
    """Robustness metric."""
    return {"robustness_score": 0.8}
```
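The auto-discovery convention could be implemented roughly as follows (a sketch; `collect_auxiliary_metrics` is illustrative, and the real service may load `auxiliary_metrics.py` from its file path via `importlib` rather than inspecting an already-imported module):

```python
# Sketch of the evaluate_* auto-discovery convention. `module` is any
# imported module object holding the auxiliary metric functions.
import inspect
from typing import Any, Dict

def collect_auxiliary_metrics(module, code_path: str,
                              primary_result: Dict[str, Any]) -> Dict[str, Any]:
    merged: Dict[str, Any] = {}
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("evaluate_"):
            # Each metric function returns a dict; merge them all.
            merged.update(fn(code_path, primary_result))
    return merged
```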
|
|
## 📝 Migration Guide

### Migrating from the Legacy Mode

**Old code** (ShinkaEvolve evaluates by itself):
```python
# 1. Generate code
# 2. Run the evaluation
combined_score = evaluate(code_path)
# 3. Notify the service
requests.post(url, json={
    "generation": gen,
    "primary_score": combined_score
})
```

**New code** (the Eval Service owns evaluation):
```python
# url is the service base URL, e.g. "http://localhost:8765"
# 1. Generate code
# 2. Submit to the service (non-blocking)
response = requests.post(f"{url}/api/v1/notify/generation_complete", json={
    "generation": gen,
    "code_path": code_path,
    "evaluator_module": "examples.task.evaluate"
})
job_id = response.json()['job_id']

# 3. Poll for status
while True:
    status = requests.get(f"{url}/api/v1/generation/{gen}/status")
    if status.json()['status'] == 'completed':
        combined_score = status.json()['result']['combined_score']
        break
    time.sleep(2)  # avoid busy-waiting
```
|
|
## 🎯 Advantages

1. **Unified interface**: one endpoint handles every case
2. **Automatic mode selection**: the mode is chosen from the request parameters
3. **Backward compatible**: existing code needs no changes
4. **Clear ownership**: evaluation is managed centrally by the Service
5. **Asynchronous**: requests return immediately and never block
6. **Concurrent**: multiple evaluations can run at the same time

## 📊 Performance

- Submit request: < 100 ms
- Evaluation run: 10-30 s (depends on the evaluator)
- Status query: < 10 ms
- Concurrency: evaluations for multiple generations can run simultaneously
|