| # Eval Service Redesign - Core Points |
|
|
| ## 🎯 Two Key Problems |
|
|
| ### 1. **Insufficient Generality** ❌ |
| **Original design**: |
| ```python |
| import os |
| import numpy as np |
| |
| # Hard-codes the circle-packing output format |
| def load_program_output(results_dir): |
|     data = np.load(os.path.join(results_dir, "extra.npz")) |
|     return data['centers'], data['radii']  # ❌ only works for circle packing |
| ``` |
|
|
| **New design** ✅: |
| ```python |
| # Task-agnostic generic interface |
| def evaluate_auxiliary_metrics(program_output) -> dict: |
|     """ |
|     Args: |
|         program_output: any format, for example |
|             - Circle packing: {"centers": ndarray, "radii": ndarray} |
|             - TSP: {"tour": list, "distance": float} |
|             - Code: {"code": str, "ast": dict, "runtime": float} |
|     """ |
|     # The agent-generated code adapts to each format itself |
|     pass |
| ``` |
|
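To make "the agent adapts to each format itself" concrete, here is a minimal sketch of what an agent-generated evaluator could look like. It is an illustration under assumptions, not the service's implementation: only `aux_radius_std_dev` appears in this document, while `aux_tour_length`, `aux_visits_unique`, and `aux_code_lines` are hypothetical metric names chosen for the example.

```python
import math
from typing import Any


def evaluate_auxiliary_metrics(program_output: Any) -> dict:
    """Return auxiliary metrics for whichever known format is detected."""
    if not isinstance(program_output, dict):
        return {}  # unknown format: report nothing rather than crash

    metrics = {}
    if "radii" in program_output:  # circle-packing-style output
        radii = [float(r) for r in program_output["radii"]]
        if radii:
            mean = sum(radii) / len(radii)
            variance = sum((r - mean) ** 2 for r in radii) / len(radii)
            metrics["aux_radius_std_dev"] = math.sqrt(variance)
    if "tour" in program_output:  # TSP-style output (hypothetical metrics)
        tour = list(program_output["tour"])
        metrics["aux_tour_length"] = float(len(tour))
        metrics["aux_visits_unique"] = float(len(set(tour)) == len(tour))
    if "code" in program_output:  # code-style output (hypothetical metric)
        metrics["aux_code_lines"] = float(len(program_output["code"].splitlines()))
    return metrics
```

The function inspects keys instead of assuming a task, so the same entry point can serve all three example formats and degrades to an empty dict on anything it does not recognize.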
|
| ### 2. **Unclear Responsibilities** ❌ |
| **Original design**: |
| ``` |
| ShinkaEvolve → scheduler → evaluate.py → metrics.json |
|                                               ↓ |
|                                      notify Eval Service |
|                                               ↓ |
|                   [bystander only: generates auxiliary_metrics.py] |
| ``` |
|
|
| **New design** ✅: |
| ``` |
| ShinkaEvolve → submit evaluation request → Eval Service |
|                                                  ↓ |
|                                         run primary evaluator |
|                                                  ↓ |
|                                         run auxiliary evaluators |
|                                                  ↓ |
|                                         save complete metrics.json |
|                                                  ↓ |
|                                         return results to ShinkaEvolve |
| ``` |
|
|
| --- |
|
|
| ## 🏗️ Core Design Principles |
|
|
| ### 1. **Clear Separation of Responsibilities** |
|
|
| | Component | Responsible for | Not responsible for | |
| |------|------|--------| |
| | **ShinkaEvolve** | • Generating code<br>• Evolution decisions<br>• Database management | ❌ Evaluating programs | |
| | **Eval Service** | • Running the primary evaluator<br>• Running auxiliary evaluators<br>• Producing the complete metrics | ❌ Evolution logic | |
|
|
| ### 2. **Generic Interface Contract** |
|
|
| **Evaluator Contract** (every task must follow it): |
| ```python |
| from typing import Any |
| |
| # Standard output format of the primary evaluator |
| { |
|     "combined_score": float,   # required |
|     "correct": bool,           # required |
|     "public_metrics": dict,    # optional |
|     "private_metrics": dict,   # optional |
|     "program_output": Any      # optional, task-specific |
| } |
| |
| # Standard interface of an auxiliary evaluator |
| def evaluate_auxiliary_metrics(program_output) -> dict: |
|     """Accepts a program_output in any format.""" |
|     pass |
| ``` |
|
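A contract is only useful if it is checked. The sketch below shows one way the service could validate a primary evaluator's result against the required and optional fields listed above. `validate_primary_result` and the wording of its error messages are hypothetical; the field names and types come from the contract.

```python
from typing import Any

REQUIRED_FIELDS = {"combined_score": (int, float), "correct": bool}
OPTIONAL_FIELDS = {"public_metrics": dict, "private_metrics": dict}


def validate_primary_result(result: Any) -> list:
    """Return a list of contract violations; an empty list means the result conforms."""
    if not isinstance(result, dict):
        return ["result must be a dict"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in result:
            errors.append(f"missing required field: {name}")
        elif isinstance(result[name], bool) and expected is not bool:
            # bool is a subclass of int, so reject it explicitly for numeric fields
            errors.append(f"{name} must be a number, not a bool")
        elif not isinstance(result[name], expected):
            errors.append(f"{name} has the wrong type")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in result and not isinstance(result[name], expected):
            errors.append(f"{name} must be a dict when present")
    return errors  # program_output is task-specific, so it is deliberately not checked
```

Returning a list of violations, instead of raising on the first one, lets the service report every problem with an evaluator in a single response.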
|
| ### 3. **Single-Pass Completion** |
|
|
| ```python |
| # New flow: a single API call completes the entire evaluation |
| POST /api/v1/evaluate |
|   ↓ |
| run_full_evaluation: |
|   1. run_primary_evaluator()  → primary metrics |
|   2. run_dynamic_auxiliary()  → auxiliary metrics (agent-generated) |
|   3. run_static_auxiliary()   → auxiliary metrics (pre-defined) |
|   4. merge_and_save()         → complete metrics.json |
| ``` |
|
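The four steps can be sketched as a single function. This is a minimal illustration under assumptions, not the actual pipeline: the evaluators are passed in as callables, auxiliary metrics are merged into `public_metrics` (as in the example response in this document), `program_output` is assumed to be an intermediate value that is not written to disk, and the `auxiliary_errors` field is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Callable, Optional

AuxEvaluator = Callable[[Any], dict]


def run_full_evaluation(
    run_primary: Callable[[], dict],
    results_dir: str,
    dynamic_aux: Optional[AuxEvaluator] = None,
    static_aux: Optional[AuxEvaluator] = None,
) -> dict:
    """Run primary, then auxiliary evaluators, and persist one merged metrics.json."""
    metrics = dict(run_primary())  # step 1: primary metrics
    # Assumption: program_output is only an intermediate value and is not saved.
    program_output = metrics.pop("program_output", None)
    public = dict(metrics.get("public_metrics", {}))

    for evaluator in (dynamic_aux, static_aux):  # steps 2 and 3
        if evaluator is None:
            continue
        try:
            public.update(evaluator(program_output))
        except Exception as exc:  # an auxiliary failure must not lose the primary score
            metrics.setdefault("auxiliary_errors", []).append(repr(exc))

    metrics["public_metrics"] = public
    out_path = Path(results_dir) / "metrics.json"  # step 4: merge and save
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(metrics, indent=2))
    return metrics
```

Catching exceptions around each auxiliary evaluator is a design choice for this sketch: agent-generated code is the least trusted part of the pipeline, so its failure is recorded while the primary result still reaches ShinkaEvolve.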
|
| --- |
|
|
| ## API Design |
|
|
| ### Submit an Evaluation (asynchronous) |
|
|
| ```python |
| # POST /api/v1/evaluate |
| { |
|     "program_path": "gen_42/main.py", |
|     "results_dir": "gen_42/results", |
|     "generation": 42, |
|     "experiment_root": "/path/to/experiment", |
|     "evaluation_config": { |
|         "primary_evaluator": "examples/circle_packing/evaluate.py", |
|         "timeout": 300 |
|     }, |
|     "auxiliary_config": { |
|         "enabled": true, |
|         "use_dynamic": true,   # agent-generated |
|         "use_static": false    # pre-defined |
|     } |
| } |
| |
| # Response |
| { |
|     "status": "accepted", |
|     "job_id": "eval_job_42" |
| } |
| ``` |
|
|
| ### Query the Result |
|
|
| ```python |
| # GET /api/v1/evaluate/{job_id} |
| { |
|     "status": "completed", |
|     "evaluation_result": { |
|         "combined_score": 2.34, |
|         "public_metrics": { |
|             "num_circles": 26, |
|             "aux_radius_std_dev": 0.031,      # ← auxiliary |
|             "aux_spatial_uniformity": 0.82    # ← auxiliary |
|         }, |
|         "auxiliary_metric_definitions": { |
|             "aux_radius_std_dev": { |
|                 "name": "Radius Standard Deviation", |
|                 "description": "...", |
|                 "interpretation": "lower_better" |
|             } |
|         } |
|     } |
| } |
| ``` |
|
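Because submission is asynchronous, the caller has to poll the query endpoint. The following is a minimal client-side sketch, assuming the HTTP call is wrapped in a `fetch_status` callable supplied by the caller. Only the `"accepted"` and `"completed"` statuses appear in this document; the `"failed"` status handled here is a hypothetical addition.

```python
import time
from typing import Callable


def wait_for_result(
    fetch_status: Callable[[str], dict],
    job_id: str,
    timeout: float = 300.0,
    interval: float = 1.0,
) -> dict:
    """Poll until the job reports 'completed' and return its evaluation_result."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(job_id)  # e.g. wraps GET /api/v1/evaluate/{job_id}
        status = payload.get("status")
        if status == "completed":
            return payload["evaluation_result"]
        if status == "failed":  # hypothetical status, not listed in this document
            raise RuntimeError(f"evaluation job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation job {job_id} still {status!r} after {timeout}s")
        time.sleep(interval)
```

Injecting `fetch_status` keeps the polling logic independent of any particular HTTP library and makes it easy to exercise with canned responses.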
|
| --- |
|
|
| ## Complete Data Flow |
|
|
| ``` |
| ┌──────────────────────────────────────────────┐ |
| │ ShinkaEvolve (Generation 42)                 │ |
| │ 1. Generate code: gen_42/main.py             │ |
| │ 2. Submit evaluation request → Eval Service  │ |
| │ 3. Poll status until completion              │ |
| └──────────────────────────────────────────────┘ |
|                       ↓ POST /api/v1/evaluate |
| ┌──────────────────────────────────────────────┐ |
| │ Eval Service (background job)                │ |
| │ 1. Run primary: evaluate.py                  │ |
| │    → combined_score, program_output          │ |
| │                                              │ |
| │ 2. Load auxiliary_metrics.py (if present)    │ |
| │                                              │ |
| │ 3. Run auxiliary:                            │ |
| │    evaluate_auxiliary_metrics(               │ |
| │        program_output                        │ |
| │    )                                         │ |
| │    → aux_radius_std_dev, aux_...             │ |
| │                                              │ |
| │ 4. Merge results, save metrics.json          │ |
| │    {                                         │ |
| │      "combined_score": 2.34,                 │ |
| │      "public_metrics": {                     │ |
| │        "num_circles": 26,                    │ |
| │        "aux_*": ...                          │ |
| │      },                                      │ |
| │      "auxiliary_metric_definitions": {       │ |
| │        ...                                   │ |
| │      }                                       │ |
| │    }                                         │ |
| └──────────────────────────────────────────────┘ |
|                       ↓ results ready |
| ┌──────────────────────────────────────────────┐ |
| │ ShinkaEvolve (continue evolution)            │ |
| │ 1. GET /api/v1/evaluate/{job_id}             │ |
| │ 2. Read combined_score                       │ |
| │ 3. Evolution decisions                       │ |
| │    (selection, crossover, mutation)          │ |
| │ 4. [Optional] auxiliary metrics → LLM        │ |
| └──────────────────────────────────────────────┘ |
| ``` |
|
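Step 2 of the Eval Service box, loading `auxiliary_metrics.py` only if it exists, can be sketched with the standard `importlib` machinery. `load_auxiliary_evaluator` is a hypothetical helper name, and this sketch executes the file in the current process; a real deployment would likely want to isolate agent-generated code.

```python
import importlib.util
from pathlib import Path
from typing import Any, Callable, Optional


def load_auxiliary_evaluator(module_path: str) -> Optional[Callable[[Any], dict]]:
    """Load evaluate_auxiliary_metrics from a file, or return None if unavailable."""
    path = Path(module_path)
    if not path.is_file():
        return None  # the step is conditional: no file means no auxiliary metrics
    spec = importlib.util.spec_from_file_location("auxiliary_metrics", path)
    if spec is None or spec.loader is None:
        return None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs agent-generated code in this process
    evaluator = getattr(module, "evaluate_auxiliary_metrics", None)
    return evaluator if callable(evaluator) else None
```

Returning `None` for both "file missing" and "function missing" lets the caller treat auxiliary evaluation as optional with a single check.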
|
| --- |
|
|
| ## 🎨 Generalizing the Agent Prompt |
|
|
| ### Old Prompt (circle-packing specific) |
| ``` |
| Analyze centers and radii... |
| ``` |
|
|
| ### New Prompt (task-agnostic) |
| ```jinja2 |
| <TASK_CONTEXT> |
| Task: {{ task_name }} |
| Primary Metric: {{ primary_metric_name }} |
| </TASK_CONTEXT> |
| |
| <PROGRAM_OUTPUT_FORMAT> |
| The program output structure: |
| {{ program_output_structure }} |
| |
| Example: |
| {{ program_output_example }} |
| </PROGRAM_OUTPUT_FORMAT> |
| |
| <REQUIRED_INTERFACE> |
| def evaluate_auxiliary_metrics(program_output) -> dict: |
|     """ |
|     Args: |
|         program_output: the format depends on the task |
|             Circle packing: {"centers": ndarray, "radii": ndarray} |
|             TSP: {"tour": list, "distance": float} |
|             Code: {"code": str, "ast": dict} |
| |
|     Returns: |
|         dict: Auxiliary metrics |
|     """ |
|     # Your code adapts to the format of program_output |
|     pass |
| </REQUIRED_INTERFACE> |
| ``` |
|
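The template needs a value for `program_output_structure`. One possible way to produce it, shown here as a sketch, is to summarize a sample output by type instead of dumping its data. `describe_structure` is a hypothetical helper and the summary format is an assumption, not something this document specifies.

```python
from typing import Any


def describe_structure(value: Any, depth: int = 0, max_depth: int = 2) -> str:
    """Summarize a program_output as nested type names, without dumping its data."""
    if isinstance(value, dict) and depth < max_depth:
        inner = ", ".join(
            f"{key!r}: {describe_structure(item, depth + 1, max_depth)}"
            for key, item in value.items()
        )
        return "{" + inner + "}"
    if isinstance(value, (list, tuple)):
        return f"{type(value).__name__}[{len(value)}]"
    shape = getattr(value, "shape", None)  # ndarray-like objects expose .shape
    if shape is not None:
        return f"{type(value).__name__}{tuple(shape)}"
    return type(value).__name__
```

For the TSP example, `describe_structure({"tour": [0, 3, 1, 4, 2], "distance": 142.5})` yields `{'tour': list[5], 'distance': float}`, which is compact enough to embed in a prompt.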
|
| --- |
|
|
| ## Implementation Path |
|
|
| ### Phase 1: Backward Compatibility (1 week) |
| ```python |
| class EvolutionConfig: |
|     use_eval_service_for_evaluation: bool = False  # new option |
| ``` |
| - `False`: use the old approach (scheduler → evaluate.py) |
| - `True`: use the new approach (the Eval Service handles evaluation) |
|
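The flag implies a single dispatch point where the old and new paths diverge. A minimal sketch, assuming both paths are available as callables: `evaluate_program`, `legacy_evaluate`, and `service_evaluate` are hypothetical names, and `EvolutionConfig` is reduced to the one field shown above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvolutionConfig:
    use_eval_service_for_evaluation: bool = False  # off by default in Phase 1


def evaluate_program(
    config: EvolutionConfig,
    program_path: str,
    legacy_evaluate: Callable[[str], dict],
    service_evaluate: Callable[[str], dict],
) -> dict:
    """Route one evaluation through the legacy or the service path, based on the flag."""
    if config.use_eval_service_for_evaluation:
        return service_evaluate(program_path)
    return legacy_evaluate(program_path)
```

Keeping the default at `False` means existing experiments behave exactly as before until they opt in.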
|
| ### Phase 2: Implement the New API (2 weeks) |
| - [ ] `/api/v1/evaluate` endpoint |
| - [ ] `run_full_evaluation()` pipeline |
| - [ ] Generic `load_program_output()` |
| - [ ] Generic `run_dynamic_auxiliary()` |
|
|
| ### Phase 3: Update the Agent (1 week) |
| - [ ] Generalize the prompt template |
| - [ ] Test on different tasks |
|
|
| ### Phase 4: Full Switchover (1 week) |
| - [ ] Enable the new mode by default |
| - [ ] Deprecate the old approach |
| - [ ] Update the documentation |
|
|
| --- |
|
|
| ## 🎯 Validation Criteria |
|
|
| ### Generality Test |
| Validate on 3 different tasks: |
|
|
| 1. **Circle Packing** (geometry) |
| ```python |
| program_output = {"centers": ndarray, "radii": ndarray} |
| ``` |
|
|
| 2. **TSP** (combinatorial) |
| ```python |
| program_output = {"tour": [0,3,1,4,2], "distance": 142.5} |
| ``` |
|
|
| 3. **Code Optimization** (program) |
| ```python |
| program_output = { |
|     "code": "def foo(): ...", |
|     "ast": {...}, |
|     "runtime": 0.05 |
| } |
| ``` |
|
|
| ✅ If the same architecture works for all 3 tasks → validation passes |
|
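The generality test can be automated by running one evaluator against a sample output from each task. The sketch below is an assumption-laden illustration: `check_generality` and `SAMPLE_OUTPUTS` are hypothetical names, plain lists stand in for `ndarray`, the circle-packing values are made up for the example, and "passes" is taken to mean "returns a dict of string keys and numeric values without raising".

```python
from numbers import Real
from typing import Any, Callable

# Illustrative sample outputs; lists stand in for ndarray values.
SAMPLE_OUTPUTS = {
    "circle_packing": {"centers": [[0.25, 0.25], [0.75, 0.75]], "radii": [0.25, 0.25]},
    "tsp": {"tour": [0, 3, 1, 4, 2], "distance": 142.5},
    "code_optimization": {"code": "def foo(): ...", "ast": {}, "runtime": 0.05},
}


def check_generality(evaluator: Callable[[Any], dict]) -> dict:
    """Run one evaluator on every sample and report, per task, whether it behaved."""
    report = {}
    for task, program_output in SAMPLE_OUTPUTS.items():
        try:
            metrics = evaluator(program_output)
            report[task] = isinstance(metrics, dict) and all(
                isinstance(k, str) and isinstance(v, Real) for k, v in metrics.items()
            )
        except Exception:
            report[task] = False  # crashing on a task counts as a failure
    return report
```

An evaluator that indexes `program_output["radii"]` unconditionally would pass only the circle-packing entry, which is exactly the kind of task coupling this test is meant to expose.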
|
| --- |
|
|
| ## 💡 Key Advantages |
|
|
| ### vs. the Original Design |
|
|
| | Aspect | Original design | New design | |
| |------|--------|--------| |
| | **Generality** | ❌ Works only for circle packing | ✅ Works for any task | |
| | **Responsibility** | ❌ ShinkaEvolve handles evaluation | ✅ Eval Service handles evaluation | |
| | **Efficiency** | ❌ Data must be reloaded | ✅ Done in one pass | |
| | **Maintainability** | ❌ Logic is scattered | ✅ Responsibilities are clear | |
|
|
| --- |
|
|
| ## Document Locations |
|
|
| - Detailed design: `docs/eval_service_redesign_v2.md` |
| - Original analysis: `docs/eval_service_metrics_analysis.md` |
| - Original plan: `docs/eval_service_integration_plan.md` (outdated) |
|
|