# Eval Service Redesign - Key Points
## 🎯 Two Key Problems
### 1. **Insufficient generality** ❌
**Original design**:
```python
# Assumes the circle-packing-specific format
import numpy as np

def load_program_output(results_dir):
    data = np.load(f"{results_dir}/extra.npz")
    return data['centers'], data['radii']  # ← only works for circle packing
```
**New design** ✅:
```python
# Task-agnostic, generic interface
def evaluate_auxiliary_metrics(program_output) -> dict:
    """
    Args:
        program_output: any format, for example
            - Circle packing: {"centers": ndarray, "radii": ndarray}
            - TSP: {"tour": list, "distance": float}
            - Code: {"code": str, "ast": dict, "runtime": float}
    """
    # The agent adapts to whichever format it receives
    pass
```
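As a minimal sketch of what "the agent adapts to the format" could look like, the function below checks which keys are present instead of assuming a fixed layout. The metric names (`aux_radius_std_dev`, `aux_tour_length`, `aux_code_length`) and the use of plain Python sequences in place of ndarrays are assumptions made for illustration; they are not taken from the design.

```python
import statistics
from typing import Any


def evaluate_auxiliary_metrics(program_output: Any) -> dict:
    """Return auxiliary metrics for whichever known keys are present."""
    metrics: dict = {}
    if not isinstance(program_output, dict):
        return metrics  # unknown format: report nothing rather than crash

    radii = program_output.get("radii")
    if radii is not None and len(radii) > 0:
        # Population standard deviation of the circle radii.
        metrics["aux_radius_std_dev"] = statistics.pstdev([float(r) for r in radii])

    tour = program_output.get("tour")
    if tour is not None:
        metrics["aux_tour_length"] = len(tour)

    code = program_output.get("code")
    if isinstance(code, str):
        metrics["aux_code_length"] = len(code)

    return metrics
```

The same function can be handed a circle packing, a TSP, or a code-optimization output; keys it does not recognize are simply ignored.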
### 2. **Unclear responsibilities** ❌
**Original design**:
```
ShinkaEvolve → scheduler → evaluate.py → metrics.json
                                              ↓
                                    Notify Eval Service
                                              ↓
                    [Bystander only: generates auxiliary_metrics.py]
```
**New design** ✅:
```
ShinkaEvolve → submit evaluation request → Eval Service
                                                ↓
                                     Run primary evaluator
                                                ↓
                                     Run auxiliary evaluators
                                                ↓
                                     Save complete metrics.json
                                                ↓
                                     Return results to ShinkaEvolve
```
---
## 🏗️ Core Design Principles
### 1. **Clear separation of responsibilities**
| Component | Responsible for | Not responsible for |
|-----------|-----------------|---------------------|
| **ShinkaEvolve** | • Code generation<br>• Evolution decisions<br>• Database management | ❌ Evaluating programs |
| **Eval Service** | • Running the primary evaluator<br>• Running auxiliary evaluators<br>• Producing the complete metrics | ❌ Evolution logic |
### 2. **Generic interface contract**
**Evaluator Contract** (every task must follow it):
```python
# Standard output format of the primary evaluator
{
    "combined_score": float,    # required
    "correct": bool,            # required
    "public_metrics": dict,     # optional
    "private_metrics": dict,    # optional
    "program_output": Any       # optional, task-specific
}

# Standard interface of an auxiliary evaluator
def evaluate_auxiliary_metrics(program_output) -> dict:
    """Accepts program_output in any format."""
    pass
```
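The contract above can be checked mechanically before a result is accepted. The following is a sketch under assumptions: the helper name `validate_primary_result` and the error strings are hypothetical, and treating a boolean `combined_score` as invalid is a choice made here, not something the contract states.

```python
from typing import Any

REQUIRED_FIELDS = {"combined_score": (int, float), "correct": bool}
OPTIONAL_FIELDS = {"public_metrics": dict, "private_metrics": dict}


def validate_primary_result(result: Any) -> list:
    """Return a list of contract violations; an empty list means valid."""
    if not isinstance(result, dict):
        return ["result must be a dict"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in result:
            errors.append(f"missing required field: {name}")
        elif not isinstance(result[name], expected) or (
            # bool is a subclass of int, so reject it explicitly for scores
            name == "combined_score" and isinstance(result[name], bool)
        ):
            errors.append(f"wrong type for {name}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in result and not isinstance(result[name], expected):
            errors.append(f"wrong type for {name}")
    # program_output is optional and task-specific, so it is not checked.
    return errors
```

Returning a list of problems rather than raising on the first one makes it easier to report every violation of a new task's evaluator at once.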
### 3. **Everything in a single pass**
```python
# New flow: a single API call completes every evaluation
POST /api/v1/evaluate
    ↓
run_full_evaluation:
    1. run_primary_evaluator() → primary metrics
    2. run_dynamic_auxiliary() → auxiliary metrics (agent-generated)
    3. run_static_auxiliary()  → auxiliary metrics (pre-defined)
    4. merge_and_save()        → complete metrics.json
```
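The four steps could be wired together roughly as follows. This is a sketch, not the service's implementation: the evaluators are passed in as plain callables, `program_output` is dropped before saving (on the assumption that it may not be JSON-serializable), and auxiliary failures are recorded under a hypothetical `auxiliary_errors` key instead of failing the whole job.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def run_full_evaluation(
    run_primary: Callable[[], dict],
    dynamic_aux: Optional[Callable[[object], dict]],
    static_aux: Optional[Callable[[object], dict]],
    results_dir: str,
) -> dict:
    """Run primary and auxiliary evaluators, merge, and save metrics.json."""
    result = dict(run_primary())                      # step 1
    program_output = result.pop("program_output", None)
    public = dict(result.get("public_metrics", {}))

    for aux in (dynamic_aux, static_aux):             # steps 2 and 3
        if aux is None or program_output is None:
            continue
        try:
            public.update(aux(program_output))
        except Exception as exc:  # an auxiliary failure should not sink the job
            result.setdefault("auxiliary_errors", []).append(repr(exc))

    result["public_metrics"] = public                 # step 4
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "metrics.json").write_text(json.dumps(result, indent=2))
    return result
```

Because `program_output` stays in memory between steps 1 and 2, the auxiliary evaluators never have to reload it from disk.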
---
## 📊 API Design
### Submit an evaluation (asynchronous)
```python
# POST /api/v1/evaluate
{
    "program_path": "gen_42/main.py",
    "results_dir": "gen_42/results",
    "generation": 42,
    "experiment_root": "/path/to/experiment",
    "evaluation_config": {
        "primary_evaluator": "examples/circle_packing/evaluate.py",
        "timeout": 300
    },
    "auxiliary_config": {
        "enabled": true,
        "use_dynamic": true,   # agent-generated
        "use_static": false    # pre-defined
    }
}
# Response
{
    "status": "accepted",
    "job_id": "eval_job_42"
}
```
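From the client side, submitting a job might look like the sketch below, which uses only the standard library. The base URL is a placeholder and the function names are hypothetical; the endpoint path and the `status`/`job_id` response fields come from the example above.

```python
import json
import urllib.request


def build_evaluate_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/evaluate request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_evaluation(base_url: str, payload: dict, timeout: float = 10.0) -> str:
    """Send the request and return the job_id from the response."""
    request = build_evaluate_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    if body.get("status") != "accepted":
        raise RuntimeError(f"evaluation not accepted: {body}")
    return body["job_id"]
```

Splitting request construction from sending keeps the payload logic testable without a running service.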
### Query the result
```python
# GET /api/v1/evaluate/{job_id}
{
    "status": "completed",
    "evaluation_result": {
        "combined_score": 2.34,
        "public_metrics": {
            "num_circles": 26,
            "aux_radius_std_dev": 0.031,      # ← auxiliary
            "aux_spatial_uniformity": 0.82    # ← auxiliary
        },
        "auxiliary_metric_definitions": {
            "aux_radius_std_dev": {
                "name": "Radius Standard Deviation",
                "description": "...",
                "interpretation": "lower_better"
            }
        }
    }
}
```
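Since submission is asynchronous, the caller has to poll this endpoint until the job finishes. A minimal polling loop, assuming a caller-supplied `fetch_status` function that performs the GET and returns the decoded JSON body, could look like this. Only the `completed` status appears in the summary; the `failed` status and the timeout defaults are assumptions.

```python
import time
from typing import Callable


def wait_for_result(
    fetch_status: Callable[[str], dict],
    job_id: str,
    poll_interval: float = 1.0,
    max_wait: float = 300.0,
) -> dict:
    """Poll until the job completes; return its evaluation_result."""
    deadline = time.monotonic() + max_wait
    while True:
        body = fetch_status(job_id)
        status = body.get("status")
        if status == "completed":
            return body["evaluation_result"]
        if status == "failed":  # assumed status value, not in the summary
            raise RuntimeError(f"job {job_id} failed: {body}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {max_wait}s")
        time.sleep(poll_interval)
```

Any other status is treated as "still running", so the loop keeps working if the service later adds intermediate states.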
---
## 🔄 Complete Data Flow
```
┌────────────────────────────────────────────────────────────┐
│ ShinkaEvolve (Generation 42)                               │
│   1. Generate code: gen_42/main.py                         │
│   2. Submit evaluation request → Eval Service              │
│   3. Poll status and wait for completion                   │
└────────────────────────────────────────────────────────────┘
                             ↓ POST /api/v1/evaluate
┌────────────────────────────────────────────────────────────┐
│ Eval Service (background task)                             │
│   1. Run primary: evaluate.py                              │
│      → combined_score, program_output                      │
│                                                            │
│   2. Load auxiliary_metrics.py (if present)                │
│                                                            │
│   3. Run auxiliary:                                        │
│      evaluate_auxiliary_metrics(                           │
│        program_output                                      │
│      )                                                     │
│      → aux_radius_std_dev, aux_...                         │
│                                                            │
│   4. Merge results, save metrics.json                      │
│      {                                                     │
│        "combined_score": 2.34,                             │
│        "public_metrics": {                                 │
│          "num_circles": 26,                                │
│          "aux_*": ...                                      │
│        },                                                  │
│        "auxiliary_metric_definitions": {                   │
│          ...                                               │
│        }                                                   │
│      }                                                     │
└────────────────────────────────────────────────────────────┘
                             ↓ Result ready
┌────────────────────────────────────────────────────────────┐
│ ShinkaEvolve (continues evolution)                         │
│   1. GET /api/v1/evaluate/{job_id}                         │
│   2. Read combined_score                                   │
│   3. Evolution decisions (selection, crossover, mutation)  │
│   4. [Optional] auxiliary metrics → LLM                    │
└────────────────────────────────────────────────────────────┘
```
---
## 🎨 Generalizing the Agent Prompt
### Old prompt (circle-packing-specific)
```
Analyze centers and radii...
```
### New prompt (task-agnostic)
```jinja2
<TASK_CONTEXT>
Task: {{ task_name }}
Primary Metric: {{ primary_metric_name }}
</TASK_CONTEXT>
<PROGRAM_OUTPUT_FORMAT>
The program output structure:
{{ program_output_structure }}
Example:
{{ program_output_example }}
</PROGRAM_OUTPUT_FORMAT>
<REQUIRED_INTERFACE>
def evaluate_auxiliary_metrics(program_output) -> dict:
    """
    Args:
        program_output: the format depends on the task, for example
            Circle packing: {"centers": ndarray, "radii": ndarray}
            TSP: {"tour": list, "distance": float}
            Code: {"code": str, "ast": dict}
    Returns:
        dict: Auxiliary metrics
    """
    # Your code adapts to the format of program_output
    pass
</REQUIRED_INTERFACE>
```
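The template leaves open how `{{ program_output_structure }}` gets filled. One possible approach, sketched here with a hypothetical helper `describe_structure`, is to derive a short textual description from a sample output so that no task-specific text has to be written by hand.

```python
from typing import Any


def describe_structure(program_output: Any) -> str:
    """Return one 'key: type' line per top-level field of program_output."""
    if not isinstance(program_output, dict):
        return type(program_output).__name__
    lines = []
    for key, value in program_output.items():
        kind = type(value).__name__
        shape = getattr(value, "shape", None)  # ndarray-like objects
        if shape is not None:
            kind += f" with shape {tuple(shape)}"
        elif isinstance(value, (list, tuple, dict, str)):
            kind += f" of length {len(value)}"
        lines.append(f"{key}: {kind}")
    return "\n".join(lines)
```

The returned string can then be passed as the `program_output_structure` variable when the template is rendered.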
---
## 📋 Implementation Roadmap
### Phase 1: Backward compatibility (1 week)
```python
class EvolutionConfig:
    use_eval_service_for_evaluation: bool = False  # new option
```
- `False`: use the legacy path (scheduler → evaluate.py)
- `True`: use the new path (the eval service runs the evaluation)
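The switch between the two paths can stay very small. In the sketch below, `legacy_evaluate` and `service_evaluate` are hypothetical callables standing in for the scheduler path and the eval service client; only the flag name comes from the config above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvolutionConfig:
    use_eval_service_for_evaluation: bool = False  # new option


def evaluate_program(
    config: EvolutionConfig,
    program_path: str,
    legacy_evaluate: Callable[[str], dict],
    service_evaluate: Callable[[str], dict],
) -> dict:
    """Route one evaluation to the legacy path or the eval service."""
    if config.use_eval_service_for_evaluation:
        return service_evaluate(program_path)
    return legacy_evaluate(program_path)
```

Keeping the default at `False` means existing experiments behave exactly as before until they opt in.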
### Phase 2: Implement the new API (2 weeks)
- [ ] `/api/v1/evaluate` endpoint
- [ ] `run_full_evaluation()` pipeline
- [ ] Generic `load_program_output()`
- [ ] Generic `run_dynamic_auxiliary()`
### Phase 3: Update the agent (1 week)
- [ ] Generalize the prompt template
- [ ] Test on different tasks
### Phase 4: Full cutover (1 week)
- [ ] Enable the new mode by default
- [ ] Deprecate the legacy path
- [ ] Update the documentation
---
## 🎯 Validation Criteria
### Generality test
Validate on 3 different tasks:
1. **Circle Packing** (geometry)
```python
program_output = {"centers": ndarray, "radii": ndarray}
```
2. **TSP** (combinatorial)
```python
program_output = {"tour": [0,3,1,4,2], "distance": 142.5}
```
3. **Code Optimization** (programs)
```python
program_output = {
    "code": "def foo(): ...",
    "ast": {...},
    "runtime": 0.05
}
```
✅ If the same architecture works for all 3 tasks → validation passes
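The generality test could be automated with a small harness that runs one auxiliary evaluator against a sample output from each task. This is a sketch: `check_generality` is a hypothetical name, and "works" is taken here to mean that the evaluator returns a JSON-serializable dict without raising.

```python
import json
from typing import Callable, Dict, List


def check_generality(
    evaluator: Callable[[dict], dict],
    samples: Dict[str, dict],
) -> Dict[str, List[str]]:
    """Run one evaluator on every task sample and collect problems per task."""
    problems: Dict[str, List[str]] = {}
    for task, program_output in samples.items():
        issues: List[str] = []
        try:
            metrics = evaluator(program_output)
            if not isinstance(metrics, dict):
                issues.append("evaluator did not return a dict")
            else:
                json.dumps(metrics)  # must be storable in metrics.json
        except Exception as exc:
            issues.append(f"{type(exc).__name__}: {exc}")
        problems[task] = issues
    return problems
```

An empty list for every task corresponds to the pass condition above; any non-empty list points at the task where the evaluator still assumes a specific format.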
---
## 💡 Key Advantages
### vs. the original design
| Aspect | Original design | New design |
|--------|-----------------|------------|
| **Generality** | ❌ Only works for circle packing | ✅ Works for any task |
| **Responsibility** | ❌ ShinkaEvolve runs the evaluation | ✅ Eval Service runs the evaluation |
| **Efficiency** | ❌ Data has to be reloaded | ✅ Done in a single pass |
| **Maintainability** | ❌ Logic is scattered | ✅ Responsibilities are clear |
---
## 📄 Document Locations
- Detailed design: `docs/eval_service_redesign_v2.md`
- Original analysis: `docs/eval_service_metrics_analysis.md`
- Original plan: `docs/eval_service_integration_plan.md` (outdated)