# Evaluation Agent Design
## 📋 Feasibility Analysis
**Conclusion: fully feasible.** Turning the evaluation script into an agent is not only technically feasible; it can also significantly improve the system's adaptability and intelligence.
---
## ๐Ÿ—๏ธ ๅฝ“ๅ‰ๆžถๆž„ๅˆ†ๆž
### ๅฝ“ๅ‰Evaluationๅทฅไฝœๆต
```
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ EvolutionRunner (ๆŽงๅˆถๅ™จ) โ”‚
โ”‚ โ”œโ”€ ็”Ÿๆˆๆ–ฐไปฃ็  (gen_N/main.py) โ”‚
โ”‚ โ”œโ”€ ๆไบคjobๅˆฐJobScheduler โ”‚
โ”‚ โ””โ”€ ็ญ‰ๅพ…็ป“ๆžœ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ JobScheduler ๆ‰ง่กŒๅ‘ฝไปค: โ”‚
โ”‚ python evaluate_with_auxiliary.py \ โ”‚
โ”‚ --program_path gen_N/main.py \ โ”‚
โ”‚ --results_dir gen_N/results โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Evaluation Script (็‹ฌ็ซ‹่ฟ›็จ‹) โ”‚
โ”‚ โ”œโ”€ ๅŠ ่ฝฝ็จ‹ๅบ โ”‚
โ”‚ โ”œโ”€ ่ฟ่กŒๅฎž้ชŒ (run_packing) โ”‚
โ”‚ โ”œโ”€ ้ชŒ่ฏ็ป“ๆžœ (validate_packing) โ”‚
โ”‚ โ”œโ”€ ่ฎก็ฎ—metrics (ๅ›บๅฎš็š„7ไธชauxiliary metrics) โ”‚
โ”‚ โ”œโ”€ ็”Ÿๆˆๆ–‡ๆœฌๅ้ฆˆ โ”‚
โ”‚ โ””โ”€ ไฟๅญ˜็ป“ๆžœๅˆฐๆ–‡ไปถ: โ”‚
โ”‚ โ€ข metrics.json โ”‚
โ”‚ โ€ข correct.json โ”‚
โ”‚ โ€ข extra.npz โ”‚
โ”‚ โ€ข auxiliary_analysis.json โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ EvolutionRunner ่ฏปๅ–็ป“ๆžœ โ”‚
โ”‚ โ”œโ”€ ่งฃๆž metrics.json โ”‚
โ”‚ โ”œโ”€ ๆๅ– combined_score, public_metrics โ”‚
โ”‚ โ”œโ”€ ๅ†™ๅ…ฅๆ•ฐๆฎๅบ“ (ProgramDatabase) โ”‚
โ”‚ โ””โ”€ ็”จไบŽ้€‰ๆ‹ฉไธ‹ไธ€ไปฃ็ˆถ็จ‹ๅบ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```
### Key Interface Contracts
**Input interface** (command-line arguments):
```python
--program_path: str  # path of the program to evaluate
--results_dir: str   # directory where results are saved
--aux_config: str    # (optional) auxiliary evaluation config
```
**Output interface** (file system):
```json
# metrics.json
{
  "combined_score": 2.635,   # main score (required)
  "public": {                # public metrics (visible to the LLM)
    "centers_str": "...",
    "num_circles": 26,
    "aux_packing_efficiency": 0.842,
    "aux_gap_analysis": 0.756,
    ...
  },
  "private": {               # private metrics (logged only)
    "reported_sum_of_radii": 2.635
  },
  "text_feedback": "..."     # (optional) text feedback
}
# correct.json
{
  "correct": true,
  "error": null
}
```
**ๆ•ฐๆฎๅบ“Schema** (Program่กจ):
```python
@dataclass
class Program:
# ่บซไปฝๆ ‡่ฏ†
id: str
code: str
generation: int
parent_id: Optional[str]
# ่ฏ„ไผฐ็ป“ๆžœ (็”ฑevaluationๅ†™ๅ…ฅ)
combined_score: float
public_metrics: Dict[str, Any]
private_metrics: Dict[str, Any]
text_feedback: str
correct: bool
# ่พ…ๅŠฉๆ•ฐๆฎ
embedding: List[float]
metadata: Dict[str, Any]
# ่ฟ›ๅŒ–ๅ…ณ็ณป
archive_inspiration_ids: List[str]
top_k_inspiration_ids: List[str]
children_count: int
```
---
## 🤖 Agent Conversion Plan
### Core Design Principles
**What makes an agent different from a script:**
1. **Autonomous decisions**: the agent chooses its analysis strategy based on context
2. **Dynamic tool use**: the agent can invoke different tools and generate new code
3. **History awareness**: the agent can query the database to understand the evolution history
4. **Meta-learning**: the agent can improve its own evaluation strategy
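The script-vs-agent distinction can be sketched in a few lines: instead of a hard-coded call sequence, the controller dispatches to tools from a plan that exists as data and can change at runtime. The tool names and the `plan` stub below are purely illustrative.

```python
def run_agent_loop(tools: dict, plan: list[dict], max_tool_calls: int = 20) -> list:
    """Execute a data-driven plan by dispatching to registered tools."""
    log = []
    for step in plan[:max_tool_calls]:      # budget cap on tool calls
        tool = tools[step["action"]]        # dynamic tool dispatch
        result = tool(**step.get("params", {}))
        log.append((step["action"], result))
    return log


# A script would hard-code these calls; the agent receives them as a plan.
tools = {
    "run_program": lambda **kw: {"score": 2.635},
    "compute_metric": lambda **kw: {kw["metric_name"]: 0.842},
}
plan = [
    {"action": "run_program"},
    {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
]
log = run_agent_loop(tools, plan)
```

Because the plan is data, an LLM can emit it, truncate it against a budget, or replace it mid-run, which is exactly what the adaptive mode below does.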
### Agent Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ EvaluationAgent (main controller)                            │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Core Components:                                         │ │
│ │ • LLM (decision maker)                                   │ │
│ │ • Tool Registry (set of callable tools)                  │ │
│ │ • Database Access (read/write historical data)           │ │
│ │ • Code Executor (safely run generated code)              │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Workflow:                                                │ │
│ │ 1. Receive evaluation request (program_path, results_dir)│ │
│ │ 2. Query the database for context                        │ │
│ │ 3. LLM plans the evaluation strategy                     │ │
│ │ 4. Execute the steps (call tools / generate code)        │ │
│ │ 5. Aggregate results and generate feedback               │ │
│ │ 6. (Optional) update database metadata                   │ │
│ │ 7. Save the standard output files                        │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Tools available to the agent:
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Ground Truth     │ │ Auxiliary        │ │ Dynamic Metric   │
│ Evaluation       │ │ Metrics          │ │ Generator        │
│ • run program    │ │ • predefined     │ │ • LLM-generated  │
│ • check          │ │   metrics        │ │   code           │
│   constraints    │ │ • registry       │ │ • compile & run  │
│ • compute main   │ │                  │ │ • safe sandbox   │
│   score          │ │                  │ │                  │
└──────────────────┘ └──────────────────┘ └──────────────────┘
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Database Query   │ │ Visualization    │ │ Meta Analysis    │
│ • query history  │ │ • generate plots │ │ • trend analysis │
│ • statistics     │ │ • save figures   │ │ • strategy       │
│ • compare        │ │                  │ │   suggestions    │
│   programs       │ │                  │ │                  │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Database access permissions:
┌──────────────────────────────────────────────────────────────┐
│ Database Interface (ProgramDatabase)                         │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ READ operations (available to the agent):                │ │
│ │ • get_all_programs()                                     │ │
│ │ • get_programs_by_generation(gen)                        │ │
│ │ • get_top_programs(n, metric)                            │ │
│ │ • get_best_program(metric)                               │ │
│ │ • get_program(id)                                        │ │
│ │ • custom SQL queries (restricted)                        │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ WRITE operations (use with care):                        │ │
│ │ • may only write the metadata field                      │ │
│ │ • must not modify core fields such as combined_score     │ │
│ │   and correct                                            │ │
│ │ • may attach extra analysis results to metadata          │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```
---
## 🔌 Agent External Interfaces
### 1. Command-Line Interface (backward compatible)
```bash
# Basic interface (fully compatible with the current script)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results

# Extended interface (new features)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results \
  --db_path path/to/evolution.sqlite \  # agent may access the database
  --agent_mode adaptive \               # evaluation mode: static|adaptive|exploratory
  --enable_dynamic_metrics \            # allow generating new metrics
  --feedback_style detailed             # feedback style: minimal|normal|detailed
```
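The backward-compatibility requirement falls out naturally if every new flag has a default. A minimal `argparse` sketch of a hypothetical `evaluation_agent.py` entry point:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting both the legacy flags and the agent extensions."""
    p = argparse.ArgumentParser(prog="evaluation_agent.py")
    # Legacy interface (required, identical to the current script)
    p.add_argument("--program_path", required=True)
    p.add_argument("--results_dir", required=True)
    p.add_argument("--aux_config", default=None)
    # Agent extensions (all optional, so existing callers keep working)
    p.add_argument("--db_path", default=None)
    p.add_argument("--agent_mode", default="static",
                   choices=["static", "adaptive", "exploratory"])
    p.add_argument("--enable_dynamic_metrics", action="store_true")
    p.add_argument("--feedback_style", default="normal",
                   choices=["minimal", "normal", "detailed"])
    return p
```

A caller passing only the two legacy flags gets `agent_mode="static"` and no database access, i.e. exactly the old behavior.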
### 2. Python API
```python
from shinka.evaluation import EvaluationAgent, AgentConfig

# Configure the agent
agent_config = AgentConfig(
    # LLM configuration
    llm_model="native-gemini-2.5-pro",
    llm_temperature=0.7,
    # Evaluation mode
    mode="adaptive",                        # static | adaptive | exploratory
    # Tool access permissions
    enable_ground_truth=True,               # required
    enable_auxiliary_metrics=True,          # predefined auxiliary metrics
    enable_dynamic_metrics=True,            # LLM-generated new metrics
    enable_database_read=True,              # read historical data
    enable_database_write_metadata=False,   # write metadata
    # Safety configuration
    code_execution_timeout=30,              # timeout for generated code
    max_tool_calls=20,                      # maximum number of tool calls
    sandboxed_execution=True,               # sandboxed execution
    # Output configuration
    generate_text_feedback=True,
    save_detailed_analysis=True,
    visualization=True,
)

# Create the agent
agent = EvaluationAgent(
    config=agent_config,
    db_path="path/to/evolution.sqlite"  # optional
)

# Run the evaluation
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)

# The agent automatically saves the standard output files:
# - metrics.json
# - correct.json
# - auxiliary_analysis.json
# - (optional) agent_reasoning.json  # the agent's decision process
```
### 3. EvolutionRunner Integration
```python
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig
from shinka.evaluation import EvaluationAgentConfig  # new

# Configure the job to use the agent evaluator
job_config = LocalJobConfig(
    eval_program_path="shinka/evaluation/agent_main.py",  # agent entry point
    extra_cmd_args={
        "agent_mode": "adaptive",
        "enable_dynamic_metrics": True,
        "db_path": "auto",  # pass the database path automatically
    }
)

# Database configuration
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

# Evolution configuration
evo_config = EvolutionConfig(
    # ... other settings
    use_text_feedback=True,  # consume the agent-generated feedback
)

# At runtime the agent automatically receives:
# 1. the program path for the current generation
# 2. database access (via the --db_path argument)
# 3. information about historical programs
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```
---
## 🛠️ Agent Tool System
### Tool Interface Specification
```python
from typing import Any, Dict, Optional
from dataclasses import dataclass

@dataclass
class ToolResult:
    """Result of a tool execution"""
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0  # API cost

class Tool:
    """Tool base class"""
    name: str
    description: str
    parameters: Dict[str, Any]  # JSON Schema

    def execute(self, **kwargs) -> ToolResult:
        """Run the tool's logic"""
        raise NotImplementedError
```
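A concrete instance of this interface, with a trivial `EchoTool` and a `ToolRegistry` for name-based dispatch. Both names are illustrative (the design above only mentions a "Tool Registry" component); the base classes are repeated so the snippet is self-contained.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0


class Tool:
    name: str = ""
    description: str = ""

    def execute(self, **kwargs) -> ToolResult:
        raise NotImplementedError


class EchoTool(Tool):
    """Trivial tool, used here only to show registration and dispatch."""
    name = "echo"
    description = "Return its arguments unchanged"

    def execute(self, **kwargs) -> ToolResult:
        return ToolResult(success=True, data=kwargs)


class ToolRegistry:
    """Name-based dispatch: unknown tools fail soft with a ToolResult."""
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> ToolResult:
        if name not in self._tools:
            return ToolResult(success=False, error=f"unknown tool: {name}")
        return self._tools[name].execute(**kwargs)
```

Failing soft (a `ToolResult` with `success=False` rather than an exception) matters for the adaptive mode, where the LLM inspects failed steps and replans.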
### Core Tool Inventory
```python
# ============================================================================
# 1. GROUND TRUTH EVALUATION (required tools)
# ============================================================================
class RunProgramTool(Tool):
    """Run the program under evaluation and collect raw results"""
    name = "run_program"
    description = "Execute the program and get raw results (centers, radii, score)"

    def execute(self, program_path: str, num_runs: int = 1) -> ToolResult:
        # Reuses the underlying logic of run_shinka_eval
        # Returns: centers, radii, reported_score
        pass

class ValidateResultsTool(Tool):
    """Check whether the program's output satisfies all constraints"""
    name = "validate_results"
    description = "Validate if results satisfy all constraints"

    def execute(self, centers, radii) -> ToolResult:
        # Calls adapted_validate_packing
        # Returns: is_valid, error_message
        pass
# ============================================================================
# 2. AUXILIARY METRICS (predefined analysis tools)
# ============================================================================
class ComputeMetricTool(Tool):
    """Compute a predefined auxiliary metric"""
    name = "compute_metric"
    description = "Compute a predefined auxiliary metric"
    parameters = {
        "metric_name": {
            "type": "string",
            "enum": ["packing_efficiency", "gap_analysis", "edge_utilization", ...]
        }
    }

    def execute(self, metric_name: str, centers, radii) -> ToolResult:
        # Calls METRIC_REGISTRY.get(metric_name)
        pass

class ListMetricsTool(Tool):
    """List all available predefined metrics"""
    name = "list_metrics"

    def execute(self) -> ToolResult:
        return ToolResult(
            success=True,
            data=METRIC_REGISTRY.list_metrics()
        )
# ============================================================================
# 3. DATABASE ACCESS (historical data tools)
# ============================================================================
class QueryDatabaseTool(Tool):
    """Query the database for historical program information"""
    name = "query_database"
    description = "Query historical programs from database"
    parameters = {
        "query_type": {
            "type": "string",
            "enum": ["top_programs", "by_generation", "best_program", "all"]
        },
        "filters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "n": {"type": "integer"},
                "generation": {"type": "integer"}
            }
        }
    }

    def execute(self, query_type: str, filters: Dict) -> ToolResult:
        if query_type == "top_programs":
            programs = self.db.get_top_programs(
                n=filters.get("n", 10),
                metric=filters.get("metric", "combined_score")
            )
        elif query_type == "by_generation":
            programs = self.db.get_programs_by_generation(filters["generation"])
        else:
            programs = []  # ... remaining query types elided
        return ToolResult(
            success=True,
            data=[p.to_dict() for p in programs]
        )

class CompareWithHistoryTool(Tool):
    """Compare the current program against historical programs"""
    name = "compare_with_history"

    def execute(self, current_metrics: Dict, comparison_type: str) -> ToolResult:
        # comparison_type: "best" | "parent" | "generation_average"
        # Returns the comparison analysis
        pass
# ============================================================================
# 4. DYNAMIC METRIC GENERATION (LLM-generated new metrics)
# ============================================================================
class GenerateMetricCodeTool(Tool):
    """Have the LLM generate code for a new evaluation metric"""
    name = "generate_metric_code"
    description = "Generate Python code for a new evaluation metric"
    parameters = {
        "metric_purpose": {"type": "string"},
        "inspiration_from": {"type": "string"}  # reference to an existing metric
    }

    def execute(self, metric_purpose: str, inspiration_from: str = None) -> ToolResult:
        # Ask the LLM to generate new metric code,
        # using the LLMGeneratedMetric framework
        prompt = f"""
Generate a Python function to compute a new auxiliary metric for circle packing.
Purpose: {metric_purpose}
Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Return MetricResult with name, value, interpretation, description, details
3. Use numpy for computations
4. Handle edge cases gracefully
Example structure:

    def my_metric(centers, radii):
        # Your analysis logic here
        score = ...
        return MetricResult(
            name="my_metric",
            value=float(score),
            interpretation="higher_better",
            description="What this metric measures",
            details={{"key": "value"}}
        )
"""
        llm_response = self.llm.query(prompt)
        code = extract_code_from_response(llm_response)
        return ToolResult(
            success=True,
            data={"code": code, "cost": llm_response.cost}
        )

class CompileAndTestMetricTool(Tool):
    """Compile and test LLM-generated metric code"""
    name = "compile_and_test_metric"

    def execute(self, code: str, test_data: Dict) -> ToolResult:
        metric = LLMGeneratedMetric(
            name="llm_metric",
            code=code,
            description="LLM generated metric",
            interpretation="higher_better"
        )
        if not metric.compile():
            return ToolResult(success=False, data=None, error="Compilation failed")
        # Test execution
        try:
            result = metric.evaluate(
                centers=test_data["centers"],
                radii=test_data["radii"]
            )
            return ToolResult(success=True, data=result)
        except Exception as e:
            return ToolResult(success=False, data=None, error=str(e))
# ============================================================================
# 5. VISUALIZATION & ANALYSIS (analysis tools)
# ============================================================================
class VisualizeTool(Tool):
    """Generate visualizations"""
    name = "visualize"

    def execute(self, vis_type: str, data: Dict, output_path: str) -> ToolResult:
        # vis_type: "packing" | "metrics_trend" | "comparison"
        pass

class StatisticalAnalysisTool(Tool):
    """Statistical analysis tool"""
    name = "statistical_analysis"

    def execute(self, data: List[float], analysis_type: str) -> ToolResult:
        # analysis_type: "trend" | "distribution" | "correlation"
        pass

# ============================================================================
# 6. META OPERATIONS
# ============================================================================
class UpdateMetadataTool(Tool):
    """Update a program's metadata field"""
    name = "update_metadata"
    description = "Add analysis results to program metadata (write to DB)"

    def execute(self, program_id: str, metadata: Dict) -> ToolResult:
        # Only the metadata field may be written; core evaluation
        # fields must not be modified.
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # Write back to the database.
            # Note: ProgramDatabase needs a new update_metadata method.
        pass
```
---
## 🧠 Agent Decision Flow
### Mode 1: Static Mode (compatibility mode)
```python
def static_evaluation(agent, program_path, results_dir):
    """
    Fully reproduces the behavior of the existing evaluation script.
    """
    # 1. Run the program
    result = agent.tools["run_program"].execute(program_path)
    centers, radii, score = result.data

    # 2. Validate the results
    validation = agent.tools["validate_results"].execute(centers, radii)
    correct = validation.data["is_valid"]

    # 3. Compute the predefined auxiliary metrics
    auxiliary_results = {}
    for metric_name in agent.config.enabled_metrics:
        metric_result = agent.tools["compute_metric"].execute(
            metric_name, centers, radii
        )
        auxiliary_results[metric_name] = metric_result.data.value

    # 4. Generate the standard feedback
    feedback = generate_standard_feedback(auxiliary_results, score)

    # 5. Save the results
    metrics = {
        "combined_score": score,
        "public": {
            "centers_str": format_centers_string(centers),
            "num_circles": len(centers),
            **{f"aux_{k}": v for k, v in auxiliary_results.items()}
        },
        "private": {"reported_sum_of_radii": score},
        "text_feedback": feedback
    }
    save_metrics(results_dir, metrics, correct)
    return metrics, correct
```
### Mode 2: Adaptive Mode
```python
def adaptive_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent chooses its evaluation strategy based on context.
    """
    # 1. Gather context
    context = agent.gather_context(program_path, db_path)

    # 2. LLM plans the evaluation strategy
    plan = agent.llm.plan_evaluation(context)
    # Example plan:
    # {
    #   "steps": [
    #     {"action": "run_program", "params": {...}},
    #     {"action": "query_database", "params": {"query_type": "best_program"}},
    #     {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
    #     {"action": "compare_with_history", "params": {"comparison_type": "best"}},
    #     {"action": "generate_feedback", "params": {...}}
    #   ]
    # }

    # 3. Execute the plan
    execution_log = []
    correct = False
    for step in plan["steps"]:
        tool = agent.tools[step["action"]]
        result = tool.execute(**step["params"])
        execution_log.append(result)
        # Track validity from the validation step
        if step["action"] == "validate_results" and result.success:
            correct = result.data["is_valid"]
        # If a step fails, the LLM can adjust the strategy
        if not result.success:
            plan = agent.llm.replan(plan, execution_log, result.error)

    # 4. LLM aggregates the results and generates feedback
    final_metrics, feedback = agent.llm.aggregate_results(execution_log, context)
    final_metrics["text_feedback"] = feedback

    # 5. Save the results (preserving interface compatibility)
    save_metrics(results_dir, final_metrics, correct)

    # 6. (Optional) save the agent's reasoning
    save_agent_reasoning(results_dir, plan, execution_log)
    return final_metrics, correct
```
### Mode 3: Exploratory Mode
```python
def exploratory_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent actively explores new evaluation methods.
    """
    # 1. Standard evaluation
    base_metrics, correct = adaptive_evaluation(agent, program_path, results_dir, db_path)

    # 2. Analyze historical trends
    trend_analysis = agent.tools["statistical_analysis"].execute(
        data=get_historical_scores(agent.db),
        analysis_type="trend"
    )

    # 3. If an evaluation blind spot is detected, generate a new metric
    if agent.detect_evaluation_gap(trend_analysis):
        # Re-run the program to obtain raw data for testing the new metric
        run_result = agent.tools["run_program"].execute(program_path)
        centers, radii, _ = run_result.data
        # Have the LLM generate the new metric code
        new_metric_code = agent.tools["generate_metric_code"].execute(
            metric_purpose="Identify patterns missed by existing metrics"
        )
        # Compile and test it
        test_result = agent.tools["compile_and_test_metric"].execute(
            code=new_metric_code.data["code"],
            test_data={"centers": centers, "radii": radii}
        )
        if test_result.success:
            # Register the new metric in the global registry
            register_new_metric(new_metric_code.data["code"])
            # Re-evaluate, now including the new metric
            extended_metrics = compute_with_new_metric(centers, radii)
            base_metrics["public"].update(extended_metrics)

    # 4. Save the extended results
    save_metrics(results_dir, base_metrics, correct)
    return base_metrics, correct
```
---
## 🔒 Safety Design
### Code Execution Sandbox
```python
import numpy


class SecurityError(Exception):
    """Raised when generated code contains forbidden operations."""


class SafeCodeExecutor:
    """Restricted environment for running generated code"""

    def __init__(self, timeout=30):
        self.timeout = timeout
        self.allowed_imports = {
            'numpy', 'scipy', 'math', 'statistics'
        }
        self.forbidden_operations = {
            '__import__', 'eval', 'exec', 'compile',
            'open', 'file', 'input', 'raw_input'
        }

    def has_forbidden_operations(self, code: str) -> bool:
        """Naive static check for forbidden names in the source"""
        return any(op in code for op in self.forbidden_operations)

    def execute(self, code: str, inputs: Dict) -> Any:
        """Run code in a restricted environment"""
        # 1. Static analysis check
        if self.has_forbidden_operations(code):
            raise SecurityError("Forbidden operations detected")
        # 2. Build a restricted namespace
        namespace = {
            'np': numpy,
            'MetricResult': MetricResult,
            # ... expose only the modules that are needed
        }
        namespace.update(inputs)
        # 3. Execute with a timeout
        with timeout(self.timeout):
            exec(code, namespace)
        return namespace
```
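The `timeout(...)` context manager used by `SafeCodeExecutor` is assumed above but not defined. One possible sketch uses `signal.alarm`, which only works on Unix and only in the main thread; a production sandbox would more likely run generated code in a subprocess with a kill deadline.

```python
import signal
from contextlib import contextmanager


@contextmanager
def timeout(seconds: int):
    """Interrupt the enclosed block after `seconds` (Unix, main thread only)."""
    def _raise(signum, frame):
        raise TimeoutError(f"execution exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)  # schedule SIGALRM
    try:
        yield
    finally:
        signal.alarm(0)                         # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

Note that `signal.alarm` cannot interrupt a native extension call that never returns to the interpreter, which is another argument for process-level isolation of untrusted code.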
### Database Access Control
```python
class RestrictedDatabaseAccess:
    """Restricted database access interface"""

    def __init__(self, db: ProgramDatabase):
        self.db = db
        self.read_only_methods = [
            'get_all_programs', 'get_programs_by_generation',
            'get_top_programs', 'get_best_program', 'get_program'
        ]
        self.write_allowed_fields = ['metadata']  # only metadata may be written

    def __getattr__(self, name):
        if name in self.read_only_methods:
            return getattr(self.db, name)
        else:
            raise PermissionError(f"Method {name} not allowed for agent")

    def update_metadata(self, program_id: str, metadata: Dict):
        """The only permitted write operation"""
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # Requires adding this method to ProgramDatabase
            self.db.update_program_metadata(program_id, program.metadata)
```
---
## 📊 Data Flow Between the Agent and the System
```
┌──────────────────────────────────────────────────────────────┐
│ EvolutionRunner (main system)                                │
│                                                              │
│ [each evolution generation]                                  │
│ ├─ Generate new code: gen_N/main.py                          │
│ ├─ Invoke agent evaluation ─────────────────┐                │
│ │                                           ▼                │
│ │                          ┌──────────────────────────────┐  │
│ │                          │ EvaluationAgent              │  │
│ │                          │ (separate process)           │  │
│ │                          │                              │  │
│ │                          │ Inputs:                      │  │
│ │                          │ • program_path               │  │
│ │                          │ • results_dir                │  │
│ │                          │ • db_path (optional)         │  │
│ │                          │                              │  │
│ │                          │ Internal flow:               │  │
│ │                          │ 1. Load the program          │  │
│ │                          │ 2. Run the evaluation        │  │
│ │  read database ◄─────────┼ 3. Query DB history          │  │
│ │                          │ 4. LLM planning              │  │
│ │                          │ 5. Tool calls                │  │
│ │                          │ 6. Aggregate results         │  │
│ │  (optional)              │                              │  │
│ │  write metadata ◄────────┼ 7. Save outputs              │  │
│ │                          │                              │  │
│ │                          │ Output files:                │  │
│ │                          │ • metrics.json               │  │
│ │                          │ • correct.json               │  │
│ │                          │ • agent_log.json             │  │
│ │                          └────────────┬─────────────────┘  │
│ │                                       │                    │
│ ├─ Read evaluation results ◄────────────┘                    │
│ │  • combined_score                                          │
│ │  • public_metrics (incl. aux metrics)                      │
│ │  • text_feedback                                           │
│ │                                                            │
│ ├─ Write to the database (ProgramDatabase)                   │
│ │  • create a new Program record                             │
│ │  • save all metrics                                        │
│ │  • update the archive                                      │
│ │                                                            │
│ └─ Select parent → next generation ────────────────────────► │
└──────────────────────────────────────────────────────────────┘
Database schema (shared state):
┌──────────────────────────────────────────────────────────────┐
│ SQLite: evolution_db.sqlite                                  │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ programs table                                           │ │
│ │ ├─ id (gen_N)                                            │ │
│ │ ├─ code                                                  │ │
│ │ ├─ generation (N)                                        │ │
│ │ ├─ combined_score  ◄── written by EvolutionRunner        │ │
│ │ ├─ public_metrics  ◄── written by EvolutionRunner        │ │
│ │ ├─ text_feedback   ◄── written by EvolutionRunner        │ │
│ │ ├─ correct         ◄── written by EvolutionRunner        │ │
│ │ │                                                        │ │
│ │ └─ metadata        ◄── may be written by the agent       │ │
│ │    {                                                     │ │
│ │      "agent_analysis": {...},                            │ │
│ │      "custom_metrics": {...},                            │ │
│ │      "evaluation_reasoning": "..."                       │ │
│ │    }                                                     │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ The agent may read all historical data, but may write only   │
│ the metadata field.                                          │
└──────────────────────────────────────────────────────────────┘
```
---
## 🎯 Agent External Interface Summary
### Required Interfaces (backward compatible)
```python
# 1. Command-line arguments
--program_path: str  # required
--results_dir: str   # required

# 2. Output file interface (standard contract)
metrics.json: {
    "combined_score": float,  # required
    "public": dict,           # required
    "private": dict,          # optional
    "text_feedback": str      # optional (when use_text_feedback=True)
}
correct.json: {
    "correct": bool,          # required
    "error": str | null       # required
}
```
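Since any agent mode must honor this contract, a small validator can serve as a shared test fixture across all modes. A minimal sketch; `check_metrics_contract` is a hypothetical helper, not an existing function.

```python
def check_metrics_contract(metrics: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    if not isinstance(metrics.get("combined_score"), (int, float)):
        errors.append("combined_score must be a number")
    if not isinstance(metrics.get("public"), dict):
        errors.append("public must be a dict")
    # Optional fields: checked only when present
    if "private" in metrics and not isinstance(metrics["private"], dict):
        errors.append("private must be a dict")
    if "text_feedback" in metrics and not isinstance(metrics["text_feedback"], str):
        errors.append("text_feedback must be a string")
    return errors
```

Running this check on every mode's output before writing `metrics.json` would catch contract drift introduced by LLM-aggregated results.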
### Extended Interfaces (agent features)
```python
# 1. Database access
--db_path: str  # optional; when given, the agent may access historical data

# 2. Agent mode configuration
--agent_mode: str  # static | adaptive | exploratory
--enable_dynamic_metrics: bool
--max_tool_calls: int

# 3. Additional output files
agent_reasoning.json: {  # the agent's decision process (for debugging and analysis)
    "plan": [...],
    "execution_log": [...],
    "tool_costs": {...},
    "total_cost": float
}
auxiliary_analysis.json  # detailed auxiliary analysis (already exists)
visualizations/          # visualization files (optional)
├─ packing_viz.png
├─ metrics_trend.png
└─ comparison.png
```
### Python API
```python
# 1. Agent class interface
class EvaluationAgent:
    def __init__(
        self,
        config: AgentConfig,
        db_path: Optional[str] = None
    ):
        pass

    def evaluate(
        self,
        program_path: str,
        results_dir: str
    ) -> Tuple[Dict, bool, Optional[str]]:
        """
        Returns: (metrics, correct, error)
        Fully compatible with run_shinka_eval
        """
        pass

# 2. Tool interface (used internally by the agent)
class Tool:
    def execute(self, **kwargs) -> ToolResult:
        pass

# 3. Database interface extension
class ProgramDatabase:
    # New method for the agent
    def update_program_metadata(
        self,
        program_id: str,
        metadata: Dict
    ) -> bool:
        pass
```
---
## 🚀 Implementation Roadmap
### Phase 1: Basic Agent Framework (2-3 days)
```
✓ 1. Create the EvaluationAgent class skeleton
✓ 2. Implement the Tool base class and tool registry
✓ 3. Refactor the existing evaluation code into tools
   - RunProgramTool
   - ValidateResultsTool
   - ComputeMetricTool
✓ 4. Implement static mode (fully compatible with existing behavior)
✓ 5. Unit tests
```
### Phase 2: Database Integration (1-2 days)
```
✓ 1. Create the RestrictedDatabaseAccess interface
✓ 2. Implement the database query tools
   - QueryDatabaseTool
   - CompareWithHistoryTool
✓ 3. Extend ProgramDatabase.update_program_metadata()
✓ 4. Integration tests
```
### Phase 3: Adaptive Mode (3-4 days)
```
✓ 1. Implement the LLM planning logic
✓ 2. Context gathering (historical data analysis)
✓ 3. Dynamic tool invocation
✓ 4. Result aggregation and feedback generation
✓ 5. End-to-end tests
```
### Phase 4: Dynamic Metrics (2-3 days)
```
✓ 1. Implement GenerateMetricCodeTool
✓ 2. SafeCodeExecutor sandbox
✓ 3. Dynamic metric registration and validation
✓ 4. Exploratory mode implementation
✓ 5. Security tests
```
### Phase 5: Visualization and Analysis (1-2 days)
```
✓ 1. VisualizeTool
✓ 2. StatisticalAnalysisTool
✓ 3. Visualization of the agent's reasoning process
```
### Phase 6: Production Readiness (2-3 days)
```
✓ 1. Performance optimization
✓ 2. Error handling and recovery
✓ 3. Logging and monitoring
✓ 4. Documentation polish
✓ 5. Integration into EvolutionRunner
```
**Total: 11-17 days of development time**
---
## 📝 Usage Examples
### Example 1: Static Mode (fully compatible)
```python
from shinka.evaluation import EvaluationAgent, AgentConfig

config = AgentConfig(mode="static")
agent = EvaluationAgent(config)
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)
# Output is identical to the existing evaluate_with_auxiliary.py
```
### Example 2: Adaptive Mode (intelligent evaluation)
```python
config = AgentConfig(
    mode="adaptive",
    enable_database_read=True,
    llm_model="native-gemini-2.5-pro"
)
agent = EvaluationAgent(
    config=config,
    db_path="evolution_db.sqlite"
)
metrics, correct, error = agent.evaluate(
    program_path="gen_100/main.py",
    results_dir="gen_100/results"
)
# The agent will:
# 1. query the best programs from the previous 99 generations
# 2. analyze the current program's improvements relative to history
# 3. select the most relevant auxiliary metrics
# 4. generate tailored feedback
```
### Example 3: Exploratory Mode (automatic metric discovery)
```python
config = AgentConfig(
    mode="exploratory",
    enable_dynamic_metrics=True,
    enable_database_read=True
)
agent = EvaluationAgent(config, db_path="evolution_db.sqlite")
metrics, correct, error = agent.evaluate(
    program_path="gen_150/main.py",
    results_dir="gen_150/results"
)
# The agent might:
# 1. notice that the existing metrics have plateaued
# 2. generate a new metric to detect a "corner circle size pattern"
# 3. validate the new metric's correlation with the main score
# 4. if effective, register it in the global registry for later use
```
---
## 💡 Benefits and Impact
### Improvements to the Evolution System
1. **Smarter evaluation**: the agent adjusts its evaluation strategy to the current stage of evolution
2. **Adaptive feedback**: targeted suggestions for the specific problems of the current generation
3. **Automatic discovery**: exploring new evaluation dimensions beyond what was designed by hand
4. **Explainability**: the agent's reasoning is traceable, which simplifies debugging
### Preserving Compatibility
1. **Interface compatibility**: fully honors the existing input/output contracts
2. **Incremental adoption**: start in static mode and enable advanced features gradually
3. **Controllable cost**: the agent's compute budget is configurable
4. **Non-destructive**: does not affect the reproducibility of existing experiments
---
## 🎓 Summary
### The Agent's Core External Interfaces
```
Input interface:
├─ required: program_path, results_dir
└─ optional: db_path, agent_config
Output interface:
├─ required: metrics.json, correct.json
└─ optional: agent_reasoning.json, visualizations/
Database interface:
├─ READ: all historical program data
└─ WRITE: only the program.metadata field
Tool interface:
├─ Ground Truth: run and validate programs
├─ Auxiliary Metrics: predefined analysis metrics
├─ Database: query historical data
├─ Dynamic: generate new metrics
└─ Visualization: analysis and plotting
```
### Key Design Principles
1. **Interface compatibility first**: the agent must be a drop-in replacement for the existing evaluation script
2. **Safety**: sandboxed code execution and database permission control
3. **Extensibility**: the tool system supports adding new capabilities over time
4. **Observability**: the agent's decisions are traceable and debuggable
5. **Controllable cost**: configuration balances intelligence against compute cost
### Feasibility
✅ **Technically feasible**: every component has a mature implementation approach
✅ **Architecture-friendly**: integrates seamlessly with the existing system
✅ **Incremental**: can be implemented and deployed in phases
✅ **Backward compatible**: does not break existing experiments
---
**This agent turns evaluation from a fixed pipeline into an intelligent decision process, while remaining fully compatible with the existing system!** 🚀