# Evaluation Agent Design Proposal

## Feasibility Analysis

**Conclusion: entirely feasible.** Converting the evaluation script into an agent is not only technically feasible, it can also significantly improve the system's adaptability and intelligence.

---

## Current Architecture Analysis

### Current Evaluation Workflow

```
EvolutionRunner (controller)
├─ Generate new code (gen_N/main.py)
├─ Submit job to JobScheduler
└─ Wait for results
        │
        ▼
JobScheduler runs the command:
    python evaluate_with_auxiliary.py \
        --program_path gen_N/main.py \
        --results_dir gen_N/results
        │
        ▼
Evaluation Script (separate process)
├─ Load the program
├─ Run the experiment (run_packing)
├─ Validate the result (validate_packing)
├─ Compute metrics (a fixed set of 7 auxiliary metrics)
├─ Generate text feedback
└─ Save results to files:
     • metrics.json
     • correct.json
     • extra.npz
     • auxiliary_analysis.json
        │
        ▼
EvolutionRunner reads the results
├─ Parse metrics.json
├─ Extract combined_score and public_metrics
├─ Write to the database (ProgramDatabase)
└─ Use them to select the next generation's parent program
```

### Key Interface Contracts

**Input interface** (command-line arguments):
```python
--program_path: str   # Path of the program to evaluate
--results_dir: str    # Directory where results are saved
--aux_config: str     # (Optional) auxiliary evaluation config
```

**Output interface** (file system):
```jsonc
// metrics.json
{
  "combined_score": 2.635,           // Primary score (required)
  "public": {                        // Public metrics (visible to the LLM)
    "centers_str": "...",
    "num_circles": 26,
    "aux_packing_efficiency": 0.842,
    "aux_gap_analysis": 0.756,
    ...
  },
  "private": {                       // Private metrics (logged only)
    "reported_sum_of_radii": 2.635
  },
  "text_feedback": "..."             // (Optional) text feedback
}

// correct.json
{
  "correct": true,
  "error": null
}
```
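
Because EvolutionRunner depends on these two files, it can help to check an evaluator's output against the contract before it is consumed. The following is a minimal sketch, not part of the existing codebase: `check_results_dir` and `write_example` are hypothetical helper names, and the required/optional split follows the interface summary later in this document (`combined_score`, `public`, `correct`, and `error` required; `private` and `text_feedback` optional).

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def _load(path: Path, problems: List[str]) -> Dict[str, Any]:
    """Read a JSON object, recording a problem instead of raising."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        problems.append(f"{path.name}: unreadable ({exc})")
        return {}
    if not isinstance(data, dict):
        problems.append(f"{path.name}: top level must be an object")
        return {}
    return data


def check_results_dir(results_dir: str) -> List[str]:
    """Return contract violations; an empty list means the output is valid."""
    problems: List[str] = []
    root = Path(results_dir)
    metrics = _load(root / "metrics.json", problems)
    correct = _load(root / "correct.json", problems)

    score = metrics.get("combined_score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("metrics.json: combined_score must be a number")
    if not isinstance(metrics.get("public"), dict):
        problems.append("metrics.json: public must be an object")
    if not isinstance(metrics.get("private", {}), dict):
        problems.append("metrics.json: private must be an object if present")
    if not isinstance(metrics.get("text_feedback", ""), str):
        problems.append("metrics.json: text_feedback must be a string if present")

    if not isinstance(correct.get("correct"), bool):
        problems.append("correct.json: correct must be a boolean")
    if "error" not in correct or not isinstance(correct["error"], (str, type(None))):
        problems.append("correct.json: error must be a string or null")
    return problems


def write_example(results_dir: str) -> None:
    """Write a minimal valid pair of files (values copied from the example above)."""
    root = Path(results_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "metrics.json").write_text(
        json.dumps({"combined_score": 2.635, "public": {"num_circles": 26}})
    )
    (root / "correct.json").write_text(json.dumps({"correct": True, "error": None}))
```

A check like this could run at the end of every agent mode, so that a misbehaving LLM step cannot silently produce files the runner fails to parse.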

**Database schema** (Program table):
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Program:
    # Identity
    id: str
    code: str
    generation: int
    parent_id: Optional[str]

    # Evaluation results (written by the evaluation)
    combined_score: float
    public_metrics: Dict[str, Any]
    private_metrics: Dict[str, Any]
    text_feedback: str
    correct: bool

    # Auxiliary data
    embedding: List[float]
    metadata: Dict[str, Any]

    # Evolutionary relations
    archive_inspiration_ids: List[str]
    top_k_inspiration_ids: List[str]
    children_count: int
```

---

## Agent Conversion Plan

### Core Design Principles

**How an agent differs from a script:**
1. **Autonomous decisions**: the agent chooses its analysis strategy based on context
2. **Dynamic tool use**: the agent can call different tools and generate new code
3. **History awareness**: the agent can query the database to learn the evolution history
4. **Meta-learning**: the agent can improve its own evaluation strategy

### Agent Architecture

```
EvaluationAgent (main controller)
├─ Core components
│    • LLM (decision maker)
│    • Tool Registry (set of callable tools)
│    • Database Access (reads historical data)
│    • Code Executor (safely runs generated code)
└─ Workflow
     1. Receive an evaluation request (program_path, results_dir)
     2. Query the database for context
     3. LLM plans the evaluation strategy
     4. Execute evaluation steps (call tools / generate code)
     5. Aggregate results and generate feedback
     6. (Optional) Update metadata in the database
     7. Save the standard output files

Tools available to the agent:

Ground Truth Evaluation
  • Run the program
  • Check constraints
  • Compute the primary score
Auxiliary Metrics
  • Predefined metrics
  • Registry system
Dynamic Metric Generator
  • LLM generates code
  • Compile and execute
  • Security sandbox
Database Query
  • Query history
  • Statistical analysis
  • Compare programs
Visualization
  • Generate charts
  • Save visualizations
Meta Analysis
  • Trend analysis
  • Strategy recommendation

Database access permissions:

Database Interface (ProgramDatabase)
├─ READ operations (available to the agent)
│    • get_all_programs()
│    • get_programs_by_generation(gen)
│    • get_top_programs(n, metric)
│    • get_best_program(metric)
│    • get_program(id)
│    • Custom SQL queries (restricted)
└─ WRITE operations (use with care)
     • Only the metadata field may be written
     • Core fields such as combined_score and correct cannot be modified
     • Extra analysis results may be added to metadata
```

---

## Agent External Interface Design

### 1. Command-Line Interface (Backward Compatible)

```bash
# Basic interface (fully compatible with the current script)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results

# Extended interface (new features)
#   --db_path                 database the agent may access
#   --agent_mode              evaluation mode: static | adaptive | exploratory
#   --enable_dynamic_metrics  allow the agent to generate new metrics
#   --feedback_style          feedback style: minimal | normal | detailed
# (Comments are listed here because a comment after a trailing backslash
#  would break the line continuation.)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results \
  --db_path path/to/evolution.sqlite \
  --agent_mode adaptive \
  --enable_dynamic_metrics \
  --feedback_style detailed
```
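
The flags above map naturally onto `argparse`. The sketch below is one possible parser, not the project's actual entry point: `build_parser` is a hypothetical name, and the defaults (`static` mode, `normal` feedback, 20 tool calls) are assumptions chosen so that a call with only the two required arguments behaves like the existing script. `--max_tool_calls` is taken from the extended interface summary later in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the basic and extended evaluation-agent flags."""
    parser = argparse.ArgumentParser(prog="evaluation_agent.py")

    # Required arguments, identical to the existing evaluation script.
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)

    # Extended arguments; all optional so the basic invocation keeps working.
    parser.add_argument("--db_path", default=None)
    parser.add_argument(
        "--agent_mode",
        choices=["static", "adaptive", "exploratory"],
        default="static",  # assumed default: behave like the old script
    )
    parser.add_argument("--enable_dynamic_metrics", action="store_true")
    parser.add_argument(
        "--feedback_style",
        choices=["minimal", "normal", "detailed"],
        default="normal",  # assumed default
    )
    parser.add_argument("--max_tool_calls", type=int, default=20)
    return parser
```

Making every new flag optional is what keeps the basic invocation compatible: JobScheduler can keep issuing the old two-argument command unchanged.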

### 2. Python API

```python
from shinka.evaluation import EvaluationAgent, AgentConfig

# Configure the agent
agent_config = AgentConfig(
    # LLM settings
    llm_model="native-gemini-2.5-pro",
    llm_temperature=0.7,

    # Evaluation mode
    mode="adaptive",  # static | adaptive | exploratory

    # Tool access permissions
    enable_ground_truth=True,              # Required
    enable_auxiliary_metrics=True,         # Predefined auxiliary metrics
    enable_dynamic_metrics=True,           # LLM-generated metrics
    enable_database_read=True,             # Read historical data
    enable_database_write_metadata=False,  # Write metadata

    # Safety settings
    code_execution_timeout=30,  # Timeout for generated code
    max_tool_calls=20,          # Maximum number of tool calls
    sandboxed_execution=True,   # Run generated code in a sandbox

    # Output settings
    generate_text_feedback=True,
    save_detailed_analysis=True,
    visualization=True,
)

# Create the agent
agent = EvaluationAgent(
    config=agent_config,
    db_path="path/to/evolution.sqlite",  # Optional
)

# Run the evaluation
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results",
)

# The agent saves the standard output files automatically:
# - metrics.json
# - correct.json
# - auxiliary_analysis.json
# - (optional) agent_reasoning.json  # The agent's decision process
```

### 3. EvolutionRunner Integration

```python
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig
from shinka.evaluation import EvaluationAgentConfig  # New

# Configure the job to use the agent evaluator
job_config = LocalJobConfig(
    eval_program_path="shinka/evaluation/agent_main.py",  # Agent entry point
    extra_cmd_args={
        "agent_mode": "adaptive",
        "enable_dynamic_metrics": True,
        "db_path": "auto",  # Pass the database path automatically
    },
)

# Database configuration
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

# Evolution configuration
evo_config = EvolutionConfig(
    # ... other settings
    use_text_feedback=True,  # Consume the feedback generated by the agent
)

# At run time the agent automatically receives:
# 1. The program path of the current generation
# 2. Database access (through the --db_path argument)
# 3. Information about historical programs
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```

---

## Agent Tool System Design

### Tool Interface Specification

```python
from typing import Any, Dict, Optional
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Result of a tool execution."""
    success: bool
    data: Any = None              # Defaults to None so failures can omit it
    error: Optional[str] = None
    cost: float = 0.0             # API cost


class Tool:
    """Base class for tools."""
    name: str
    description: str
    parameters: Dict[str, Any]    # JSON Schema

    def execute(self, **kwargs) -> ToolResult:
        """Run the tool's logic."""
        raise NotImplementedError
```
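
The roadmap mentions a tool registration system on top of this base class. The sketch below shows one way it might look; `ToolRegistry` and `SumRadiiTool` are hypothetical names introduced only for illustration, and enforcing `max_tool_calls` inside the registry is an assumption about where that budget would live.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0


class Tool:
    name: str = ""
    description: str = ""

    def execute(self, **kwargs) -> ToolResult:
        raise NotImplementedError


class SumRadiiTool(Tool):
    """Toy tool used only to exercise the registry."""
    name = "sum_radii"
    description = "Return the sum of the given radii"

    def execute(self, radii: List[float]) -> ToolResult:
        return ToolResult(success=True, data=sum(radii))


class ToolRegistry:
    """Maps tool names to instances and enforces a call budget."""

    def __init__(self, max_tool_calls: int = 20):
        self._tools: Dict[str, Tool] = {}
        self.max_tool_calls = max_tool_calls
        self.calls_made = 0

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> ToolResult:
        """Run a tool, converting every failure into a ToolResult."""
        if self.calls_made >= self.max_tool_calls:
            return ToolResult(success=False, error="tool call budget exhausted")
        tool = self._tools.get(name)
        if tool is None:
            return ToolResult(success=False, error=f"unknown tool: {name}")
        self.calls_made += 1
        try:
            return tool.execute(**kwargs)
        except Exception as exc:  # a failing tool must not crash the agent
            return ToolResult(success=False, error=f"{type(exc).__name__}: {exc}")
```

Returning a `ToolResult` for every failure, including bad arguments chosen by the LLM, lets the planning loop treat errors as data it can react to.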

### Core Tool List

```python
from typing import Dict, List, Optional

# ============================================================================
# 1. GROUND TRUTH EVALUATION (required tools)
# ============================================================================

class RunProgramTool(Tool):
    """Run the program under evaluation and collect its raw results."""
    name = "run_program"
    description = "Execute the program and get raw results (centers, radii, score)"

    def execute(self, program_path: str, num_runs: int = 1) -> ToolResult:
        # Calls the underlying logic of run_shinka_eval.
        # Returns: centers, radii, reported_score
        raise NotImplementedError


class ValidateResultsTool(Tool):
    """Check whether the program output satisfies the constraints."""
    name = "validate_results"
    description = "Validate if results satisfy all constraints"

    def execute(self, centers, radii) -> ToolResult:
        # Calls adapted_validate_packing.
        # Returns: is_valid, error_message
        raise NotImplementedError

# ============================================================================
# 2. AUXILIARY METRICS (predefined analysis tools)
# ============================================================================

class ComputeMetricTool(Tool):
    """Compute a predefined auxiliary metric."""
    name = "compute_metric"
    description = "Compute a predefined auxiliary metric"
    parameters = {
        "metric_name": {
            "type": "string",
            "enum": ["packing_efficiency", "gap_analysis", "edge_utilization", ...]
        }
    }

    def execute(self, metric_name: str, centers, radii) -> ToolResult:
        # Calls METRIC_REGISTRY.get(metric_name)
        raise NotImplementedError


class ListMetricsTool(Tool):
    """List all available predefined metrics."""
    name = "list_metrics"

    def execute(self) -> ToolResult:
        return ToolResult(
            success=True,
            data=METRIC_REGISTRY.list_metrics()
        )

# ============================================================================
# 3. DATABASE ACCESS (historical data tools)
# ============================================================================

class QueryDatabaseTool(Tool):
    """Query the database for information about historical programs."""
    name = "query_database"
    description = "Query historical programs from database"
    parameters = {
        "query_type": {
            "type": "string",
            "enum": ["top_programs", "by_generation", "best_program", "all"]
        },
        "filters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "n": {"type": "integer"},
                "generation": {"type": "integer"}
            }
        }
    }

    def execute(self, query_type: str, filters: Dict) -> ToolResult:
        metric = filters.get("metric", "combined_score")
        if query_type == "top_programs":
            programs = self.db.get_top_programs(n=filters.get("n", 10), metric=metric)
        elif query_type == "by_generation":
            programs = self.db.get_programs_by_generation(filters["generation"])
        elif query_type == "best_program":
            programs = [self.db.get_best_program(metric)]
        elif query_type == "all":
            programs = self.db.get_all_programs()
        else:
            # Without this branch `programs` would be undefined below.
            return ToolResult(success=False, data=None,
                              error=f"Unknown query_type: {query_type}")

        return ToolResult(
            success=True,
            data=[p.to_dict() for p in programs]
        )


class CompareWithHistoryTool(Tool):
    """Compare the current program with historical programs."""
    name = "compare_with_history"

    def execute(self, current_metrics: Dict, comparison_type: str) -> ToolResult:
        # comparison_type: "best" | "parent" | "generation_average"
        # Returns the comparison analysis.
        raise NotImplementedError

# ============================================================================
# 4. DYNAMIC METRIC GENERATION (LLM generates new metrics)
# ============================================================================

class GenerateMetricCodeTool(Tool):
    """Ask the LLM to generate code for a new evaluation metric."""
    name = "generate_metric_code"
    description = "Generate Python code for a new evaluation metric"
    parameters = {
        "metric_purpose": {"type": "string"},
        "inspiration_from": {"type": "string"}  # Existing metric to use as a reference
    }

    def execute(self, metric_purpose: str, inspiration_from: Optional[str] = None) -> ToolResult:
        # Call the LLM to generate code for the new metric,
        # using the LLMGeneratedMetric framework.
        # The example inside the prompt is indented rather than fenced so that
        # it does not terminate the surrounding Markdown code block.
        prompt = f"""
Generate a Python function to compute a new auxiliary metric for circle packing.

Purpose: {metric_purpose}

Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Return MetricResult with name, value, interpretation, description, details
3. Use numpy for computations
4. Handle edge cases gracefully

Example structure:

    def my_metric(centers, radii):
        # Your analysis logic here
        score = ...

        return MetricResult(
            name="my_metric",
            value=float(score),
            interpretation="higher_better",
            description="What this metric measures",
            details={{"key": "value"}}
        )
"""

        llm_response = self.llm.query(prompt)
        code = extract_code_from_response(llm_response)

        return ToolResult(
            success=True,
            data={"code": code, "cost": llm_response.cost}
        )


class CompileAndTestMetricTool(Tool):
    """Compile and test metric code generated by the LLM."""
    name = "compile_and_test_metric"

    def execute(self, code: str, test_data: Dict) -> ToolResult:
        metric = LLMGeneratedMetric(
            name="llm_metric",
            code=code,
            description="LLM generated metric",
            interpretation="higher_better"
        )

        if not metric.compile():
            return ToolResult(success=False, data=None, error="Compilation failed")

        # Trial run on the test data
        try:
            result = metric.evaluate(
                centers=test_data["centers"],
                radii=test_data["radii"]
            )
            return ToolResult(success=True, data=result)
        except Exception as e:
            return ToolResult(success=False, data=None, error=str(e))

# ============================================================================
# 5. VISUALIZATION & ANALYSIS (analysis tools)
# ============================================================================

class VisualizeTool(Tool):
    """Generate visualizations."""
    name = "visualize"

    def execute(self, vis_type: str, data: Dict, output_path: str) -> ToolResult:
        # vis_type: "packing" | "metrics_trend" | "comparison"
        raise NotImplementedError


class StatisticalAnalysisTool(Tool):
    """Statistical analysis tool."""
    name = "statistical_analysis"

    def execute(self, data: List[float], analysis_type: str) -> ToolResult:
        # analysis_type: "trend" | "distribution" | "correlation"
        raise NotImplementedError

# ============================================================================
# 6. META OPERATIONS
# ============================================================================

class UpdateMetadataTool(Tool):
    """Update the metadata field of a program."""
    name = "update_metadata"
    description = "Add analysis results to program metadata (write to DB)"

    def execute(self, program_id: str, metadata: Dict) -> ToolResult:
        # Only the metadata field may be written; core evaluation fields
        # must not be modified.
        program = self.db.get_program(program_id)
        if program is None:
            return ToolResult(success=False, data=None,
                              error=f"Program {program_id} not found")
        program.metadata.update(metadata)
        # Write back to the database.
        # Note: ProgramDatabase must be extended with update_program_metadata.
        self.db.update_program_metadata(program_id, program.metadata)
        return ToolResult(success=True, data=program.metadata)
```

---

## Agent Decision Flow

### Mode 1: Static Mode (Compatibility Mode)

```python
def static_evaluation(agent, program_path, results_dir):
    """
    Reproduces the behavior of the existing evaluation script exactly.
    """
    # 1. Run the program
    result = agent.tools["run_program"].execute(program_path)
    centers, radii, score = result.data

    # 2. Validate the result
    validation = agent.tools["validate_results"].execute(centers, radii)
    correct = validation.data["is_valid"]

    # 3. Compute the predefined auxiliary metrics
    auxiliary_results = {}
    for metric_name in agent.config.enabled_metrics:
        metric_result = agent.tools["compute_metric"].execute(
            metric_name, centers, radii
        )
        auxiliary_results[metric_name] = metric_result.data.value

    # 4. Generate the standard feedback
    feedback = generate_standard_feedback(auxiliary_results, score)

    # 5. Save the results
    metrics = {
        "combined_score": score,
        "public": {
            "centers_str": format_centers_string(centers),
            "num_circles": len(centers),
            **{f"aux_{k}": v for k, v in auxiliary_results.items()}
        },
        "private": {"reported_sum_of_radii": score},
        "text_feedback": feedback
    }

    save_metrics(results_dir, metrics, correct)
    return metrics, correct
```

### Mode 2: Adaptive Mode (Intelligent Mode)

```python
def adaptive_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent uses context to decide on an evaluation strategy.
    """
    # 1. Gather context
    context = agent.gather_context(program_path, db_path)

    # 2. The LLM plans the evaluation strategy
    plan = agent.llm.plan_evaluation(context)

    # Example plan:
    # {
    #     "steps": [
    #         {"action": "run_program", "params": {...}},
    #         {"action": "query_database", "params": {"query_type": "best_program"}},
    #         {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
    #         {"action": "compare_with_history", "params": {"comparison_type": "best"}},
    #         {"action": "generate_feedback", "params": {...}}
    #     ]
    # }

    # 3. Execute the plan, bounded by the configured tool-call budget
    execution_log = []
    correct = False  # Stays False unless ground-truth validation succeeds
    pending = list(plan["steps"])
    while pending and len(execution_log) < agent.config.max_tool_calls:
        step = pending.pop(0)
        tool = agent.tools[step["action"]]
        result = tool.execute(**step["params"])
        execution_log.append(result)

        # Correctness comes from the validation tool, never from the LLM
        if step["action"] == "validate_results" and result.success:
            correct = result.data["is_valid"]

        # If a step fails, the LLM can adjust the strategy.
        # Assumption: replan() returns a plan holding only the remaining steps.
        if not result.success:
            plan = agent.llm.replan(plan, execution_log, result.error)
            pending = list(plan["steps"])

    # 4. The LLM aggregates results and generates feedback
    final_metrics, feedback = agent.llm.aggregate_results(execution_log, context)
    final_metrics["text_feedback"] = feedback

    # 5. Save the results (keeps the interface compatible)
    save_metrics(results_dir, final_metrics, correct)

    # 6. (Optional) Save the agent's reasoning
    save_agent_reasoning(results_dir, plan, execution_log)

    return final_metrics, correct
```
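
The execute-and-replan loop is the part of adaptive mode that is easiest to get subtly wrong, so it is worth isolating. The sketch below shows the control flow without any LLM: `run_plan` is a hypothetical helper, tools are plain callables, and `replan` is a caller-supplied function standing in for the LLM. It assumes a replan receives the remaining steps plus the log and returns the new remaining steps.

```python
from collections import deque
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]
LogEntry = Dict[str, Any]


def run_plan(
    steps: List[Step],
    tools: Dict[str, Callable[..., Any]],
    replan: Callable[[List[Step], List[LogEntry]], List[Step]],
    max_tool_calls: int = 20,
) -> List[LogEntry]:
    """Execute steps in order; after a failure, ask `replan` for new steps."""
    queue = deque(steps)
    log: List[LogEntry] = []
    while queue and len(log) < max_tool_calls:
        step = queue.popleft()
        try:
            # An unknown action raises KeyError and is logged like any failure.
            output = tools[step["action"]](**step.get("params", {}))
            entry = {"step": step, "success": True, "output": output, "error": None}
        except Exception as exc:
            entry = {"step": step, "success": False, "output": None, "error": str(exc)}
        log.append(entry)
        if not entry["success"]:
            # Replace the remaining steps with whatever the planner returns.
            queue = deque(replan(list(queue), log))
    return log
```

Counting every attempt against `max_tool_calls`, including failed ones, guarantees termination even if the planner keeps proposing steps that fail.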

### Mode 3: Exploratory Mode

```python
def exploratory_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent actively explores new evaluation methods.
    """
    # 1. Standard evaluation
    base_metrics, correct = adaptive_evaluation(agent, program_path, results_dir, db_path)

    # 2. Analyze the historical trend
    trend_analysis = agent.tools["statistical_analysis"].execute(
        data=get_historical_scores(agent.db),
        analysis_type="trend"
    )

    # 3. If an evaluation blind spot is found, generate a new metric
    if agent.detect_evaluation_gap(trend_analysis):
        # The LLM generates the code of the new metric
        new_metric_code = agent.tools["generate_metric_code"].execute(
            metric_purpose="Identify patterns missed by existing metrics"
        )

        # Raw packing data is needed to test the metric; here it is obtained
        # by running the program again through the ground-truth tool.
        run = agent.tools["run_program"].execute(program_path)
        centers, radii, _ = run.data

        # Compile and test
        test_result = agent.tools["compile_and_test_metric"].execute(
            code=new_metric_code.data["code"],
            test_data={"centers": centers, "radii": radii}
        )

        if test_result.success:
            # Register the new metric in the global registry
            register_new_metric(new_metric_code.data["code"])

            # Re-evaluate with the new metric included
            extended_metrics = compute_with_new_metric(centers, radii)
            base_metrics["public"].update(extended_metrics)

    # 4. Save the extended results
    save_metrics(results_dir, base_metrics, correct)

    return base_metrics, correct
```
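
`detect_evaluation_gap` is left unspecified above. One simple signal, consistent with the later usage example where the agent notices that existing metrics have plateaued, is a stalled best score. The sketch below is only one possible heuristic: `is_plateau`, the window of 10 generations, and the tolerance of `1e-3` are all assumptions to be tuned.

```python
from typing import Sequence


def is_plateau(best_scores: Sequence[float], window: int = 10, tol: float = 1e-3) -> bool:
    """Return True if the best score improved by less than `tol`
    over the last `window` generations.

    `best_scores[i]` is the best combined_score observed in generation i.
    """
    if len(best_scores) <= window:
        return False  # not enough history to judge
    earlier_best = max(best_scores[:-window])
    recent_best = max(best_scores[-window:])
    return (recent_best - earlier_best) < tol
```

A plateau in the primary score does not prove that the metrics have a blind spot; it only suggests that spending an LLM call on a new metric might be worthwhile at this point in the run.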

---

## Security Design

### Code Execution Sandbox

```python
from typing import Any, Dict

import numpy


class SecurityError(Exception):
    """Raised when generated code contains a forbidden operation."""


class SafeCodeExecutor:
    """A restricted environment for executing generated code."""

    def __init__(self, timeout=30):
        self.timeout = timeout
        self.allowed_imports = {
            'numpy', 'scipy', 'math', 'statistics'
        }
        self.forbidden_operations = {
            '__import__', 'eval', 'exec', 'compile',
            'open', 'input'
        }

    def execute(self, code: str, inputs: Dict) -> Any:
        """Execute code in a restricted environment."""
        # 1. Static analysis check (has_forbidden_operations is still to be implemented)
        if self.has_forbidden_operations(code):
            raise SecurityError("Forbidden operations detected")

        # 2. Create a restricted namespace
        namespace = {
            'np': numpy,
            'MetricResult': MetricResult,
            # ... expose only the modules that are needed
        }
        namespace.update(inputs)

        # 3. Execute with a timeout (timeout() is a context manager still to be provided)
        # Note: exec() injects the full builtins unless namespace defines '__builtins__'.
        with timeout(self.timeout):
            exec(code, namespace)

        return namespace
```
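
The static check and the restricted namespace can be made concrete with the standard `ast` module. The sketch below is illustrative only: `find_violations`, `run_restricted`, and the `SAFE_BUILTINS` whitelist are hypothetical, the timeout step is omitted, and in-process filtering like this is not a real security boundary. Untrusted code should additionally run in a separate process or container.

```python
import ast
import math
from typing import Any, Dict, List

ALLOWED_IMPORTS = {"numpy", "scipy", "math", "statistics"}
FORBIDDEN_NAMES = {"__import__", "eval", "exec", "compile", "open", "input"}


class SecurityError(Exception):
    """Raised when generated code violates the sandbox policy."""


def find_violations(code: str) -> List[str]:
    """Walk the syntax tree and list every policy violation."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            module = (node.module or "").split(".")[0]
            if node.level or module not in ALLOWED_IMPORTS:
                problems.append(f"import from {node.module}")
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            problems.append(f"use of {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute {node.attr}")
    return problems


def _guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Import hook that only lets whitelisted top-level modules through."""
    if level != 0 or name.split(".")[0] not in ALLOWED_IMPORTS:
        raise SecurityError(f"import of {name} is not allowed")
    return __import__(name, globals, locals, fromlist, level)


# Small whitelist of builtins; everything else is unavailable to the code.
SAFE_BUILTINS: Dict[str, Any] = {
    "abs": abs, "enumerate": enumerate, "float": float, "int": int,
    "len": len, "max": max, "min": min, "range": range, "sorted": sorted,
    "sum": sum, "zip": zip, "__import__": _guarded_import,
}


def run_restricted(code: str, inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Check the code statically, then exec it with whitelisted builtins."""
    problems = find_violations(code)
    if problems:
        raise SecurityError("; ".join(problems))
    namespace: Dict[str, Any] = {"__builtins__": SAFE_BUILTINS}
    namespace.update(inputs)
    exec(compile(code, "<generated_metric>", "exec"), namespace)
    return namespace
```

Parsing with `ast` avoids the false positives of substring matching (for example a variable named `reopened`), and setting `__builtins__` explicitly prevents `exec` from exposing the full builtin set.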

### Database Access Control

```python
class RestrictedDatabaseAccess:
    """A restricted database access interface."""

    def __init__(self, db: ProgramDatabase):
        self.db = db
        self.read_only_methods = [
            'get_all_programs', 'get_programs_by_generation',
            'get_top_programs', 'get_best_program', 'get_program'
        ]
        self.write_allowed_fields = ['metadata']  # Only metadata may be written

    def __getattr__(self, name):
        if name in self.read_only_methods:
            return getattr(self.db, name)
        else:
            raise PermissionError(f"Method {name} not allowed for agent")

    def update_metadata(self, program_id: str, metadata: Dict):
        """The only write operation that is allowed."""
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # This method needs to be added to ProgramDatabase
            self.db.update_program_metadata(program_id, program.metadata)
```
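
The wrapper above restricts access at the Python level. When metadata writes are disabled (as with `enable_database_write_metadata=False` in the configuration example), the same guarantee can also be enforced by SQLite itself by opening the file read-only through a URI. The sketch below uses only the standard `sqlite3` module; the `programs` table with two columns is a made-up stand-in for the real schema, and the helper names are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_demo_db(db_path: str) -> None:
    """Create a tiny stand-in database with one program row."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute("CREATE TABLE programs (id TEXT PRIMARY KEY, combined_score REAL)")
        conn.execute("INSERT INTO programs VALUES (?, ?)", ("gen_1", 2.5))
    conn.close()


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that every write attempt fails."""
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)


def try_write(conn: sqlite3.Connection) -> bool:
    """Return True if a write succeeds, False if SQLite rejects it."""
    try:
        conn.execute("UPDATE programs SET combined_score = 99.0")
        return True
    except sqlite3.OperationalError:
        return False
```

With a read-only connection, even a bug in the wrapper or a custom SQL query cannot alter `combined_score` or `correct`.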

---

## Data Flow Between the Agent and Its Environment

```
EvolutionRunner (main system), for each generation:
├─ Generate new code: gen_N/main.py
├─ Invoke the agent evaluation
│    │
│    ▼
│  EvaluationAgent (separate process)
│    Inputs:
│      • program_path
│      • results_dir
│      • db_path (optional)
│    Internal flow:
│      1. Load the program
│      2. Run the evaluation
│      3. Query DB history          (reads the database)
│      4. LLM planning
│      5. Tool calls
│      6. Aggregate results
│      7. Save outputs              (optionally writes metadata)
│    Output files:
│      • metrics.json
│      • correct.json
│      • agent_reasoning.json
│    │
│    ▼
├─ Read the evaluation results
│    • combined_score
│    • public_metrics (including aux metrics)
│    • text_feedback
├─ Write to the database (ProgramDatabase)
│    • Create a new Program record
│    • Save all metrics
│    • Update the archive
└─ Select parent → next generation

Database schema (shared state):

SQLite: evolution_db.sqlite
└─ programs table
   ├─ id (gen_N)
   ├─ code
   ├─ generation (N)
   ├─ combined_score   ← written by EvolutionRunner
   ├─ public_metrics   ← written by EvolutionRunner
   ├─ text_feedback    ← written by EvolutionRunner
   ├─ correct          ← written by EvolutionRunner
   └─ metadata         ← writable by the agent (optional)
        {
          "agent_analysis": {...},
          "custom_metrics": {...},
          "evaluation_reasoning": "..."
        }

The agent can read all historical data but can write only the metadata field.
```

---

## Summary of the Agent's External Interfaces

### Required Interfaces (Backward Compatible)

```python
# 1. Command-line arguments
--program_path: str           # Required
--results_dir: str            # Required

# 2. Output files (standard contract)
metrics.json: {
    "combined_score": float,  # Required
    "public": dict,           # Required
    "private": dict,          # Optional
    "text_feedback": str      # Optional (when use_text_feedback=True)
}
correct.json: {
    "correct": bool,          # Required
    "error": str | null       # Required
}
```

### Extended Interfaces (Agent Features)

```python
# 1. Database access
--db_path: str  # Optional; when given, the agent can read historical data

# 2. Agent mode settings
--agent_mode: str  # static | adaptive | exploratory
--enable_dynamic_metrics: bool
--max_tool_calls: int

# 3. Additional output files
agent_reasoning.json: {  # The agent's decision process (for debugging and analysis)
    "plan": [...],
    "execution_log": [...],
    "tool_costs": {...},
    "total_cost": float
}

auxiliary_analysis.json  # Detailed auxiliary analysis (already exists)

visualizations/  # Visualization files (optional)
├─ packing_viz.png
├─ metrics_trend.png
└─ comparison.png
```

### Python API

```python
# 1. Agent class interface
class EvaluationAgent:
    def __init__(
        self,
        config: AgentConfig,
        db_path: Optional[str] = None
    ):
        pass

    def evaluate(
        self,
        program_path: str,
        results_dir: str
    ) -> Tuple[Dict, bool, Optional[str]]:
        """
        Returns: (metrics, correct, error)
        Fully compatible with run_shinka_eval.
        """
        pass

# 2. Tool interface (used internally by the agent)
class Tool:
    def execute(self, **kwargs) -> ToolResult:
        pass

# 3. Database interface extension
class ProgramDatabase:
    # New method for use by the agent
    def update_program_metadata(
        self,
        program_id: str,
        metadata: Dict
    ) -> bool:
        pass
```
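
As a rough idea of what `update_program_metadata` might do on top of SQLite, the sketch below merges new keys into an existing metadata value and touches no other column. It is written as a free function over a `sqlite3` connection for brevity, and it assumes that `metadata` is stored as a JSON text column in a `programs` table; the real ProgramDatabase storage layout may differ.

```python
import json
import sqlite3
from typing import Any, Dict


def update_program_metadata(
    conn: sqlite3.Connection, program_id: str, metadata: Dict[str, Any]
) -> bool:
    """Merge `metadata` into the stored metadata of one program.

    Returns False if the program does not exist. Only the metadata
    column is written, so core evaluation fields stay untouched.
    """
    row = conn.execute(
        "SELECT metadata FROM programs WHERE id = ?", (program_id,)
    ).fetchone()
    if row is None:
        return False
    merged = json.loads(row[0] or "{}")
    merged.update(metadata)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE programs SET metadata = ? WHERE id = ?",
            (json.dumps(merged), program_id),
        )
    return True
```

Keeping the SQL limited to a single `SET metadata = ?` clause means the write restriction is visible in the statement itself rather than depending on caller discipline.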

---

## Implementation Roadmap

### Phase 1: Basic Agent Framework (2-3 days)

```
[ ] 1. Create the EvaluationAgent class skeleton
[ ] 2. Implement the Tool base class and the tool registry
[ ] 3. Refactor the existing evaluation code into tools
       - RunProgramTool
       - ValidateResultsTool
       - ComputeMetricTool
[ ] 4. Implement static_mode (fully compatible with existing behavior)
[ ] 5. Unit tests
```

### Phase 2: Database Integration (1-2 days)

```
[ ] 1. Create the RestrictedDatabaseAccess interface
[ ] 2. Implement the database query tools
       - QueryDatabaseTool
       - CompareWithHistoryTool
[ ] 3. Extend ProgramDatabase with update_program_metadata()
[ ] 4. Integration tests
```

### Phase 3: Adaptive Mode (3-4 days)

```
[ ] 1. Implement the LLM planning logic
[ ] 2. Context gathering (analysis of historical data)
[ ] 3. Dynamic tool calls
[ ] 4. Result aggregation and feedback generation
[ ] 5. End-to-end tests
```

### Phase 4: Dynamic Metrics (2-3 days)

```
[ ] 1. Implement GenerateMetricCodeTool
[ ] 2. SafeCodeExecutor sandbox
[ ] 3. Dynamic metric registration and validation
[ ] 4. Exploratory mode implementation
[ ] 5. Security tests
```

### Phase 5: Visualization and Analysis (1-2 days)

```
[ ] 1. VisualizeTool
[ ] 2. StatisticalAnalysisTool
[ ] 3. Visualization of the agent's reasoning
```

### Phase 6: Production Readiness (2-3 days)

```
[ ] 1. Performance optimization
[ ] 2. Error handling and recovery
[ ] 3. Logging and monitoring
[ ] 4. Documentation
[ ] 5. Integration into EvolutionRunner
```

**Total: 11-17 days of development time**

---

## Usage Examples

### Example 1: Static Mode (Fully Compatible)

```python
from shinka.evaluation import EvaluationAgent, AgentConfig

config = AgentConfig(mode="static")
agent = EvaluationAgent(config)

metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)

# The output is identical to that of the existing evaluate_with_auxiliary.py
```

### Example 2: Adaptive Mode (Intelligent Evaluation)

```python
config = AgentConfig(
    mode="adaptive",
    enable_database_read=True,
    llm_model="native-gemini-2.5-pro"
)

agent = EvaluationAgent(
    config=config,
    db_path="evolution_db.sqlite"
)

metrics, correct, error = agent.evaluate(
    program_path="gen_100/main.py",
    results_dir="gen_100/results"
)

# The agent will:
# 1. Query the best programs of the previous 99 generations
# 2. Analyze how the current program improves on that history
# 3. Select the most relevant auxiliary metrics
# 4. Generate feedback tailored to this program
```

### Example 3: Exploratory Mode (Automatic Discovery of New Metrics)

```python
config = AgentConfig(
    mode="exploratory",
    enable_dynamic_metrics=True,
    enable_database_read=True
)

agent = EvaluationAgent(config, db_path="evolution_db.sqlite")

metrics, correct, error = agent.evaluate(
    program_path="gen_150/main.py",
    results_dir="gen_150/results"
)

# The agent might:
# 1. Notice that all existing metrics have plateaued
# 2. Generate a new metric that detects a "corner circle size pattern"
# 3. Check how the new metric correlates with the primary score
# 4. If it is useful, register it in the global registry for later use
```

---

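## Benefits and Impact

### Improvements to the Evolution System

1. **Smarter evaluation**: the agent can adjust its evaluation strategy to the stage of evolution
2. **Adaptive feedback**: targeted suggestions for the specific problems of the current generation
3. **Automatic discovery**: exploring new evaluation dimensions beyond the limits of hand-designed metrics
4. **Explainability**: the agent's reasoning is traceable, which simplifies debugging

### Preserving Compatibility

1. **Interface compatibility**: the existing input/output contract is followed exactly
2. **Incremental adoption**: start in static mode and enable advanced features step by step
3. **Controllable cost**: the agent's compute budget is configurable
4. **Non-disruptive**: the reproducibility of existing experiments is not affected

---
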
## Summary

### The Agent's Core External Interfaces

```
Input interface:
├─ Required: program_path, results_dir
└─ Optional: db_path, agent_config

Output interface:
├─ Required: metrics.json, correct.json
└─ Optional: agent_reasoning.json, visualizations/

Database interface:
├─ READ: all historical program data
└─ WRITE: only the program.metadata field

Tool interface:
├─ Ground Truth: run and validate the program
├─ Auxiliary Metrics: predefined analysis metrics
├─ Database: query historical data
├─ Dynamic: generate new metrics
└─ Visualization: analysis and visualization
```

### Key Design Principles

1. **Interface compatibility first**: the agent must be a complete drop-in replacement for the existing evaluation script
2. **Security**: code execution sandbox and database permission control
3. **Extensibility**: the tool system allows new capabilities to be added over time
4. **Observability**: the agent's decision process can be traced and debugged
5. **Controllable cost**: configuration balances the degree of intelligence against compute cost

### Implementation Feasibility

✅ **Technically feasible**: every component has a mature implementation approach
✅ **Architecture-friendly**: integrates seamlessly with the existing system
✅ **Incremental**: can be implemented and deployed in phases
✅ **Backward compatible**: does not break existing experiments

---

**This agent turns evaluation from a fixed pipeline into an intelligent decision process while remaining fully compatible with the existing system.**