# Evaluation Agent Design
## 📋 Feasibility Analysis
**Conclusion: fully feasible.** Turning the evaluation script into an agent is not only technically feasible; it can also significantly improve the system's adaptability and intelligence.
---
## ๐Ÿ—๏ธ ๅฝ“ๅ‰ๆžถๆž„ๅˆ†ๆž
### ๅฝ“ๅ‰Evaluationๅทฅไฝœๆต
```
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ EvolutionRunner (ๆŽงๅˆถๅ™จ) โ”‚
โ”‚ โ”œโ”€ ็”Ÿๆˆๆ–ฐไปฃ็  (gen_N/main.py) โ”‚
โ”‚ โ”œโ”€ ๆไบคjobๅˆฐJobScheduler โ”‚
โ”‚ โ””โ”€ ็ญ‰ๅพ…็ป“ๆžœ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ JobScheduler ๆ‰ง่กŒๅ‘ฝไปค: โ”‚
โ”‚ python evaluate_with_auxiliary.py \ โ”‚
โ”‚ --program_path gen_N/main.py \ โ”‚
โ”‚ --results_dir gen_N/results โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Evaluation Script (็‹ฌ็ซ‹่ฟ›็จ‹) โ”‚
โ”‚ โ”œโ”€ ๅŠ ่ฝฝ็จ‹ๅบ โ”‚
โ”‚ โ”œโ”€ ่ฟ่กŒๅฎž้ชŒ (run_packing) โ”‚
โ”‚ โ”œโ”€ ้ชŒ่ฏ็ป“ๆžœ (validate_packing) โ”‚
โ”‚ โ”œโ”€ ่ฎก็ฎ—metrics (ๅ›บๅฎš็š„7ไธชauxiliary metrics) โ”‚
โ”‚ โ”œโ”€ ็”Ÿๆˆๆ–‡ๆœฌๅ้ฆˆ โ”‚
โ”‚ โ””โ”€ ไฟๅญ˜็ป“ๆžœๅˆฐๆ–‡ไปถ: โ”‚
โ”‚ โ€ข metrics.json โ”‚
โ”‚ โ€ข correct.json โ”‚
โ”‚ โ€ข extra.npz โ”‚
โ”‚ โ€ข auxiliary_analysis.json โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ EvolutionRunner ่ฏปๅ–็ป“ๆžœ โ”‚
โ”‚ โ”œโ”€ ่งฃๆž metrics.json โ”‚
โ”‚ โ”œโ”€ ๆๅ– combined_score, public_metrics โ”‚
โ”‚ โ”œโ”€ ๅ†™ๅ…ฅๆ•ฐๆฎๅบ“ (ProgramDatabase) โ”‚
โ”‚ โ””โ”€ ็”จไบŽ้€‰ๆ‹ฉไธ‹ไธ€ไปฃ็ˆถ็จ‹ๅบ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```
### Key Interface Contracts
**Input interface** (command-line arguments):
```python
--program_path: str  # path of the program to evaluate
--results_dir: str   # directory where results are saved
--aux_config: str    # (optional) auxiliary evaluation config
```
**Output interface** (file system):
```json
# metrics.json
{
  "combined_score": 2.635,   # main score (required)
  "public": {                # public metrics (visible to the LLM)
    "centers_str": "...",
    "num_circles": 26,
    "aux_packing_efficiency": 0.842,
    "aux_gap_analysis": 0.756,
    ...
  },
  "private": {               # private metrics (logged only)
    "reported_sum_of_radii": 2.635
  },
  "text_feedback": "..."     # (optional) text feedback
}
# correct.json
{
  "correct": true,
  "error": null
}
```
**ๆ•ฐๆฎๅบ“Schema** (Program่กจ):
```python
@dataclass
class Program:
# ่บซไปฝๆ ‡่ฏ†
id: str
code: str
generation: int
parent_id: Optional[str]
# ่ฏ„ไผฐ็ป“ๆžœ (็”ฑevaluationๅ†™ๅ…ฅ)
combined_score: float
public_metrics: Dict[str, Any]
private_metrics: Dict[str, Any]
text_feedback: str
correct: bool
# ่พ…ๅŠฉๆ•ฐๆฎ
embedding: List[float]
metadata: Dict[str, Any]
# ่ฟ›ๅŒ–ๅ…ณ็ณป
archive_inspiration_ids: List[str]
top_k_inspiration_ids: List[str]
children_count: int
```
---
## 🤖 Agent Conversion Plan
### Core Design Principles
**What makes an agent different from a script:**
1. **Autonomous decisions**: the agent chooses its analysis strategy based on context
2. **Dynamic tool use**: the agent can invoke different tools and generate new code
3. **History awareness**: the agent can query the database to understand the evolution history
4. **Meta-learning**: the agent can improve its own evaluation strategy
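The script-vs-agent distinction can be sketched in a few lines: instead of a hard-coded call sequence, the controller dispatches to tools from a plan that exists as data and can change at runtime. The tool names and the `plan` stub below are purely illustrative.

```python
def run_agent_loop(tools: dict, plan: list[dict], max_tool_calls: int = 20) -> list:
    """Execute a data-driven plan by dispatching to registered tools."""
    log = []
    for step in plan[:max_tool_calls]:      # budget cap on tool calls
        tool = tools[step["action"]]        # dynamic tool dispatch
        result = tool(**step.get("params", {}))
        log.append((step["action"], result))
    return log


# A script would hard-code these calls; the agent receives them as a plan.
tools = {
    "run_program": lambda **kw: {"score": 2.635},
    "compute_metric": lambda **kw: {kw["metric_name"]: 0.842},
}
plan = [
    {"action": "run_program"},
    {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
]
log = run_agent_loop(tools, plan)
```

Because the plan is data, an LLM can emit it, truncate it against a budget, or replace it mid-run, which is exactly what the adaptive mode below does.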
### Agent Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ EvaluationAgent (main controller)                            │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Core Components:                                         │ │
│ │ • LLM (decision maker)                                   │ │
│ │ • Tool Registry (set of callable tools)                  │ │
│ │ • Database Access (read/write historical data)           │ │
│ │ • Code Executor (safely run generated code)              │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Workflow:                                                │ │
│ │ 1. Receive evaluation request (program_path, results_dir)│ │
│ │ 2. Query the database for context                        │ │
│ │ 3. LLM plans the evaluation strategy                     │ │
│ │ 4. Execute the steps (call tools / generate code)        │ │
│ │ 5. Aggregate results and generate feedback               │ │
│ │ 6. (Optional) update database metadata                   │ │
│ │ 7. Save the standard output files                        │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Tools available to the agent:
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Ground Truth     │ │ Auxiliary        │ │ Dynamic Metric   │
│ Evaluation       │ │ Metrics          │ │ Generator        │
│ • run program    │ │ • predefined     │ │ • LLM-generated  │
│ • check          │ │   metrics        │ │   code           │
│   constraints    │ │ • registry       │ │ • compile & run  │
│ • compute main   │ │                  │ │ • safe sandbox   │
│   score          │ │                  │ │                  │
└──────────────────┘ └──────────────────┘ └──────────────────┘
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Database Query   │ │ Visualization    │ │ Meta Analysis    │
│ • query history  │ │ • generate plots │ │ • trend analysis │
│ • statistics     │ │ • save figures   │ │ • strategy       │
│ • compare        │ │                  │ │   suggestions    │
│   programs       │ │                  │ │                  │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Database access permissions:
┌──────────────────────────────────────────────────────────────┐
│ Database Interface (ProgramDatabase)                         │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ READ operations (available to the agent):                │ │
│ │ • get_all_programs()                                     │ │
│ │ • get_programs_by_generation(gen)                        │ │
│ │ • get_top_programs(n, metric)                            │ │
│ │ • get_best_program(metric)                               │ │
│ │ • get_program(id)                                        │ │
│ │ • custom SQL queries (restricted)                        │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ WRITE operations (use with care):                        │ │
│ │ • may only write the metadata field                      │ │
│ │ • must not modify core fields such as combined_score     │ │
│ │   and correct                                            │ │
│ │ • may attach extra analysis results to metadata          │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```
---
## 🔌 Agent External Interfaces
### 1. Command-Line Interface (backward compatible)
```bash
# Basic interface (fully compatible with the current script)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results

# Extended interface (new features)
python evaluation_agent.py \
  --program_path gen_42/main.py \
  --results_dir gen_42/results \
  --db_path path/to/evolution.sqlite \  # agent may access the database
  --agent_mode adaptive \               # evaluation mode: static|adaptive|exploratory
  --enable_dynamic_metrics \            # allow generating new metrics
  --feedback_style detailed             # feedback style: minimal|normal|detailed
```
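The backward-compatibility requirement falls out naturally if every new flag has a default. A minimal `argparse` sketch of a hypothetical `evaluation_agent.py` entry point:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting both the legacy flags and the agent extensions."""
    p = argparse.ArgumentParser(prog="evaluation_agent.py")
    # Legacy interface (required, identical to the current script)
    p.add_argument("--program_path", required=True)
    p.add_argument("--results_dir", required=True)
    p.add_argument("--aux_config", default=None)
    # Agent extensions (all optional, so existing callers keep working)
    p.add_argument("--db_path", default=None)
    p.add_argument("--agent_mode", default="static",
                   choices=["static", "adaptive", "exploratory"])
    p.add_argument("--enable_dynamic_metrics", action="store_true")
    p.add_argument("--feedback_style", default="normal",
                   choices=["minimal", "normal", "detailed"])
    return p
```

A caller passing only the two legacy flags gets `agent_mode="static"` and no database access, i.e. exactly the old behavior.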
### 2. Python API
```python
from shinka.evaluation import EvaluationAgent, AgentConfig

# Configure the agent
agent_config = AgentConfig(
    # LLM configuration
    llm_model="native-gemini-2.5-pro",
    llm_temperature=0.7,
    # Evaluation mode
    mode="adaptive",                        # static | adaptive | exploratory
    # Tool access permissions
    enable_ground_truth=True,               # required
    enable_auxiliary_metrics=True,          # predefined auxiliary metrics
    enable_dynamic_metrics=True,            # LLM-generated new metrics
    enable_database_read=True,              # read historical data
    enable_database_write_metadata=False,   # write metadata
    # Safety configuration
    code_execution_timeout=30,              # timeout for generated code
    max_tool_calls=20,                      # maximum number of tool calls
    sandboxed_execution=True,               # sandboxed execution
    # Output configuration
    generate_text_feedback=True,
    save_detailed_analysis=True,
    visualization=True,
)

# Create the agent
agent = EvaluationAgent(
    config=agent_config,
    db_path="path/to/evolution.sqlite"  # optional
)

# Run the evaluation
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)

# The agent automatically saves the standard output files:
# - metrics.json
# - correct.json
# - auxiliary_analysis.json
# - (optional) agent_reasoning.json  # the agent's decision process
```
### 3. EvolutionRunner Integration
```python
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig
from shinka.evaluation import EvaluationAgentConfig  # new

# Configure the job to use the agent evaluator
job_config = LocalJobConfig(
    eval_program_path="shinka/evaluation/agent_main.py",  # agent entry point
    extra_cmd_args={
        "agent_mode": "adaptive",
        "enable_dynamic_metrics": True,
        "db_path": "auto",  # pass the database path automatically
    }
)

# Database configuration
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

# Evolution configuration
evo_config = EvolutionConfig(
    # ... other settings
    use_text_feedback=True,  # consume the agent-generated feedback
)

# At runtime the agent automatically receives:
# 1. the program path for the current generation
# 2. database access (via the --db_path argument)
# 3. information about historical programs
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```
---
## 🛠️ Agent Tool System
### Tool Interface Specification
```python
from typing import Any, Dict, Optional
from dataclasses import dataclass

@dataclass
class ToolResult:
    """Result of a tool execution"""
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0  # API cost

class Tool:
    """Tool base class"""
    name: str
    description: str
    parameters: Dict[str, Any]  # JSON Schema

    def execute(self, **kwargs) -> ToolResult:
        """Run the tool's logic"""
        raise NotImplementedError
```
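A concrete instance of this interface, with a trivial `EchoTool` and a `ToolRegistry` for name-based dispatch. Both names are illustrative (the design above only mentions a "Tool Registry" component); the base classes are repeated so the snippet is self-contained.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error: Optional[str] = None
    cost: float = 0.0


class Tool:
    name: str = ""
    description: str = ""

    def execute(self, **kwargs) -> ToolResult:
        raise NotImplementedError


class EchoTool(Tool):
    """Trivial tool, used here only to show registration and dispatch."""
    name = "echo"
    description = "Return its arguments unchanged"

    def execute(self, **kwargs) -> ToolResult:
        return ToolResult(success=True, data=kwargs)


class ToolRegistry:
    """Name-based dispatch: unknown tools fail soft with a ToolResult."""
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs) -> ToolResult:
        if name not in self._tools:
            return ToolResult(success=False, error=f"unknown tool: {name}")
        return self._tools[name].execute(**kwargs)
```

Failing soft (a `ToolResult` with `success=False` rather than an exception) matters for the adaptive mode, where the LLM inspects failed steps and replans.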
### Core Tool Inventory
```python
# ============================================================================
# 1. GROUND TRUTH EVALUATION (required tools)
# ============================================================================
class RunProgramTool(Tool):
    """Run the program under evaluation and collect raw results"""
    name = "run_program"
    description = "Execute the program and get raw results (centers, radii, score)"

    def execute(self, program_path: str, num_runs: int = 1) -> ToolResult:
        # Reuses the underlying logic of run_shinka_eval
        # Returns: centers, radii, reported_score
        pass

class ValidateResultsTool(Tool):
    """Check whether the program's output satisfies all constraints"""
    name = "validate_results"
    description = "Validate if results satisfy all constraints"

    def execute(self, centers, radii) -> ToolResult:
        # Calls adapted_validate_packing
        # Returns: is_valid, error_message
        pass
# ============================================================================
# 2. AUXILIARY METRICS (predefined analysis tools)
# ============================================================================
class ComputeMetricTool(Tool):
    """Compute a predefined auxiliary metric"""
    name = "compute_metric"
    description = "Compute a predefined auxiliary metric"
    parameters = {
        "metric_name": {
            "type": "string",
            "enum": ["packing_efficiency", "gap_analysis", "edge_utilization", ...]
        }
    }

    def execute(self, metric_name: str, centers, radii) -> ToolResult:
        # Calls METRIC_REGISTRY.get(metric_name)
        pass

class ListMetricsTool(Tool):
    """List all available predefined metrics"""
    name = "list_metrics"

    def execute(self) -> ToolResult:
        return ToolResult(
            success=True,
            data=METRIC_REGISTRY.list_metrics()
        )
# ============================================================================
# 3. DATABASE ACCESS (historical data tools)
# ============================================================================
class QueryDatabaseTool(Tool):
    """Query the database for historical program information"""
    name = "query_database"
    description = "Query historical programs from database"
    parameters = {
        "query_type": {
            "type": "string",
            "enum": ["top_programs", "by_generation", "best_program", "all"]
        },
        "filters": {
            "type": "object",
            "properties": {
                "metric": {"type": "string"},
                "n": {"type": "integer"},
                "generation": {"type": "integer"}
            }
        }
    }

    def execute(self, query_type: str, filters: Dict) -> ToolResult:
        if query_type == "top_programs":
            programs = self.db.get_top_programs(
                n=filters.get("n", 10),
                metric=filters.get("metric", "combined_score")
            )
        elif query_type == "by_generation":
            programs = self.db.get_programs_by_generation(filters["generation"])
        else:
            programs = []  # ... remaining query types elided
        return ToolResult(
            success=True,
            data=[p.to_dict() for p in programs]
        )

class CompareWithHistoryTool(Tool):
    """Compare the current program against historical programs"""
    name = "compare_with_history"

    def execute(self, current_metrics: Dict, comparison_type: str) -> ToolResult:
        # comparison_type: "best" | "parent" | "generation_average"
        # Returns the comparison analysis
        pass
# ============================================================================
# 4. DYNAMIC METRIC GENERATION (LLM-generated new metrics)
# ============================================================================
class GenerateMetricCodeTool(Tool):
    """Have the LLM generate code for a new evaluation metric"""
    name = "generate_metric_code"
    description = "Generate Python code for a new evaluation metric"
    parameters = {
        "metric_purpose": {"type": "string"},
        "inspiration_from": {"type": "string"}  # reference to an existing metric
    }

    def execute(self, metric_purpose: str, inspiration_from: str = None) -> ToolResult:
        # Ask the LLM to generate new metric code,
        # using the LLMGeneratedMetric framework
        prompt = f"""
Generate a Python function to compute a new auxiliary metric for circle packing.
Purpose: {metric_purpose}
Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Return MetricResult with name, value, interpretation, description, details
3. Use numpy for computations
4. Handle edge cases gracefully
Example structure:

    def my_metric(centers, radii):
        # Your analysis logic here
        score = ...
        return MetricResult(
            name="my_metric",
            value=float(score),
            interpretation="higher_better",
            description="What this metric measures",
            details={{"key": "value"}}
        )
"""
        llm_response = self.llm.query(prompt)
        code = extract_code_from_response(llm_response)
        return ToolResult(
            success=True,
            data={"code": code, "cost": llm_response.cost}
        )

class CompileAndTestMetricTool(Tool):
    """Compile and test LLM-generated metric code"""
    name = "compile_and_test_metric"

    def execute(self, code: str, test_data: Dict) -> ToolResult:
        metric = LLMGeneratedMetric(
            name="llm_metric",
            code=code,
            description="LLM generated metric",
            interpretation="higher_better"
        )
        if not metric.compile():
            return ToolResult(success=False, data=None, error="Compilation failed")
        # Test execution
        try:
            result = metric.evaluate(
                centers=test_data["centers"],
                radii=test_data["radii"]
            )
            return ToolResult(success=True, data=result)
        except Exception as e:
            return ToolResult(success=False, data=None, error=str(e))
# ============================================================================
# 5. VISUALIZATION & ANALYSIS (analysis tools)
# ============================================================================
class VisualizeTool(Tool):
    """Generate visualizations"""
    name = "visualize"

    def execute(self, vis_type: str, data: Dict, output_path: str) -> ToolResult:
        # vis_type: "packing" | "metrics_trend" | "comparison"
        pass

class StatisticalAnalysisTool(Tool):
    """Statistical analysis tool"""
    name = "statistical_analysis"

    def execute(self, data: List[float], analysis_type: str) -> ToolResult:
        # analysis_type: "trend" | "distribution" | "correlation"
        pass

# ============================================================================
# 6. META OPERATIONS
# ============================================================================
class UpdateMetadataTool(Tool):
    """Update a program's metadata field"""
    name = "update_metadata"
    description = "Add analysis results to program metadata (write to DB)"

    def execute(self, program_id: str, metadata: Dict) -> ToolResult:
        # Only the metadata field may be written; core evaluation
        # fields must not be modified.
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # Write back to the database.
            # Note: ProgramDatabase needs a new update_metadata method.
        pass
```
---
## 🧠 Agent Decision Flow
### Mode 1: Static Mode (compatibility mode)
```python
def static_evaluation(agent, program_path, results_dir):
    """
    Fully reproduces the behavior of the existing evaluation script.
    """
    # 1. Run the program
    result = agent.tools["run_program"].execute(program_path)
    centers, radii, score = result.data

    # 2. Validate the results
    validation = agent.tools["validate_results"].execute(centers, radii)
    correct = validation.data["is_valid"]

    # 3. Compute the predefined auxiliary metrics
    auxiliary_results = {}
    for metric_name in agent.config.enabled_metrics:
        metric_result = agent.tools["compute_metric"].execute(
            metric_name, centers, radii
        )
        auxiliary_results[metric_name] = metric_result.data.value

    # 4. Generate the standard feedback
    feedback = generate_standard_feedback(auxiliary_results, score)

    # 5. Save the results
    metrics = {
        "combined_score": score,
        "public": {
            "centers_str": format_centers_string(centers),
            "num_circles": len(centers),
            **{f"aux_{k}": v for k, v in auxiliary_results.items()}
        },
        "private": {"reported_sum_of_radii": score},
        "text_feedback": feedback
    }
    save_metrics(results_dir, metrics, correct)
    return metrics, correct
```
### Mode 2: Adaptive Mode
```python
def adaptive_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent chooses its evaluation strategy based on context.
    """
    # 1. Gather context
    context = agent.gather_context(program_path, db_path)

    # 2. LLM plans the evaluation strategy
    plan = agent.llm.plan_evaluation(context)
    # Example plan:
    # {
    #   "steps": [
    #     {"action": "run_program", "params": {...}},
    #     {"action": "query_database", "params": {"query_type": "best_program"}},
    #     {"action": "compute_metric", "params": {"metric_name": "packing_efficiency"}},
    #     {"action": "compare_with_history", "params": {"comparison_type": "best"}},
    #     {"action": "generate_feedback", "params": {...}}
    #   ]
    # }

    # 3. Execute the plan
    execution_log = []
    correct = False
    for step in plan["steps"]:
        tool = agent.tools[step["action"]]
        result = tool.execute(**step["params"])
        execution_log.append(result)
        # Track validity from the validation step
        if step["action"] == "validate_results" and result.success:
            correct = result.data["is_valid"]
        # If a step fails, the LLM can adjust the strategy
        if not result.success:
            plan = agent.llm.replan(plan, execution_log, result.error)

    # 4. LLM aggregates the results and generates feedback
    final_metrics, feedback = agent.llm.aggregate_results(execution_log, context)
    final_metrics["text_feedback"] = feedback

    # 5. Save the results (preserving interface compatibility)
    save_metrics(results_dir, final_metrics, correct)

    # 6. (Optional) save the agent's reasoning
    save_agent_reasoning(results_dir, plan, execution_log)
    return final_metrics, correct
```
### Mode 3: Exploratory Mode
```python
def exploratory_evaluation(agent, program_path, results_dir, db_path):
    """
    The agent actively explores new evaluation methods.
    """
    # 1. Standard evaluation
    base_metrics, correct = adaptive_evaluation(agent, program_path, results_dir, db_path)

    # 2. Analyze historical trends
    trend_analysis = agent.tools["statistical_analysis"].execute(
        data=get_historical_scores(agent.db),
        analysis_type="trend"
    )

    # 3. If an evaluation blind spot is detected, generate a new metric
    if agent.detect_evaluation_gap(trend_analysis):
        # Re-run the program to obtain raw data for testing the new metric
        run_result = agent.tools["run_program"].execute(program_path)
        centers, radii, _ = run_result.data
        # Have the LLM generate the new metric code
        new_metric_code = agent.tools["generate_metric_code"].execute(
            metric_purpose="Identify patterns missed by existing metrics"
        )
        # Compile and test it
        test_result = agent.tools["compile_and_test_metric"].execute(
            code=new_metric_code.data["code"],
            test_data={"centers": centers, "radii": radii}
        )
        if test_result.success:
            # Register the new metric in the global registry
            register_new_metric(new_metric_code.data["code"])
            # Re-evaluate, now including the new metric
            extended_metrics = compute_with_new_metric(centers, radii)
            base_metrics["public"].update(extended_metrics)

    # 4. Save the extended results
    save_metrics(results_dir, base_metrics, correct)
    return base_metrics, correct
```
---
## 🔒 Safety Design
### Code Execution Sandbox
```python
import numpy


class SecurityError(Exception):
    """Raised when generated code contains forbidden operations."""


class SafeCodeExecutor:
    """Restricted environment for running generated code"""

    def __init__(self, timeout=30):
        self.timeout = timeout
        self.allowed_imports = {
            'numpy', 'scipy', 'math', 'statistics'
        }
        self.forbidden_operations = {
            '__import__', 'eval', 'exec', 'compile',
            'open', 'file', 'input', 'raw_input'
        }

    def has_forbidden_operations(self, code: str) -> bool:
        """Naive static check for forbidden names in the source"""
        return any(op in code for op in self.forbidden_operations)

    def execute(self, code: str, inputs: Dict) -> Any:
        """Run code in a restricted environment"""
        # 1. Static analysis check
        if self.has_forbidden_operations(code):
            raise SecurityError("Forbidden operations detected")
        # 2. Build a restricted namespace
        namespace = {
            'np': numpy,
            'MetricResult': MetricResult,
            # ... expose only the modules that are needed
        }
        namespace.update(inputs)
        # 3. Execute with a timeout
        with timeout(self.timeout):
            exec(code, namespace)
        return namespace
```
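The `timeout(...)` context manager used by `SafeCodeExecutor` is assumed above but not defined. One possible sketch uses `signal.alarm`, which only works on Unix and only in the main thread; a production sandbox would more likely run generated code in a subprocess with a kill deadline.

```python
import signal
from contextlib import contextmanager


@contextmanager
def timeout(seconds: int):
    """Interrupt the enclosed block after `seconds` (Unix, main thread only)."""
    def _raise(signum, frame):
        raise TimeoutError(f"execution exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)  # schedule SIGALRM
    try:
        yield
    finally:
        signal.alarm(0)                         # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

Note that `signal.alarm` cannot interrupt a native extension call that never returns to the interpreter, which is another argument for process-level isolation of untrusted code.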
### Database Access Control
```python
class RestrictedDatabaseAccess:
    """Restricted database access interface"""

    def __init__(self, db: ProgramDatabase):
        self.db = db
        self.read_only_methods = [
            'get_all_programs', 'get_programs_by_generation',
            'get_top_programs', 'get_best_program', 'get_program'
        ]
        self.write_allowed_fields = ['metadata']  # only metadata may be written

    def __getattr__(self, name):
        if name in self.read_only_methods:
            return getattr(self.db, name)
        else:
            raise PermissionError(f"Method {name} not allowed for agent")

    def update_metadata(self, program_id: str, metadata: Dict):
        """The only permitted write operation"""
        program = self.db.get_program(program_id)
        if program:
            program.metadata.update(metadata)
            # Requires adding this method to ProgramDatabase
            self.db.update_program_metadata(program_id, program.metadata)
```
---
## 📊 Data Flow Between the Agent and the System
```
┌──────────────────────────────────────────────────────────────┐
│ EvolutionRunner (main system)                                │
│                                                              │
│ [each evolution generation]                                  │
│ ├─ Generate new code: gen_N/main.py                          │
│ ├─ Invoke agent evaluation ─────────────────┐                │
│ │                                           ▼                │
│ │                          ┌──────────────────────────────┐  │
│ │                          │ EvaluationAgent              │  │
│ │                          │ (separate process)           │  │
│ │                          │                              │  │
│ │                          │ Inputs:                      │  │
│ │                          │ • program_path               │  │
│ │                          │ • results_dir                │  │
│ │                          │ • db_path (optional)         │  │
│ │                          │                              │  │
│ │                          │ Internal flow:               │  │
│ │                          │ 1. Load the program          │  │
│ │                          │ 2. Run the evaluation        │  │
│ │  read database ◄─────────┼ 3. Query DB history          │  │
│ │                          │ 4. LLM planning              │  │
│ │                          │ 5. Tool calls                │  │
│ │                          │ 6. Aggregate results         │  │
│ │  (optional)              │                              │  │
│ │  write metadata ◄────────┼ 7. Save outputs              │  │
│ │                          │                              │  │
│ │                          │ Output files:                │  │
│ │                          │ • metrics.json               │  │
│ │                          │ • correct.json               │  │
│ │                          │ • agent_log.json             │  │
│ │                          └────────────┬─────────────────┘  │
│ │                                       │                    │
│ ├─ Read evaluation results ◄────────────┘                    │
│ │  • combined_score                                          │
│ │  • public_metrics (incl. aux metrics)                      │
│ │  • text_feedback                                           │
│ │                                                            │
│ ├─ Write to the database (ProgramDatabase)                   │
│ │  • create a new Program record                             │
│ │  • save all metrics                                        │
│ │  • update the archive                                      │
│ │                                                            │
│ └─ Select parent → next generation ────────────────────────► │
└──────────────────────────────────────────────────────────────┘
Database schema (shared state):
┌──────────────────────────────────────────────────────────────┐
│ SQLite: evolution_db.sqlite                                  │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ programs table                                           │ │
│ │ ├─ id (gen_N)                                            │ │
│ │ ├─ code                                                  │ │
│ │ ├─ generation (N)                                        │ │
│ │ ├─ combined_score  ◄── written by EvolutionRunner        │ │
│ │ ├─ public_metrics  ◄── written by EvolutionRunner        │ │
│ │ ├─ text_feedback   ◄── written by EvolutionRunner        │ │
│ │ ├─ correct         ◄── written by EvolutionRunner        │ │
│ │ │                                                        │ │
│ │ └─ metadata        ◄── may be written by the agent       │ │
│ │    {                                                     │ │
│ │      "agent_analysis": {...},                            │ │
│ │      "custom_metrics": {...},                            │ │
│ │      "evaluation_reasoning": "..."                       │ │
│ │    }                                                     │ │
│ └──────────────────────────────────────────────────────────┘ │
│                                                              │
│ The agent may read all historical data, but may write only   │
│ the metadata field.                                          │
└──────────────────────────────────────────────────────────────┘
```
---
## 🎯 Agent External Interface Summary
### Required Interfaces (backward compatible)
```python
# 1. Command-line arguments
--program_path: str  # required
--results_dir: str   # required

# 2. Output file interface (standard contract)
metrics.json: {
    "combined_score": float,  # required
    "public": dict,           # required
    "private": dict,          # optional
    "text_feedback": str      # optional (when use_text_feedback=True)
}
correct.json: {
    "correct": bool,          # required
    "error": str | null       # required
}
```
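Since any agent mode must honor this contract, a small validator can serve as a shared test fixture across all modes. A minimal sketch; `check_metrics_contract` is a hypothetical helper, not an existing function.

```python
def check_metrics_contract(metrics: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    if not isinstance(metrics.get("combined_score"), (int, float)):
        errors.append("combined_score must be a number")
    if not isinstance(metrics.get("public"), dict):
        errors.append("public must be a dict")
    # Optional fields: checked only when present
    if "private" in metrics and not isinstance(metrics["private"], dict):
        errors.append("private must be a dict")
    if "text_feedback" in metrics and not isinstance(metrics["text_feedback"], str):
        errors.append("text_feedback must be a string")
    return errors
```

Running this check on every mode's output before writing `metrics.json` would catch contract drift introduced by LLM-aggregated results.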
### Extended Interfaces (agent features)
```python
# 1. Database access
--db_path: str  # optional; when given, the agent may access historical data

# 2. Agent mode configuration
--agent_mode: str  # static | adaptive | exploratory
--enable_dynamic_metrics: bool
--max_tool_calls: int

# 3. Additional output files
agent_reasoning.json: {  # the agent's decision process (for debugging and analysis)
    "plan": [...],
    "execution_log": [...],
    "tool_costs": {...},
    "total_cost": float
}
auxiliary_analysis.json  # detailed auxiliary analysis (already exists)
visualizations/          # visualization files (optional)
├─ packing_viz.png
├─ metrics_trend.png
└─ comparison.png
```
### Python API
```python
# 1. Agent class interface
class EvaluationAgent:
    def __init__(
        self,
        config: AgentConfig,
        db_path: Optional[str] = None
    ):
        pass

    def evaluate(
        self,
        program_path: str,
        results_dir: str
    ) -> Tuple[Dict, bool, Optional[str]]:
        """
        Returns: (metrics, correct, error)
        Fully compatible with run_shinka_eval
        """
        pass

# 2. Tool interface (used internally by the agent)
class Tool:
    def execute(self, **kwargs) -> ToolResult:
        pass

# 3. Database interface extension
class ProgramDatabase:
    # New method for the agent
    def update_program_metadata(
        self,
        program_id: str,
        metadata: Dict
    ) -> bool:
        pass
```
---
## 🚀 Implementation Roadmap
### Phase 1: Basic Agent Framework (2-3 days)
```
✓ 1. Create the EvaluationAgent class skeleton
✓ 2. Implement the Tool base class and tool registry
✓ 3. Refactor the existing evaluation code into tools
   - RunProgramTool
   - ValidateResultsTool
   - ComputeMetricTool
✓ 4. Implement static mode (fully compatible with existing behavior)
✓ 5. Unit tests
```
### Phase 2: Database Integration (1-2 days)
```
✓ 1. Create the RestrictedDatabaseAccess interface
✓ 2. Implement the database query tools
   - QueryDatabaseTool
   - CompareWithHistoryTool
✓ 3. Extend ProgramDatabase.update_program_metadata()
✓ 4. Integration tests
```
### Phase 3: Adaptive Mode (3-4 days)
```
✓ 1. Implement the LLM planning logic
✓ 2. Context gathering (historical data analysis)
✓ 3. Dynamic tool invocation
✓ 4. Result aggregation and feedback generation
✓ 5. End-to-end tests
```
### Phase 4: Dynamic Metrics (2-3 days)
```
✓ 1. Implement GenerateMetricCodeTool
✓ 2. SafeCodeExecutor sandbox
✓ 3. Dynamic metric registration and validation
✓ 4. Exploratory mode implementation
✓ 5. Security tests
```
### Phase 5: Visualization and Analysis (1-2 days)
```
✓ 1. VisualizeTool
✓ 2. StatisticalAnalysisTool
✓ 3. Visualization of the agent's reasoning process
```
### Phase 6: Production Readiness (2-3 days)
```
✓ 1. Performance optimization
✓ 2. Error handling and recovery
✓ 3. Logging and monitoring
✓ 4. Documentation polish
✓ 5. Integration into EvolutionRunner
```
**Total: 11-17 days of development time**
---
## 📝 Usage Examples
### Example 1: Static Mode (fully compatible)
```python
from shinka.evaluation import EvaluationAgent, AgentConfig

config = AgentConfig(mode="static")
agent = EvaluationAgent(config)
metrics, correct, error = agent.evaluate(
    program_path="gen_42/main.py",
    results_dir="gen_42/results"
)
# Output is identical to the existing evaluate_with_auxiliary.py
```
### Example 2: Adaptive Mode (intelligent evaluation)
```python
config = AgentConfig(
    mode="adaptive",
    enable_database_read=True,
    llm_model="native-gemini-2.5-pro"
)
agent = EvaluationAgent(
    config=config,
    db_path="evolution_db.sqlite"
)
metrics, correct, error = agent.evaluate(
    program_path="gen_100/main.py",
    results_dir="gen_100/results"
)
# The agent will:
# 1. query the best programs from the previous 99 generations
# 2. analyze the current program's improvements relative to history
# 3. select the most relevant auxiliary metrics
# 4. generate tailored feedback
```
### Example 3: Exploratory Mode (automatic metric discovery)
```python
config = AgentConfig(
    mode="exploratory",
    enable_dynamic_metrics=True,
    enable_database_read=True
)
agent = EvaluationAgent(config, db_path="evolution_db.sqlite")
metrics, correct, error = agent.evaluate(
    program_path="gen_150/main.py",
    results_dir="gen_150/results"
)
# The agent might:
# 1. notice that the existing metrics have plateaued
# 2. generate a new metric to detect a "corner circle size pattern"
# 3. validate the new metric's correlation with the main score
# 4. if effective, register it in the global registry for later use
```
---
## 💡 Benefits and Impact
### Improvements to the Evolution System
1. **Smarter evaluation**: the agent adjusts its evaluation strategy to the current stage of evolution
2. **Adaptive feedback**: targeted suggestions for the specific problems of the current generation
3. **Automatic discovery**: exploring new evaluation dimensions beyond what was designed by hand
4. **Explainability**: the agent's reasoning is traceable, which simplifies debugging
### Preserving Compatibility
1. **Interface compatibility**: fully honors the existing input/output contracts
2. **Incremental adoption**: start in static mode and enable advanced features gradually
3. **Controllable cost**: the agent's compute budget is configurable
4. **Non-destructive**: does not affect the reproducibility of existing experiments
---
## 🎓 Summary
### The Agent's Core External Interfaces
```
Input interface:
├─ required: program_path, results_dir
└─ optional: db_path, agent_config
Output interface:
├─ required: metrics.json, correct.json
└─ optional: agent_reasoning.json, visualizations/
Database interface:
├─ READ: all historical program data
└─ WRITE: only the program.metadata field
Tool interface:
├─ Ground Truth: run and validate programs
├─ Auxiliary Metrics: predefined analysis metrics
├─ Database: query historical data
├─ Dynamic: generate new metrics
└─ Visualization: analysis and plotting
```
### Key Design Principles
1. **Interface compatibility first**: the agent must be a drop-in replacement for the existing evaluation script
2. **Safety**: sandboxed code execution and database permission control
3. **Extensibility**: the tool system supports adding new capabilities over time
4. **Observability**: the agent's decisions are traceable and debuggable
5. **Controllable cost**: configuration balances intelligence against compute cost
### Feasibility
✅ **Technically feasible**: every component has a mature implementation approach
✅ **Architecture-friendly**: integrates seamlessly with the existing system
✅ **Incremental**: can be implemented and deployed in phases
✅ **Backward compatible**: does not break existing experiments
---
**This agent turns evaluation from a fixed pipeline into an intelligent decision process, while remaining fully compatible with the existing system!** 🚀