# Agent Framework Comparison and Selection

## 🎯 Requirements Recap

We need an agent framework to implement the Evaluation Agent. Core requirements:

1. **Tool calling**: execute a variety of tools (run programs, query the database, generate code, etc.)
2. **LLM integration**: call multiple LLMs (Gemini, GPT, etc.)
3. **Sandboxed execution**: safely execute LLM-generated code
4. **Extensibility**: easy to add new tools
5. **Interface compatibility**: can be wrapped as a command-line tool that writes a JSON results file
6. **Cost control**: can cap API call counts and spend

---

## 📊 Open-Source Agent Framework Comparison

### 1. **OpenHands** (formerly OpenDevin)

**Repo**: https://github.com/All-Hands-AI/OpenHands

**Highlights**:
```
✅ Pros:
- Focused on code-centric tasks (a good fit for evaluation scenarios)
- Built-in Docker-based sandbox environment
- Supports multiple LLM backends
- Agents can run bash commands, read/write files, and execute Python
- Powerful tool system

❌ Cons:
- Relatively heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- Likely over-engineered for an evaluation scenario
```

**Fit**: ⭐⭐⭐☆☆ (3.5/5)
- Useful if you need complex code generation and execution
- Useful if you need an isolated sandbox environment
- But likely too heavy for a plain evaluation task

---

### 2. **LangGraph** (LangChain ecosystem)

**Repo**: https://github.com/langchain-ai/langgraph

**Highlights**:
```
✅ Pros:
- Graph state-machine design with a clear flow
- LangChain ecosystem with a rich tool library
- Lightweight and easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling capability

✅ Especially suited to:
- Agents that need complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking
```

**Example architecture**:
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define the nodes (each node is a tool call or decision point;
# run_shinka_eval, db, and llm are project objects assumed to exist)
def run_program(state):
    result = run_shinka_eval(state["program_path"])
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.set_entry_point("run_program")
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)

agent = workflow.compile()
```

**Fit**: ⭐⭐⭐⭐⭐ (5/5)
- **Strongly recommended!** An excellent match for the evaluation-agent requirements
- Lightweight yet feature-complete
- Clear flow that is easy to debug

---

### 3. **CrewAI**

**Repo**: https://github.com/joaomdmoura/crewAI

**Highlights**:
```
✅ Pros:
- Multi-agent collaboration (e.g., dedicated evaluator and analyzer roles)
- Built-in role and task system
- Easy to use

❌ Cons:
- Aimed primarily at multi-agent scenarios
- Likely over-engineered for a single evaluation agent
```

**Fit**: ⭐⭐⭐☆☆ (3/5)
- Worth considering if you later expand to multiple collaborating evaluation agents
- Probably unnecessary for the current single-agent scenario

---

### 4. **AutoGen** (Microsoft)

**Repo**: https://github.com/microsoft/autogen

**Highlights**:
```
✅ Pros:
- Backed and actively maintained by Microsoft
- Supports multi-agent conversations
- Code-execution environment
- Tool-calling system

❌ Cons:
- Designed primarily for conversational scenarios
- Relatively complex
```

**Fit**: ⭐⭐⭐☆☆ (3/5)

---

### 5. **Semantic Kernel** (Microsoft)

**Repo**: https://github.com/microsoft/semantic-kernel

**Highlights**:
```
✅ Pros:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning

❌ Cons:
- Relatively low-level; more must be implemented yourself
- Comparatively sparse documentation
```

**Fit**: ⭐⭐⭐⭐☆ (4/5)

---

### 6. **Custom Lightweight Agent Framework**

**Highlights**:
```
✅ Pros:
- Full control
- Lightweight
- Tailored to the task
- No extra dependencies

❌ Cons:
- The tool system must be built from scratch
- No ready-made patterns to lean on
```

**Fit**: ⭐⭐⭐☆☆ (3/5)

---

## 🏆 Recommended Options

### **Option A: LangGraph (Strongly Recommended) ⭐⭐⭐⭐⭐**

**Why**:
1. ✅ **Fits the requirements**: the graph state-machine design naturally matches an evaluation flow
2. ✅ **Rich tool ecosystem**: LangChain offers many ready-made tools
3. ✅ **Lightweight**: no heavy dependencies such as Docker
4. ✅ **Observable**: built-in state tracking and visualization
5. ✅ **Easy to extend**: adding a new tool only requires defining a new node
6. ✅ **Cost control**: supports token counting and budget limits
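The cost-control point can be prototyped without any framework: keep a running total in the agent state and gate further LLM calls on a budget. Below is a minimal, framework-agnostic sketch; all names and the budget figure are illustrative, and in LangGraph a function like `should_continue` could back a conditional edge.

```python
def record_cost(state: dict, call_cost: float) -> dict:
    """Return a new state with the latest call's cost added to the total."""
    return {**state, "total_cost": state.get("total_cost", 0.0) + call_cost}

def should_continue(state: dict, budget: float = 0.50) -> str:
    """Route to more LLM/tool calls while under budget, otherwise finish."""
    return "continue" if state.get("total_cost", 0.0) < budget else "finish"

state = {"total_cost": 0.0}
for call_cost in (0.12, 0.20, 0.25):   # simulated per-call API costs
    if should_continue(state) == "finish":
        break
    state = record_cost(state, call_cost)
```

Wired into `workflow.add_conditional_edges(...)`, this would make the graph stop cleanly once the budget is exhausted.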

**Architecture example**:

```python
# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import numpy as np
import operator

# ============================================================================
# 1. Define the state
# ============================================================================

class EvaluationState(TypedDict):
    """State carried through the agent graph"""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str

    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict

    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls

    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None

    # Metadata
    total_cost: float
    reasoning: List[str]


# ============================================================================
# 2. Define the tools (LangChain format)
# ============================================================================

@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)"""
    from shinka.core import run_shinka_eval

    # Delegate to the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"]
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints"""
    from examples.circle_packing.evaluate import adapted_validate_packing

    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii)
    ))

    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database"""
    from shinka.database import ProgramDatabase

    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)

    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric"""
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY

    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}

    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric"""
    from shinka.llm import LLM

    llm = LLM("native-gemini-2.5-pro")

    prompt = f"""
Generate a Python function for a new circle packing evaluation metric.

Purpose: {purpose}
Current context: {context}

Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Use numpy for computations
3. Return MetricResult with name, value, interpretation, description

Generate clean, executable code.
"""

    response = llm.query(prompt)
    # extract_code: project helper (assumed) that pulls the code block out of the reply
    code = extract_code(response.content)

    return {"code": code, "cost": response.cost}


# ============================================================================
# 3. Define the nodes (the agent's behavior)
# ============================================================================

def run_program_node(state: EvaluationState) -> dict:
    """Run the program and collect its results"""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}]
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context"""
    if not state.get("db_path"):
        return {"db_context": {}}

    best = query_historical_best.invoke({"db_path": state["db_path"]})

    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}]
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps"""
    from shinka.llm import LLM

    llm = LLM("native-gemini-2.5-pro", temperature=0.7)

    prompt = f"""
You are evaluating a circle packing program.

Current result:
- Score: {state["program_result"]["score"]}
- Number of circles: {len(state["program_result"]["centers"])}

Historical context:
- Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}

Available tools:
- compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
- generate_new_metric_code: Generate new evaluation metric if needed

Decide which metrics to compute and what analysis to perform.
Output a JSON plan with steps.
"""

    response = llm.query(prompt)
    # parse_plan: project helper (assumed) that parses the JSON plan from the reply
    plan = parse_plan(response.content)

    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost
    }

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the planned auxiliary metrics"""
    metrics = {}

    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"]
            })
            metrics[metric_name] = metric_result

    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}]
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback"""
    from shinka.llm import LLM

    llm = LLM("native-gemini-2.5-pro", temperature=0.7)

    prompt = f"""
Generate evaluation feedback for a circle packing program.

Results:
- Primary score: {state["program_result"]["score"]}
- Auxiliary metrics: {state["auxiliary_metrics"]}
- Historical best: {state["db_context"].get("best_program", {}).get("score")}

Provide:
1. Performance summary
2. Comparison with historical best
3. Specific actionable recommendations
"""

    response = llm.query(prompt)

    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"]
            },
            "text_feedback": response.content
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost
    }

def save_results_node(state: EvaluationState) -> dict:
    """Write results to files (keeps the CLI interface compatible)"""
    from shinka.core.wrap_eval import save_json_results

    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"]
    )

    # Persist the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0)
        }, f, indent=2)

    return {}


# ============================================================================
# 4. Build the graph
# ============================================================================

def create_evaluation_agent():
    """Create the evaluation-agent workflow"""

    workflow = StateGraph(EvaluationState)

    # Add the nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)

    # Define the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)

    return workflow.compile()


# ============================================================================
# 5. Main entry point (CLI-compatible)
# ============================================================================

def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()

    # Create the agent
    agent = create_evaluation_agent()

    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": []
    }

    # Run the agent
    final_state = agent.invoke(initial_state)

    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")


if __name__ == "__main__":
    main()
```

**Integrating with the existing system**:

```python
# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path automatically
        "agent_mode": "adaptive"
    }
)

db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)

runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config
)

runner.run()
```

---

### **Option B: OpenHands (if strong isolation is needed)** ⭐⭐⭐☆☆

**Use this when you need**:
- A fully isolated sandbox environment
- To execute untrusted code
- Complex code generation and testing

**Architecture**:
```python
from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)

    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace
        )

        result = await self.execute(task)
        return result
```

**Downside**: requires a Docker environment and is relatively heavyweight

---

### **Option C: Hybrid (lightweight core, optional heavyweight sandbox)** ⭐⭐⭐⭐☆

**Strategy**:
- Use **LangGraph** for the core (lightweight)
- Optionally call out to **OpenHands** when a sandbox is needed
- Best flexibility

```python
# LangGraph for the core
agent = create_langgraph_agent()

# Individual tools can optionally route through OpenHands
# (use_sandbox, openhands_execute, and safe_local_execute are illustrative placeholders)
@tool
def execute_untrusted_code(code: str) -> dict:
    """Execute potentially untrusted code in the OpenHands sandbox"""
    if use_sandbox:
        return openhands_execute(code)
    else:
        return safe_local_execute(code)
```

---

## 📦 Installing Dependencies

### LangGraph option

```bash
# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0     # if using Claude
langchain-google-genai>=2.0.0  # if using Gemini
langchain-community>=0.3.0

# Install
pip install langgraph langchain-core langchain-google-genai
```

### OpenHands option

```bash
# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai
```

---

## 🎯 Final Recommendation

### **Recommendation: LangGraph ✨**

**Why**:
1. ✅ **Perfect fit**: the graph state-machine design naturally matches an evaluation flow
2. ✅ **Rich tooling**: the LangChain ecosystem provides many ready-made tools
3. ✅ **Lightweight**: no Docker required, easy to deploy
4. ✅ **Observable**: built-in visualization and debugging tools
5. ✅ **Flexible**: easy to switch LLM backends (Gemini/GPT/Claude)
6. ✅ **Production-ready**: maintained by LangChain, with an active community
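Switching LLM backends usually reduces to constructing the right client behind one shared interface. A framework-agnostic sketch of that idea; `StubClient` is an illustrative stand-in, where real code would wrap the Gemini/OpenAI SDKs or LangChain's chat-model classes:

```python
from dataclasses import dataclass

@dataclass
class StubClient:
    """Illustrative stand-in for a real LLM SDK client."""
    model: str

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"

# Map short backend names to client factories
BACKENDS = {
    "gemini": lambda: StubClient("gemini-2.5-pro"),
    "gpt": lambda: StubClient("gpt-4o"),
}

def make_llm(name: str) -> StubClient:
    """Resolve a short backend name to a configured client."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    return BACKENDS[name]()
```

Swapping backends then becomes a one-line config change rather than a code change.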

**Tool-format advantages**:
```python
# LangChain's tool format is a de-facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (used by the LLM to understand the tool)"""
    # implementation
    return result

# JSON Schema is generated automatically
# Argument validation is handled automatically
# The tool is integrated into the agent automatically
```

**Extensibility**:
- Adding a new tool only requires defining a new function
- The `@tool` decorator registers it automatically
- Existing LangChain tools (search, API calls, etc.) can be composed in
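To make the registration idea concrete, here is a pure-Python sketch of the pattern (not LangChain's actual implementation): the decorator records each function in a registry and derives a rough parameter schema from its signature and docstring.

```python
import inspect

TOOL_REGISTRY: dict = {}

def tool(fn):
    """Register a function as a tool, deriving a simple schema."""
    sig = inspect.signature(fn)
    TOOL_REGISTRY[fn.__name__] = {
        "fn": fn,
        "description": (fn.__doc__ or "").strip(),
        "params": {name: getattr(p.annotation, "__name__", "any")
                   for name, p in sig.parameters.items()},
    }
    return fn

@tool
def word_count(text: str) -> dict:
    """Count words in a text snippet."""
    return {"words": len(text.split())}
```

An agent loop can then walk `TOOL_REGISTRY` to tell the LLM which tools exist and validate arguments before dispatching.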

---

## 🚀 Quick Start

```bash
# 1. Install dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Point LocalJobConfig at the new agent
#    (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py
```

---

## 📊 Comparison Summary

| Framework | Fit | Learning curve | Lightweight | Tool ecosystem | Recommendation |
|------|--------|----------|--------|----------|--------|
| **LangGraph** | ⭐⭐⭐⭐⭐ | Medium | ✅ | ⭐⭐⭐⭐⭐ | **🏆 Strongly recommended** |
| OpenHands | ⭐⭐⭐☆☆ | High | ❌ | ⭐⭐⭐☆☆ | Special cases |
| CrewAI | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| AutoGen | ⭐⭐⭐☆☆ | High | ⚠️ | ⭐⭐⭐⭐☆ | Conversational scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | Medium | ✅ | ⭐⭐⭐☆☆ | Enterprise scenarios |
| Custom build | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐☆☆☆ | Learning purposes |

---

With **LangGraph**, you can:
1. ✅ Adopt an industry-standard tool format (the `@tool` decorator)
2. ✅ Leverage the rich LangChain tool ecosystem
3. ✅ Keep the code concise and maintainable
4. ✅ Extend functionality easily
5. ✅ Integrate cleanly with the existing system

**Start building your Evaluation Agent with LangGraph!** 🚀