# Agent Framework Comparison and Selection

## 🎯 Requirements Recap

We need an agent framework to implement the Evaluation Agent. Core requirements:

1. **Tool-calling system**: execute a variety of tools (run programs, query the database, generate code, etc.)
2. **LLM integration**: call multiple LLMs (Gemini, GPT, etc.)
3. **Sandboxed execution**: safely run LLM-generated code
4. **Extensibility**: easy to add new tools
5. **Interface compatibility**: can be wrapped as a command-line tool that writes JSON output files
6. **Cost control**: can cap API call counts and spend

---

## 📊 Open-Source Agent Framework Comparison

### 1. **OpenHands** (formerly OpenDevin)

**Repo**: https://github.com/All-Hands-AI/OpenHands

**Characteristics**:

```
✅ Strengths:
- Focused on code-related tasks (a good fit for evaluation scenarios)
- Built-in sandbox environment (Docker-based)
- Supports multiple LLM backends
- Agents can run bash commands, read/write files, and execute Python
- Powerful tool system

❌ Weaknesses:
- Relatively heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- May be over-engineered for an evaluation scenario
```

**Fit**: ⭐⭐⭐☆☆ (3.5/5)
- Useful if you need complex code generation and execution
- Or an isolated sandbox environment
- But likely too heavy for evaluation tasks

---

### 2. **LangGraph** (LangChain ecosystem)

**Repo**: https://github.com/langchain-ai/langgraph

**Characteristics**:

```
✅ Strengths:
- Graph/state-machine design with clear control flow
- LangChain ecosystem with a rich library of tools
- Lightweight, easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling capabilities

✅ Especially suited to:
- Agents that need complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking
```

**Example architecture**:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define nodes (each node is a tool call or decision point);
# run_shinka_eval, db, and llm are assumed to be defined elsewhere
def run_program(state):
    result = run_shinka_eval(state["program_path"])
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)

agent = workflow.compile()
```

**Fit**: ⭐⭐⭐⭐⭐ (5/5)
- **Strongly recommended!** An excellent match for the evaluation-agent requirements
- Lightweight yet feature-complete
- Clear control flow, easy to debug

---

### 3. **CrewAI**

**Repo**: https://github.com/joaomdmoura/crewAI

**Characteristics**:

```
✅ Strengths:
- Multi-agent collaboration (e.g. dedicated evaluator and analyzer roles)
- Built-in role and task system
- Easy to use

❌ Weaknesses:
- Aimed primarily at multi-agent scenarios
- Likely over-engineered for a single evaluation agent
```

**Fit**: ⭐⭐⭐☆☆ (3/5)
- Worth considering if this later grows into multiple collaborating evaluation agents
- Probably unnecessary for the current single-agent scenario

---

### 4. **AutoGen** (Microsoft)

**Repo**: https://github.com/microsoft/autogen

**Characteristics**:

```
✅ Strengths:
- Backed and well maintained by Microsoft
- Supports multi-agent conversations
- Code-execution environment
- Tool-calling system

❌ Weaknesses:
- Designed primarily around conversational scenarios
- Relatively complex
```

**Fit**: ⭐⭐⭐☆☆ (3/5)

---

### 5. **Semantic Kernel** (Microsoft)

**Repo**: https://github.com/microsoft/semantic-kernel

**Characteristics**:

```
✅ Strengths:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning

❌ Weaknesses:
- Relatively low-level; more to implement yourself
- Comparatively sparse documentation
```

**Fit**: ⭐⭐⭐⭐☆ (4/5)

---

### 6. **Building a Lightweight Agent Framework Ourselves**

**Characteristics**:

```
✅ Strengths:
- Full control
- Lightweight
- Purpose-built
- No extra dependencies

❌ Weaknesses:
- Tool system must be implemented from scratch
- No ready-made patterns to lean on
```

**Fit**: ⭐⭐⭐☆☆ (3/5)

---

## 🏆 Recommended Options

### **Option A: LangGraph (strongly recommended) ⭐⭐⭐⭐⭐**

**Why**:
1. ✅ **Excellent fit**: the graph/state-machine design maps naturally onto an evaluation flow
2. ✅ **Rich tool ecosystem**: LangChain ships many ready-made tools
3. ✅ **Lightweight**: no heavyweight dependencies such as Docker
4. ✅ **Observable**: built-in state tracking and visualization
5. ✅ **Easy to extend**: adding a new tool is just adding a new node
6. ✅ **Cost control**: supports token counting and budget limits

**Architecture example**:

```python
# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import operator

# ============================================================================
# 1. Define the state
# ============================================================================
class EvaluationState(TypedDict):
    """The agent's state."""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str

    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict

    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls

    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None

    # Metadata
    total_cost: float
    reasoning: List[str]

# ============================================================================
# 2.
# Define tools (LangChain format)
# ============================================================================
import numpy as np  # used by the validation and metric tools below

@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)."""
    from shinka.core import run_shinka_eval
    # Reuse the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"],
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints."""
    from examples.circle_packing.evaluate import adapted_validate_packing
    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii),
    ))
    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database."""
    from shinka.database import ProgramDatabase
    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)
    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics,
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric."""
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY
    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}
    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details,
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric."""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro")
    prompt = f"""
Generate a Python function for a new circle packing evaluation metric.

Purpose: {purpose}
Current context: {context}

Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Use numpy for computations
3. Return MetricResult with name, value, interpretation, description

Generate clean, executable code.
"""
    response = llm.query(prompt)
    code = extract_code(response.content)  # assumed helper: extracts the fenced code block
    return {"code": code, "cost": response.cost}

# ============================================================================
# 3. Define nodes (the agent's behavior)
# ============================================================================
def run_program_node(state: EvaluationState) -> dict:
    """Run the program and collect its results."""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}],
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context."""
    if not state.get("db_path"):
        return {"db_context": {}}
    best = query_historical_best.invoke({"db_path": state["db_path"]})
    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}],
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps."""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
You are evaluating a circle packing program.

Current result:
- Score: {state["program_result"]["score"]}
- Number of circles: {len(state["program_result"]["centers"])}

Historical context:
- Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}

Available tools:
- compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
- generate_new_metric_code: Generate new evaluation metric if needed

Decide which metrics to compute and what analysis to perform.
Output a JSON plan with steps.
"""
    response = llm.query(prompt)
    plan = parse_plan(response.content)  # assumed helper: parses the JSON plan
    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the auxiliary metrics from the plan."""
    metrics = {}
    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"],
            })
            metrics[metric_name] = metric_result
    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}],
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback."""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
Generate evaluation feedback for a circle packing program.

Results:
- Primary score: {state["program_result"]["score"]}
- Auxiliary metrics: {state["auxiliary_metrics"]}
- Historical best: {state["db_context"].get("best_program", {}).get("score")}

Provide:
1. Performance summary
2. Comparison with historical best
3. Specific actionable recommendations
"""
    response = llm.query(prompt)
    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"],
            },
            "text_feedback": response.content,
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def save_results_node(state: EvaluationState) -> dict:
    """Save results to disk (keeps the interface compatible)."""
    from shinka.core.wrap_eval import save_json_results
    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"],
    )

    # Also persist the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0),
        }, f, indent=2)
    return {}

# ============================================================================
# 4.
# Build the graph
# ============================================================================
def create_evaluation_agent():
    """Create the evaluation-agent workflow."""
    workflow = StateGraph(EvaluationState)

    # Add the nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)

    # Wire up the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)

    return workflow.compile()

# ============================================================================
# 5. Main entry point (command-line compatible)
# ============================================================================
def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()

    # Create the agent
    agent = create_evaluation_agent()

    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": [],
    }

    # Run the agent
    final_state = agent.invoke(initial_state)

    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")

if __name__ == "__main__":
    main()
```

**Integrating with the existing system**:

```python
# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # the LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path automatically
        "agent_mode": "adaptive",
    },
)

db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)

runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```

---

### **Option B: OpenHands (if strong isolation is required)** ⭐⭐⭐☆☆

**When to use**: if you need
- A fully isolated sandbox environment
- To execute untrusted code
- Complex code generation and testing

**Architecture**:

```python
from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)

    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace,
        )
        result = await self.execute(task)
        return result
```

**Downside**: requires a Docker environment; relatively heavyweight.

---

### **Option C: Hybrid (lightweight core + optional heavyweight sandbox)** ⭐⭐⭐⭐☆

**Strategy**:
- Use **LangGraph** for the core (lightweight)
- Optionally call out to **OpenHands** when a sandbox is needed
- Maximum flexibility

```python
# LangGraph at the core
agent = create_langgraph_agent()

# Individual tools can opt into the OpenHands sandbox
@tool
def execute_untrusted_code(code: str) -> dict:
    """Execute potentially untrusted code in OpenHands sandbox."""
    if use_sandbox:
        return openhands_execute(code)
    else:
        return safe_local_execute(code)
```

---

## 📦 Installing Dependencies

### LangGraph option

```bash
# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0     # if using Claude
langchain-google-genai>=2.0.0  # if using Gemini
langchain-community>=0.3.0

# Install
pip install langgraph langchain-core langchain-google-genai
```

### OpenHands option

```bash
# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai
```

---

## 🎯 Final Recommendation

### **Recommended: LangGraph ✨**

**Why**:
1. ✅ **Excellent fit**: the graph/state-machine design maps naturally onto an evaluation flow
2. ✅ **Rich tooling**: plenty of ready-made tools in the LangChain ecosystem
3. ✅ **Lightweight**: no Docker needed, easy to deploy
4.
✅ **Observable**: built-in visualization and debugging tools
5. ✅ **Flexible**: easy to swap LLM backends (Gemini/GPT/Claude)
6. ✅ **Production-ready**: maintained by the LangChain team, with an active community

**Advantages of the tool format**:

```python
# LangChain's @tool format is a de-facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (used by the LLM to understand the tool)."""
    # implementation
    return result

# JSON Schema is generated automatically
# Argument validation is handled automatically
# The tool plugs straight into the agent
```

**Extensibility**:
- Adding a new tool is just defining a new function
- The `@tool` decorator registers it automatically
- Existing LangChain tools (search, API calls, etc.) can be composed in

---

## 🚀 Quick Start

```bash
# 1. Install dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Point LocalJobConfig at the new agent
#    (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py
```

---

## 📊 Summary Table

| Framework | Fit | Learning curve | Lightweight | Tool ecosystem | Recommendation |
|------|--------|----------|--------|----------|--------|
| **LangGraph** | ⭐⭐⭐⭐⭐ | Medium | ✅ | ⭐⭐⭐⭐⭐ | **🏆 Strongly recommended** |
| OpenHands | ⭐⭐⭐☆☆ | High | ❌ | ⭐⭐⭐☆☆ | Special cases |
| CrewAI | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| AutoGen | ⭐⭐⭐☆☆ | High | ⚠️ | ⭐⭐⭐⭐☆ | Conversational scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | Medium | ✅ | ⭐⭐⭐☆☆ | Enterprise scenarios |
| Self-built | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐☆☆☆ | Learning purposes |

---

With **LangGraph**, you can:
1. ✅ Adopt the de-facto standard tool format (the `@tool` decorator)
2. ✅ Tap into the rich LangChain tool ecosystem
3. ✅ Keep the code concise and maintainable
4. ✅ Extend with new features easily
5. ✅ Integrate cleanly with the existing system

**Go build your Evaluation Agent with LangGraph!** 🚀
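One follow-up on requirement 6 (cost control): the examples in this document only *accumulate* `total_cost`; nothing actually enforces a limit. Below is a minimal, framework-agnostic sketch of a budget guard. It assumes, as in the examples above, a hypothetical LLM client whose responses carry a `cost` attribute; the class name and limits are illustrative, not part of any framework API.

```python
class BudgetExceededError(RuntimeError):
    """Raised when the evaluation run exhausts its LLM budget."""


class BudgetGuard:
    """Tracks cumulative LLM spend and call count; raises once a limit is hit."""

    def __init__(self, max_cost_usd: float, max_calls: int):
        self.max_cost_usd = max_cost_usd
        self.max_calls = max_calls
        self.total_cost = 0.0
        self.calls = 0

    def charge(self, cost_usd: float) -> None:
        """Record one LLM call and its cost, failing fast if a limit is crossed."""
        self.calls += 1
        self.total_cost += cost_usd
        if self.calls > self.max_calls:
            raise BudgetExceededError(f"call limit {self.max_calls} exceeded")
        if self.total_cost > self.max_cost_usd:
            raise BudgetExceededError(
                f"spend ${self.total_cost:.4f} exceeds budget ${self.max_cost_usd:.2f}"
            )


# Usage sketch inside an LLM node (llm.query is the hypothetical client above):
guard = BudgetGuard(max_cost_usd=0.50, max_calls=20)
# response = llm.query(prompt)
# guard.charge(response.cost)   # aborts the run once the budget is spent
```

A guard like this can be created in `main()` and charged inside `llm_planning_node` and `generate_feedback_node`, turning a runaway evaluation into a clean, catchable failure instead of an open-ended bill.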