# Agent Framework Comparison and Selection

## 🎯 Requirements Recap

We need an agent framework to implement the Evaluation Agent. Core requirements:

1. **Tool calling**: execute a variety of tools (run programs, query the database, generate code, etc.)
2. **LLM integration**: call multiple LLMs (Gemini, GPT, etc.)
3. **Sandboxed execution**: safely execute LLM-generated code
4. **Extensibility**: easy to add new tools
5. **Interface compatibility**: can be wrapped as a command-line tool that writes a JSON results file
6. **Cost control**: can cap API call counts and spend

---

## 📊 Open-Source Agent Framework Comparison

### 1. **OpenHands** (formerly OpenDevin)

**Repo**: https://github.com/All-Hands-AI/OpenHands

**Highlights**:
```
✅ Pros:
- Focused on code-centric tasks (a good fit for evaluation scenarios)
- Built-in Docker-based sandbox environment
- Supports multiple LLM backends
- Agents can run bash commands, read/write files, and execute Python
- Powerful tool system

❌ Cons:
- Relatively heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- Likely over-engineered for an evaluation scenario
```

**Fit**: ⭐⭐⭐☆☆ (3.5/5)
- Useful if you need complex code generation and execution
- Useful if you need an isolated sandbox environment
- But likely too heavy for a plain evaluation task

---

### 2. **LangGraph** (LangChain ecosystem)

**Repo**: https://github.com/langchain-ai/langgraph

**Highlights**:
```
✅ Pros:
- Graph state-machine design with a clear flow
- LangChain ecosystem with a rich tool library
- Lightweight and easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling capability

✅ Especially suited to:
- Agents that need complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking
```

**Example architecture**:
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define the nodes (each node is a tool call or decision point;
# run_shinka_eval, db, and llm are project objects assumed to exist)
def run_program(state):
    result = run_shinka_eval(state["program_path"])
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.set_entry_point("run_program")
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)

agent = workflow.compile()
```

**Fit**: ⭐⭐⭐⭐⭐ (5/5)
- **Strongly recommended!** An excellent match for the evaluation-agent requirements
- Lightweight yet feature-complete
- Clear flow that is easy to debug

---

### 3. **CrewAI**

**Repo**: https://github.com/joaomdmoura/crewAI

**Highlights**:
```
✅ Pros:
- Multi-agent collaboration (e.g., dedicated evaluator and analyzer roles)
- Built-in role and task system
- Easy to use

❌ Cons:
- Aimed primarily at multi-agent scenarios
- Likely over-engineered for a single evaluation agent
```

**Fit**: ⭐⭐⭐☆☆ (3/5)
- Worth considering if you later expand to multiple collaborating evaluation agents
- Probably unnecessary for the current single-agent scenario

---

### 4. **AutoGen** (Microsoft)

**Repo**: https://github.com/microsoft/autogen

**Highlights**:
```
✅ Pros:
- Backed and actively maintained by Microsoft
- Supports multi-agent conversations
- Code-execution environment
- Tool-calling system

❌ Cons:
- Designed primarily for conversational scenarios
- Relatively complex
```

**Fit**: ⭐⭐⭐☆☆ (3/5)

---

### 5. **Semantic Kernel** (Microsoft)

**Repo**: https://github.com/microsoft/semantic-kernel

**Highlights**:
```
✅ Pros:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning

❌ Cons:
- Relatively low-level; more must be implemented yourself
- Comparatively sparse documentation
```

**Fit**: ⭐⭐⭐⭐☆ (4/5)

---

### 6. **Custom Lightweight Agent Framework**

**Highlights**:
```
✅ Pros:
- Full control
- Lightweight
- Tailored to the task
- No extra dependencies

❌ Cons:
- The tool system must be built from scratch
- No ready-made patterns to lean on
```

**Fit**: ⭐⭐⭐☆☆ (3/5)

---

## 🏆 Recommended Options

### **Option A: LangGraph (Strongly Recommended) ⭐⭐⭐⭐⭐**

**Why**:
1. ✅ **Fits the requirements**: the graph state-machine design naturally matches an evaluation flow
2. ✅ **Rich tool ecosystem**: LangChain offers many ready-made tools
3. ✅ **Lightweight**: no heavy dependencies such as Docker
4. ✅ **Observable**: built-in state tracking and visualization
5. ✅ **Easy to extend**: adding a new tool only requires defining a new node
6. ✅ **Cost control**: supports token counting and budget limits
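The cost-control point can be prototyped without any framework: keep a running total in the agent state and gate further LLM calls on a budget. Below is a minimal, framework-agnostic sketch; all names and the budget figure are illustrative, and in LangGraph a function like `should_continue` could back a conditional edge.

```python
def record_cost(state: dict, call_cost: float) -> dict:
    """Return a new state with the latest call's cost added to the total."""
    return {**state, "total_cost": state.get("total_cost", 0.0) + call_cost}

def should_continue(state: dict, budget: float = 0.50) -> str:
    """Route to more LLM/tool calls while under budget, otherwise finish."""
    return "continue" if state.get("total_cost", 0.0) < budget else "finish"

state = {"total_cost": 0.0}
for call_cost in (0.12, 0.20, 0.25):   # simulated per-call API costs
    if should_continue(state) == "finish":
        break
    state = record_cost(state, call_cost)
```

Wired into `workflow.add_conditional_edges(...)`, this would make the graph stop cleanly once the budget is exhausted.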

**Architecture example**:

```python
# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import numpy as np
import operator

# ============================================================================
# 1. Define the state
# ============================================================================

class EvaluationState(TypedDict):
    """State carried through the agent graph"""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str

    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict

    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls

    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None

    # Metadata
    total_cost: float
    reasoning: List[str]


# ============================================================================
# 2. Define the tools (LangChain format)
# ============================================================================

@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)"""
    from shinka.core import run_shinka_eval

    # Delegate to the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"]
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints"""
    from examples.circle_packing.evaluate import adapted_validate_packing

    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii)
    ))

    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database"""
    from shinka.database import ProgramDatabase

    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)

    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric"""
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY

    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}

    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric"""
    from shinka.llm import LLM

    llm = LLM("native-gemini-2.5-pro")

    prompt = f"""
Generate a Python function for a new circle packing evaluation metric.

Purpose: {purpose}
Current context: {context}

Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Use numpy for computations
3. Return MetricResult with name, value, interpretation, description

Generate clean, executable code.
"""

    response = llm.query(prompt)
    # extract_code: project helper (assumed) that pulls the code block out of the reply
    code = extract_code(response.content)

    return {"code": code, "cost": response.cost}


# ============================================================================
# 3. Define the nodes (the agent's behavior)
# ============================================================================

def run_program_node(state: EvaluationState) -> dict:
    """Run the program and collect its results"""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}]
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context"""
    if not state.get("db_path"):
        return {"db_context": {}}

    best = query_historical_best.invoke({"db_path": state["db_path"]})

    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}]
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps"""
    from shinka.llm import LLM

    llm = LLM("native-gemini-2.5-pro", temperature=0.7)

    prompt = f"""
You are evaluating a circle packing program.

Current result:
- Score: {state["program_result"]["score"]}
- Number of circles: {len(state["program_result"]["centers"])}

Historical context:
- Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}

Available tools:
- compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
- generate_new_metric_code: Generate new evaluation metric if needed

Decide which metrics to compute and what analysis to perform.
Output a JSON plan with steps.
"""

    response = llm.query(prompt)
    # parse_plan: project helper (assumed) that parses the JSON plan from the reply
    plan = parse_plan(response.content)

    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost
    }

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the planned auxiliary metrics"""
    metrics = {}

    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"]
            })
            metrics[metric_name] = metric_result

    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}]
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback"""
    from shinka.llm import LLM

    llm = LLM("native-gemini-2.5-pro", temperature=0.7)

    prompt = f"""
Generate evaluation feedback for a circle packing program.

Results:
- Primary score: {state["program_result"]["score"]}
- Auxiliary metrics: {state["auxiliary_metrics"]}
- Historical best: {state["db_context"].get("best_program", {}).get("score")}

Provide:
1. Performance summary
2. Comparison with historical best
3. Specific actionable recommendations
"""

    response = llm.query(prompt)

    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"]
            },
            "text_feedback": response.content
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost
    }

def save_results_node(state: EvaluationState) -> dict:
    """Write results to files (keeps the CLI interface compatible)"""
    from shinka.core.wrap_eval import save_json_results

    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"]
    )

    # Persist the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0)
        }, f, indent=2)

    return {}


# ============================================================================
# 4. Build the graph
# ============================================================================

def create_evaluation_agent():
    """Create the evaluation-agent workflow"""

    workflow = StateGraph(EvaluationState)

    # Add the nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)

    # Define the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)

    return workflow.compile()


# ============================================================================
# 5. Main entry point (CLI-compatible)
# ============================================================================

def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()

    # Create the agent
    agent = create_evaluation_agent()

    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": []
    }

    # Run the agent
    final_state = agent.invoke(initial_state)

    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")


if __name__ == "__main__":
    main()
```

**Integrating with the existing system**:

```python
# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path automatically
        "agent_mode": "adaptive"
    }
)

db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)

runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config
)

runner.run()
```

---

### **Option B: OpenHands (if strong isolation is needed)** ⭐⭐⭐☆☆

**Use this when you need**:
- A fully isolated sandbox environment
- To execute untrusted code
- Complex code generation and testing

**Architecture**:
```python
from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)

    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace
        )

        result = await self.execute(task)
        return result
```

**Downside**: requires a Docker environment and is relatively heavyweight

---

### **Option C: Hybrid (lightweight core, optional heavyweight sandbox)** ⭐⭐⭐⭐☆

**Strategy**:
- Use **LangGraph** for the core (lightweight)
- Optionally call out to **OpenHands** when a sandbox is needed
- Best flexibility

```python
# LangGraph for the core
agent = create_langgraph_agent()

# Individual tools can optionally route through OpenHands
# (use_sandbox, openhands_execute, and safe_local_execute are illustrative placeholders)
@tool
def execute_untrusted_code(code: str) -> dict:
    """Execute potentially untrusted code in the OpenHands sandbox"""
    if use_sandbox:
        return openhands_execute(code)
    else:
        return safe_local_execute(code)
```

---

## 📦 Installing Dependencies

### LangGraph option

```bash
# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0     # if using Claude
langchain-google-genai>=2.0.0  # if using Gemini
langchain-community>=0.3.0

# Install
pip install langgraph langchain-core langchain-google-genai
```

### OpenHands option

```bash
# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai
```

---

## 🎯 Final Recommendation

### **Recommendation: LangGraph ✨**

**Why**:
1. ✅ **Perfect fit**: the graph state-machine design naturally matches an evaluation flow
2. ✅ **Rich tooling**: the LangChain ecosystem provides many ready-made tools
3. ✅ **Lightweight**: no Docker required, easy to deploy
4. ✅ **Observable**: built-in visualization and debugging tools
5. ✅ **Flexible**: easy to switch LLM backends (Gemini/GPT/Claude)
6. ✅ **Production-ready**: maintained by LangChain, with an active community
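Switching LLM backends usually reduces to constructing the right client behind one shared interface. A framework-agnostic sketch of that idea; `StubClient` is an illustrative stand-in, where real code would wrap the Gemini/OpenAI SDKs or LangChain's chat-model classes:

```python
from dataclasses import dataclass

@dataclass
class StubClient:
    """Illustrative stand-in for a real LLM SDK client."""
    model: str

    def complete(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"

# Map short backend names to client factories
BACKENDS = {
    "gemini": lambda: StubClient("gemini-2.5-pro"),
    "gpt": lambda: StubClient("gpt-4o"),
}

def make_llm(name: str) -> StubClient:
    """Resolve a short backend name to a configured client."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    return BACKENDS[name]()
```

Swapping backends then becomes a one-line config change rather than a code change.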

**Tool-format advantages**:
```python
# LangChain's tool format is a de-facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (used by the LLM to understand the tool)"""
    # implementation
    return result

# JSON Schema is generated automatically
# Argument validation is handled automatically
# The tool is integrated into the agent automatically
```

**Extensibility**:
- Adding a new tool only requires defining a new function
- The `@tool` decorator registers it automatically
- Existing LangChain tools (search, API calls, etc.) can be composed in
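To make the registration idea concrete, here is a pure-Python sketch of the pattern (not LangChain's actual implementation): the decorator records each function in a registry and derives a rough parameter schema from its signature and docstring.

```python
import inspect

TOOL_REGISTRY: dict = {}

def tool(fn):
    """Register a function as a tool, deriving a simple schema."""
    sig = inspect.signature(fn)
    TOOL_REGISTRY[fn.__name__] = {
        "fn": fn,
        "description": (fn.__doc__ or "").strip(),
        "params": {name: getattr(p.annotation, "__name__", "any")
                   for name, p in sig.parameters.items()},
    }
    return fn

@tool
def word_count(text: str) -> dict:
    """Count words in a text snippet."""
    return {"words": len(text.split())}
```

An agent loop can then walk `TOOL_REGISTRY` to tell the LLM which tools exist and validate arguments before dispatching.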

---

## 🚀 Quick Start

```bash
# 1. Install dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Point LocalJobConfig at the new agent
#    (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py
```

---

## 📊 Comparison Summary

| Framework | Fit | Learning curve | Lightweight | Tool ecosystem | Recommendation |
|------|--------|----------|--------|----------|--------|
| **LangGraph** | ⭐⭐⭐⭐⭐ | Medium | ✅ | ⭐⭐⭐⭐⭐ | **🏆 Strongly recommended** |
| OpenHands | ⭐⭐⭐☆☆ | High | ❌ | ⭐⭐⭐☆☆ | Special cases |
| CrewAI | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| AutoGen | ⭐⭐⭐☆☆ | High | ⚠️ | ⭐⭐⭐⭐☆ | Conversational scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | Medium | ✅ | ⭐⭐⭐☆☆ | Enterprise scenarios |
| Custom build | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐☆☆☆ | Learning purposes |

---

With **LangGraph**, you can:
1. ✅ Adopt an industry-standard tool format (the `@tool` decorator)
2. ✅ Leverage the rich LangChain tool ecosystem
3. ✅ Keep the code concise and maintainable
4. ✅ Extend functionality easily
5. ✅ Integrate cleanly with the existing system

**Start building your Evaluation Agent with LangGraph!** 🚀