
Agent Framework Comparison and Selection

🎯 Requirements Recap

We need an agent framework to implement the Evaluation Agent. Core requirements:

  1. Tool invocation: execute a variety of tools (run programs, query the database, generate code, etc.)
  2. LLM integration: call multiple LLMs (Gemini, GPT, etc.)
  3. Sandboxed execution: run LLM-generated code safely
  4. Extensibility: new tools should be easy to add
  5. Interface compatibility: can be wrapped as a command-line tool that writes a JSON output file
  6. Cost control: can cap the number of API calls and the total cost

📊 Open-Source Agent Framework Comparison

1. OpenHands (formerly OpenDevin)

Repository: https://github.com/All-Hands-AI/OpenHands

Features:

✅ Strengths:
- Focused on code-related tasks (a good fit for evaluation scenarios)
- Built-in sandbox environment (Docker-based)
- Supports multiple LLM backends
- The agent can execute bash commands, read/write files, and run Python
- Powerful tool system

❌ Weaknesses:
- Relatively heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- Likely over-engineered for an evaluation scenario

Suitability: ⭐⭐⭐☆☆ (3.5/5)

  • A good choice if you need complex code generation and execution
  • or an isolated sandbox environment
  • but probably too heavy for a plain evaluation task

2. LangGraph (LangChain ecosystem)

Repository: https://github.com/langchain-ai/langgraph

Features:

✅ Strengths:
- Graph-based state machine: the control flow is explicit
- Rich tooling from the LangChain ecosystem
- Lightweight and easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling support

✅ Especially suitable for:
- Agents with complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking

Example architecture:

from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define nodes (each node is a tool call or a decision point;
# run_shinka_eval, db and llm are assumed to exist elsewhere)
def run_program(state):
    result = run_shinka_eval(state["program_path"])
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.set_entry_point("run_program")
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)

agent = workflow.compile()

Suitability: ⭐⭐⭐⭐⭐ (5/5)

  • Strongly recommended! An excellent match for the evaluation agent's requirements
  • Lightweight yet feature-complete
  • Clear flow, easy to debug
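The loop/branch support is what distinguishes LangGraph from a linear pipeline here, e.g. looping back to re-run a program when validation fails. The routing decision itself is just a function of the state; a standalone sketch (the node labels and retry budget are invented for illustration; in LangGraph the function would be registered via workflow.add_conditional_edges):

```python
MAX_RETRIES = 2  # assumed retry budget

def route_after_validation(state: dict) -> str:
    """Pick the outgoing edge label after a hypothetical validate node."""
    if not state.get("valid", False) and state.get("retries", 0) < MAX_RETRIES:
        return "retry"   # loop back: run the program again
    return "done"        # proceed to analysis

# LangGraph wiring (sketch):
# workflow.add_conditional_edges("validate", route_after_validation,
#                                {"retry": "run_program", "done": "analyze"})
```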

3. CrewAI

Repository: https://github.com/joaomdmoura/crewAI

Features:

✅ Strengths:
- Multi-agent collaboration (dedicated roles such as evaluator, analyzer, etc.)
- Built-in role and task system
- Easy to use

❌ Weaknesses:
- Built primarily for multi-agent scenarios
- Likely over-engineered for a single evaluation agent

Suitability: ⭐⭐⭐☆☆ (3/5)

  • Worth revisiting if this grows into multiple cooperating evaluation agents
  • Not needed for the current single-agent setup

4. Autogen (Microsoft)

Repository: https://github.com/microsoft/autogen

Features:

✅ Strengths:
- Backed and actively maintained by Microsoft
- Supports multi-agent conversations
- Code-execution environment
- Tool-calling system

❌ Weaknesses:
- Designed primarily for conversational scenarios
- Relatively complex

Suitability: ⭐⭐⭐☆☆ (3/5)


5. Semantic Kernel (Microsoft)

Repository: https://github.com/microsoft/semantic-kernel

Features:

✅ Strengths:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning

❌ Weaknesses:
- Relatively low-level; more must be implemented yourself
- Documentation is comparatively sparse

Suitability: ⭐⭐⭐⭐☆ (4/5)


6. Building Our Own Lightweight Agent Framework

Features:

✅ Strengths:
- Full control
- Lightweight
- Purpose-built
- No extra dependencies

❌ Weaknesses:
- The tool system must be implemented from scratch
- No ready-made patterns to lean on

Suitability: ⭐⭐⭐☆☆ (3/5)


🏆 Recommended Options

Option A: LangGraph (strongly recommended) ⭐⭐⭐⭐⭐

Rationale:

  1. Perfect fit: the graph state-machine design maps naturally onto the evaluation flow
  2. Rich tool ecosystem: LangChain provides many ready-made tools
  3. Lightweight: no heavyweight dependencies such as Docker
  4. Observable: built-in state tracing and visualization
  5. Easy to extend: adding a tool is just defining a new node
  6. Cost control: supports token counting and budget limits
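Point 6 can also be enforced independently of the framework by accumulating per-call cost and aborting once a hard limit is crossed. A minimal sketch (the BudgetExceeded exception and the budget figure are assumptions, not part of any existing API):

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated LLM spend crosses the budget."""

class CostTracker:
    """Accumulates per-call API cost and enforces a hard budget."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.total = 0.0

    def add(self, cost: float) -> float:
        """Record one call's cost; raise if the budget is now exceeded."""
        self.total += cost
        if self.total > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.total:.4f} > budget ${self.budget_usd:.4f}")
        return self.total
```

Each LLM-calling node would call tracker.add(response.cost) instead of summing total_cost by hand.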

Architecture example:

# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import operator

# ============================================================================
# 1. State definition
# ============================================================================

class EvaluationState(TypedDict):
    """The agent's state."""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str
    
    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict
    
    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls
    
    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None
    
    # Metadata
    total_cost: float
    reasoning: List[str]

# ============================================================================
# 2. Tool definitions (LangChain format)
# ============================================================================

@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)"""
    from shinka.core import run_shinka_eval
    
    # Delegate to the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"]
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints"""
    import numpy as np
    from examples.circle_packing.evaluate import adapted_validate_packing
    
    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii)
    ))
    
    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database"""
    from shinka.database import ProgramDatabase
    
    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)
    
    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric"""
    import numpy as np
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY
    
    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}
    
    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric"""
    from shinka.llm import LLM
    
    llm = LLM("native-gemini-2.5-pro")
    
    prompt = f"""
    Generate a Python function for a new circle packing evaluation metric.
    
    Purpose: {purpose}
    Current context: {context}
    
    Requirements:
    1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
    2. Use numpy for computations
    3. Return MetricResult with name, value, interpretation, description
    
    Generate clean, executable code.
    """
    
    response = llm.query(prompt)
    code = extract_code(response.content)
    
    return {"code": code, "cost": response.cost}
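extract_code above is not defined in this excerpt; a plausible stdlib-only stand-in, assuming the LLM wraps its answer in a markdown fence, might be:

```python
import re

def extract_code(text: str) -> str:
    """Return the contents of the first fenced code block, or the raw text."""
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()
```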


# ============================================================================
# 3. Node definitions (agent behavior)
# ============================================================================

def run_program_node(state: EvaluationState) -> dict:
    """Run the program and collect its results."""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}]
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context."""
    if not state.get("db_path"):
        return {"db_context": {}}
    
    best = query_historical_best.invoke({"db_path": state["db_path"]})
    
    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}]
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps."""
    from shinka.llm import LLM
    
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    
    prompt = f"""
    You are evaluating a circle packing program.
    
    Current result:
    - Score: {state["program_result"]["score"]}
    - Number of circles: {len(state["program_result"]["centers"])}
    
    Historical context:
    - Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}
    
    Available tools:
    - compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
    - generate_new_metric_code: Generate new evaluation metric if needed
    
    Decide which metrics to compute and what analysis to perform.
    Output a JSON plan with steps.
    """
    
    response = llm.query(prompt)
    plan = parse_plan(response.content)
    
    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost
    }
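parse_plan above is likewise undefined in this excerpt; a tolerant stdlib-only sketch, assuming the LLM answers with a JSON list of step names, might be:

```python
import json
import re

def parse_plan(text: str) -> list:
    """Extract a JSON list of plan steps from an LLM response; [] on failure."""
    match = re.search(r"\[.*?\]", text, re.DOTALL)
    if not match:
        return []
    try:
        plan = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return plan if isinstance(plan, list) else []
```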

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the planned auxiliary metrics."""
    metrics = {}
    
    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"]
            })
            metrics[metric_name] = metric_result
    
    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}]
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback."""
    from shinka.llm import LLM
    
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    
    prompt = f"""
    Generate evaluation feedback for a circle packing program.
    
    Results:
    - Primary score: {state["program_result"]["score"]}
    - Auxiliary metrics: {state["auxiliary_metrics"]}
    - Historical best: {state["db_context"].get("best_program", {}).get("score")}
    
    Provide:
    1. Performance summary
    2. Comparison with historical best
    3. Specific actionable recommendations
    """
    
    response = llm.query(prompt)
    
    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"]
            },
            "text_feedback": response.content
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost
    }

def save_results_node(state: EvaluationState) -> dict:
    """Save results to disk (keeps the CLI interface compatible)."""
    from shinka.core.wrap_eval import save_json_results
    
    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"]
    )
    
    # Persist the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0)
        }, f, indent=2)
    
    return {}


# ============================================================================
# 4. Graph construction
# ============================================================================

def create_evaluation_agent():
    """Build the evaluation agent workflow."""
    
    workflow = StateGraph(EvaluationState)
    
    # Add nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)
    
    # Wire the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)
    
    return workflow.compile()


# ============================================================================
# 5. Entry point (CLI-compatible)
# ============================================================================

def main():
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()
    
    # Build the agent
    agent = create_evaluation_agent()
    
    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": []
    }
    
    # Run the agent
    final_state = agent.invoke(initial_state)
    
    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")


if __name__ == "__main__":
    main()

Integration with the existing system:

# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path automatically
        "agent_mode": "adaptive"
    }
)

db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)

evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)

runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config
)

runner.run()

Option B: OpenHands (if strong isolation is required) ⭐⭐⭐☆☆

Use this option when you need:

  • A fully isolated sandbox environment
  • To execute untrusted code
  • Complex code generation and testing

Architecture:

from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)
        
    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace
        )
        
        result = await self.execute(task)
        return result

Downside: requires a Docker environment; relatively heavyweight


Option C: Hybrid (lightweight core + optional sandbox) ⭐⭐⭐⭐☆

Strategy:

  • Use LangGraph for the core (lightweight)
  • Optionally call into OpenHands when a sandbox is needed
  • Best flexibility

# LangGraph for the core
agent = create_langgraph_agent()

# Individual tools can opt into the OpenHands sandbox
@tool
def execute_untrusted_code(code: str, use_sandbox: bool = False) -> dict:
    """Execute potentially untrusted code in the OpenHands sandbox"""
    if use_sandbox:
        return openhands_execute(code)
    else:
        return safe_local_execute(code)

📦 Installing Dependencies

LangGraph option

# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0  # if using Claude
langchain-google-genai>=2.0.0  # if using Gemini
langchain-community>=0.3.0

# install
pip install langgraph langchain-core langchain-google-genai

OpenHands option

# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai

🎯 Final Recommendation

Recommendation: LangGraph ✨

Rationale:

  1. Perfect fit: the graph state-machine design maps naturally onto the evaluation flow
  2. Rich tooling: the LangChain ecosystem offers many ready-made tools
  3. Lightweight: no Docker required, easy to deploy
  4. Observable: built-in visualization and debugging tools
  5. Flexible: LLM backends (Gemini/GPT/Claude) can be swapped easily
  6. Production-ready: maintained by the LangChain team, with an active community

Tool-format advantages:

# LangChain's tool format is a de-facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (read by the LLM)"""
    # implementation
    return result

# JSON Schema is generated automatically
# Argument validation is handled automatically
# Integrates into the agent automatically

Extensibility:

  • Adding a new tool is just defining a new function
  • The @tool decorator registers it automatically
  • Existing LangChain tools (search, API calls, etc.) can be composed in
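The METRIC_REGISTRY looked up by compute_auxiliary_metric is, at its simplest, a name-to-function mapping. A decorator-based sketch of how such a registry could be built (this is illustrative, not the repo's actual implementation; sum_of_radii is a toy metric):

```python
from typing import Callable, Dict

METRIC_REGISTRY: Dict[str, Callable] = {}

def register_metric(name: str):
    """Decorator that registers a metric function under the given name."""
    def decorator(func: Callable) -> Callable:
        METRIC_REGISTRY[name] = func
        return func
    return decorator

@register_metric("sum_of_radii")
def sum_of_radii(centers, radii) -> float:
    # Toy metric: total of all circle radii (illustrative only).
    return float(sum(radii))
```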

🚀 Quick Start

# 1. Install dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Point LocalJobConfig at the new agent
# (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py

📊 Summary Table

| Framework | Suitability | Learning curve | Tool ecosystem | Recommendation |
|---|---|---|---|---|
| LangGraph | ⭐⭐⭐⭐⭐ | – | ⭐⭐⭐⭐⭐ | 🏆 Strongly recommended |
| OpenHands | ⭐⭐⭐☆☆ | – | ⭐⭐⭐☆☆ | Special cases |
| CrewAI | ⭐⭐⭐☆☆ | – | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| Autogen | ⭐⭐⭐☆☆ | ⚠️ | ⭐⭐⭐⭐☆ | Conversational scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | – | ⭐⭐⭐☆☆ | Enterprise scenarios |
| Self-built | ⭐⭐⭐☆☆ | – | ⭐⭐☆☆☆ | Learning projects |

With LangGraph you can:

  1. ✅ Adopt the industry-standard tool format (the @tool decorator)
  2. ✅ Leverage the rich LangChain tool ecosystem
  3. ✅ Keep the code concise and maintainable
  4. ✅ Extend functionality easily
  5. ✅ Integrate cleanly into the existing system

Start building your Evaluation Agent with LangGraph! 🚀