# Agent Framework Comparison and Selection
## 🎯 Requirements Recap
We need an agent framework to implement the Evaluation Agent. Core requirements:
1. **Tool-calling system**: execute a variety of tools (run programs, query the database, generate code, etc.)
2. **LLM integration**: call multiple LLMs (Gemini, GPT, etc.)
3. **Sandboxed execution**: safely execute LLM-generated code
4. **Extensibility**: easy to add new tools
5. **Interface compatibility**: can be wrapped as a command-line tool that writes JSON output files
6. **Cost control**: can cap API call counts and spend
---
## 📊 Open-Source Agent Framework Comparison
### 1. **OpenHands** (formerly OpenDevin)
**Website**: https://github.com/All-Hands-AI/OpenHands
**Highlights**:

✅ Strengths:
- Focused on code-centric tasks (a good fit for evaluation scenarios)
- Built-in sandbox environment (Docker-based)
- Supports multiple LLM backends
- Agents can run bash commands, read/write files, and execute Python
- Powerful tool system

❌ Weaknesses:
- Relatively heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- Likely over-engineered for an evaluation scenario

**Suitability**: ⭐⭐⭐☆☆ (3.5/5)
- Good if you need complex code generation and execution
- Good if you need an isolated sandbox environment
- But likely too heavy for the evaluation task
---
### 2. **LangGraph** (LangChain ecosystem)
**Website**: https://github.com/langchain-ai/langgraph
**Highlights**:

✅ Strengths:
- Graph/state-machine design with a clear control flow
- LangChain ecosystem with a rich tool library
- Lightweight and easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling capabilities

✅ Especially suited for:
- Agents that need complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking
**Example architecture**:
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define nodes (each node is a tool call or a decision point)
def run_program(state):
    result = run_shinka_eval(state["program_path"])  # existing evaluation entry point
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()  # `db` is supplied by the surrounding code
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.set_entry_point("run_program")
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)
agent = workflow.compile()
```
**Suitability**: ⭐⭐⭐⭐⭐ (5/5)
- **Strongly recommended!** A near-perfect fit for the evaluation agent's needs
- Lightweight yet feature-complete
- Clear flow, easy to debug
---
### 3. **CrewAI**
**Website**: https://github.com/joaomdmoura/crewAI
**Highlights**:

✅ Strengths:
- Multi-agent collaboration (e.g., dedicated evaluator and analyzer roles)
- Built-in role and task system
- Easy to use

❌ Weaknesses:
- Designed mainly for multi-agent scenarios
- Likely over-engineered for a single evaluation agent

**Suitability**: ⭐⭐⭐☆☆ (3/5)
- Useful if this later grows into multiple collaborating evaluation agents
- Probably unnecessary for the current single-agent scenario
---
### 4. **Autogen** (Microsoft)
**Website**: https://github.com/microsoft/autogen
**Highlights**:

✅ Strengths:
- Backed by Microsoft, well maintained
- Supports multi-agent conversations
- Code execution environment
- Tool-calling system

❌ Weaknesses:
- Designed mainly for conversational scenarios
- Relatively complex

**Suitability**: ⭐⭐⭐☆☆ (3/5)
---
### 5. **Semantic Kernel** (Microsoft)
**Website**: https://github.com/microsoft/semantic-kernel
**Highlights**:

✅ Strengths:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning

❌ Weaknesses:
- Relatively low-level; more must be built yourself
- Comparatively sparse documentation

**Suitability**: ⭐⭐⭐⭐☆ (4/5)
---
### 6. **Building a Lightweight Agent Framework Ourselves**
**Highlights**:

✅ Strengths:
- Full control
- Lightweight
- Purpose-built
- No extra dependencies

❌ Weaknesses:
- The tool system must be implemented from scratch
- No established patterns to lean on

**Suitability**: ⭐⭐⭐☆☆ (3/5)
---
## 🏆 Recommended Approaches
### **Option A: LangGraph (strongly recommended) ⭐⭐⭐⭐⭐**
**Rationale**:
1. ✅ **Great fit**: the graph/state-machine design maps naturally onto the evaluation flow
2. ✅ **Rich tool ecosystem**: LangChain offers many ready-made tools
3. ✅ **Lightweight**: no heavyweight dependencies such as Docker
4. ✅ **Observable**: built-in state tracking and visualization
5. ✅ **Easy to extend**: adding a tool is just defining a new node
6. ✅ **Cost control**: supports token counting, on top of which a budget cap can be built
**Architecture example**:
```python
# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import operator
import numpy as np

# ============================================================================
# 1. Define the state
# ============================================================================
class EvaluationState(TypedDict):
    """The agent's state"""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str
    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict
    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls
    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None
    # Metadata
    total_cost: float
    reasoning: List[str]
# ============================================================================
# 2. Define tools (LangChain format)
# ============================================================================
@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)"""
    from shinka.core import run_shinka_eval
    # Call the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"],
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints"""
    from examples.circle_packing.evaluate import adapted_validate_packing
    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii),
    ))
    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database"""
    from shinka.database import ProgramDatabase
    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)
    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics,
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric"""
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY
    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}
    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details,
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro")
    prompt = f"""
Generate a Python function for a new circle packing evaluation metric.
Purpose: {purpose}
Current context: {context}
Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Use numpy for computations
3. Return MetricResult with name, value, interpretation, description
Generate clean, executable code.
"""
    response = llm.query(prompt)
    code = extract_code(response.content)  # extract_code: helper that pulls the code block out of the reply
    return {"code": code, "cost": response.cost}
# ============================================================================
# 3. Define nodes (the agent's behavior)
# ============================================================================
def run_program_node(state: EvaluationState) -> dict:
    """Run the program and collect its results"""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}],
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context"""
    if not state.get("db_path"):
        return {"db_context": {}}
    best = query_historical_best.invoke({"db_path": state["db_path"]})
    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}],
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
You are evaluating a circle packing program.
Current result:
- Score: {state["program_result"]["score"]}
- Number of circles: {len(state["program_result"]["centers"])}
Historical context:
- Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}
Available tools:
- compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
- generate_new_metric_code: Generate new evaluation metric if needed
Decide which metrics to compute and what analysis to perform.
Output a JSON plan with steps.
"""
    response = llm.query(prompt)
    plan = parse_plan(response.content)  # parse_plan: helper that parses the JSON plan out of the reply
    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the planned auxiliary metrics"""
    metrics = {}
    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"],
            })
            metrics[metric_name] = metric_result
    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}],
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
Generate evaluation feedback for a circle packing program.
Results:
- Primary score: {state["program_result"]["score"]}
- Auxiliary metrics: {state["auxiliary_metrics"]}
- Historical best: {state["db_context"].get("best_program", {}).get("score")}
Provide:
1. Performance summary
2. Comparison with historical best
3. Specific actionable recommendations
"""
    response = llm.query(prompt)
    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"],
            },
            "text_feedback": response.content,
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def save_results_node(state: EvaluationState) -> dict:
    """Save results to disk (keeps the interface compatible)"""
    from shinka.core.wrap_eval import save_json_results
    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"],
    )
    # Save the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0),
        }, f, indent=2)
    return {}
# ============================================================================
# 4. Build the graph
# ============================================================================
def create_evaluation_agent():
    """Create the evaluation agent workflow"""
    workflow = StateGraph(EvaluationState)
    # Add the nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)
    # Wire up the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)
    return workflow.compile()

# ============================================================================
# 5. Main entry point (command-line compatible)
# ============================================================================
def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()
    # Create the agent
    agent = create_evaluation_agent()
    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": [],
    }
    # Run the agent
    final_state = agent.invoke(initial_state)
    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")

if __name__ == "__main__":
    main()
```
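The `total_cost` field in the sketch only records spend; to actually enforce requirement 6 (cost control), a small budget guard can be threaded through the LLM-calling nodes. This is a stdlib-only sketch under the assumption that each LLM response exposes a `cost` figure, as in the code above; the class and exception names are made up for illustration:

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated LLM spend passes the configured cap."""


class CostBudget:
    """Accumulates per-call costs and fails fast once a cap is reached (hypothetical helper)."""

    def __init__(self, max_cost_usd: float):
        self.max_cost_usd = max_cost_usd
        self.spent = 0.0

    def charge(self, cost: float) -> float:
        """Record one LLM call's cost; raise if the budget is exhausted."""
        self.spent += cost
        if self.spent > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.4f} of a ${self.max_cost_usd:.2f} budget"
            )
        return self.spent
```

Each node that calls the LLM would then run `budget.charge(response.cost)` right after `llm.query(...)`, and the graph could route to `save_results` early when `BudgetExceeded` is caught, so partial results are still written.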
**Integration with the existing system**:
```python
# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path through automatically
        "agent_mode": "adaptive",
    },
)
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)
evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```
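The planning node in the agent sketch calls a `parse_plan` helper that is left undefined there. One stdlib-only way to implement it, under the assumption that the LLM was asked to output a JSON plan and may or may not wrap it in a code fence:

```python
import json
import re


def parse_plan(reply: str) -> list:
    """Extract a JSON plan (a list of step names) from an LLM reply.

    Accepts either a bare JSON array/object or one wrapped in a ``` fence.
    Returns an empty list when nothing parseable is found.
    """
    fenced = re.search(r"```(?:json)?\s*(.*?)```", reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return []
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return data.get("steps", [])
    return []
```

Falling back to an empty plan (rather than raising) keeps the graph moving even when the model's output is malformed; `execute_metrics_node` then simply computes nothing.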
---
### **Option B: OpenHands (if strong isolation is needed)** ⭐⭐⭐☆☆
**When to use**: you need:
- A fully isolated sandbox environment
- To execute untrusted code
- Complex code generation and testing
**Architecture**:
```python
from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)

    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace,
        )
        result = await self.execute(task)
        return result
```
**Downside**: requires a Docker environment; relatively heavyweight.
---
### **Option C: Hybrid (lightweight core + optional heavyweight sandbox)** ⭐⭐⭐⭐☆
**Strategy**:
- Use **LangGraph** for the core (lightweight)
- Optionally call out to **OpenHands** when a sandbox is needed
- Best flexibility
```python
# The core is LangGraph
agent = create_langgraph_agent()

# Individual tools can opt into the OpenHands sandbox
@tool
def execute_untrusted_code(code: str) -> dict:
    """Execute potentially untrusted code in OpenHands sandbox"""
    if use_sandbox:  # config flag; openhands_execute / safe_local_execute are helpers to supply
        return openhands_execute(code)
    else:
        return safe_local_execute(code)
```
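For the non-sandboxed branch, a `safe_local_execute` along these lines is one possible stdlib-only implementation: run the snippet in a child process with a hard timeout so a hung or runaway script cannot stall the agent. The function name matches the sketch above, but the behavior is an assumption, and a plain subprocess is not a real security boundary against hostile code (that is what the OpenHands branch is for):

```python
import subprocess
import sys


def safe_local_execute(code: str, timeout_s: float = 10.0) -> dict:
    """Run a Python snippet in a child process with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }
```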
---
## 📦 Installing Dependencies
### LangGraph option
```bash
# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0    # if using Claude
langchain-google-genai>=2.0.0 # if using Gemini
langchain-community>=0.3.0

# Install
pip install langgraph langchain-core langchain-google-genai
```
### OpenHands option
```bash
# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai
```
---
## 🎯 Final Recommendation
### **Recommendation: LangGraph ✨**
**Rationale**:
1. ✅ **Great fit**: the graph/state-machine design maps naturally onto the evaluation flow
2. ✅ **Rich tooling**: the LangChain ecosystem offers many ready-made tools
3. ✅ **Lightweight**: no Docker required, easy to deploy
4. ✅ **Observable**: built-in visualization and debugging tools
5. ✅ **Flexible**: easy to swap LLM backends (Gemini/GPT/Claude)
6. ✅ **Production-ready**: maintained by the LangChain team, with an active community
**Tool-format advantages**:
```python
# LangChain's tool format is a de facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (used by the LLM to understand the tool)"""
    # Implementation
    return result

# A JSON schema is generated automatically
# Argument validation is handled automatically
# The tool integrates into the agent automatically
```
**Extensibility**:
- Adding a new tool is just defining a new function
- The `@tool` decorator registers it automatically
- Existing LangChain tools (search, API calls, etc.) can be composed in
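The registration pattern those bullets describe does not require LangChain to understand. A stdlib sketch of a decorator-driven tool registry (the names and the toy metric here are illustrative, not LangChain's API):

```python
import inspect

TOOL_REGISTRY: dict = {}  # tool name -> {"fn": ..., "doc": ..., "params": ...}


def registered_tool(fn):
    """Register a function as a tool, capturing its docstring and parameter names."""
    TOOL_REGISTRY[fn.__name__] = {
        "fn": fn,
        "doc": inspect.getdoc(fn),
        "params": list(inspect.signature(fn).parameters),
    }
    return fn


@registered_tool
def packing_efficiency(total_radii: float, container_radius: float = 1.0) -> float:
    """Toy metric: ratio of summed radii to the container radius."""
    return total_radii / container_radius
```

LangChain's `@tool` additionally derives a JSON schema from the type hints so the LLM can emit valid arguments; the registry above only captures the pieces needed to see how that mechanism works.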
---
## 🚀 Quick Start
```bash
# 1. Install dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Point LocalJobConfig at the new agent
#    (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py
```
---
## 📊 Summary Table
| Framework | Fit | Learning curve | Lightweight | Tool ecosystem | Recommendation |
|------|--------|----------|--------|----------|--------|
| **LangGraph** | ⭐⭐⭐⭐⭐ | Medium | ✅ | ⭐⭐⭐⭐⭐ | **🏆 Strongly recommended** |
| OpenHands | ⭐⭐⭐☆☆ | High | ❌ | ⭐⭐⭐☆☆ | Special cases |
| CrewAI | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| Autogen | ⭐⭐⭐☆☆ | High | ⚠️ | ⭐⭐⭐⭐☆ | Conversational scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | Medium | ✅ | ⭐⭐⭐☆☆ | Enterprise settings |
| Build-your-own | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐☆☆☆ | Learning exercise |
---
With **LangGraph**, you can:
1. ✅ Adopt the de facto standard tool format (the `@tool` decorator)
2. ✅ Tap into the rich LangChain tool ecosystem
3. ✅ Keep the code concise and maintainable
4. ✅ Extend functionality easily
5. ✅ Integrate cleanly with the existing system
**Start building your Evaluation Agent with LangGraph!** 🚀