Agent Framework Comparison and Selection
🎯 Requirements Recap
We need an agent framework to implement the Evaluation Agent. Core requirements:
- Tool-calling system: execute a variety of tools (run programs, query the database, generate code, etc.)
- LLM integration: call multiple LLMs (Gemini, GPT, etc.)
- Sandboxed execution: run LLM-generated code safely
- Extensibility: new tools are easy to add
- Interface compatibility: can be wrapped as a command-line tool that writes JSON output files
- Cost control: can cap the number of API calls and the total spend
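The cost-control requirement can be met outside any framework with a small budget guard wrapped around each LLM call. A minimal stdlib sketch (class and method names here are illustrative, not from any of the frameworks below):

```python
class BudgetExceededError(RuntimeError):
    """Raised when an LLM call would exceed the configured budget."""


class BudgetGuard:
    """Track API call counts and spend; refuse calls past either limit.

    A hypothetical sketch: frameworks offer token-counting callbacks,
    but a plain wrapper like this works with any of them.
    """

    def __init__(self, max_calls: int, max_cost_usd: float):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one API call; raise before crossing either limit."""
        if (self.calls + 1 > self.max_calls
                or self.cost_usd + cost_usd > self.max_cost_usd):
            raise BudgetExceededError(
                f"budget exceeded: {self.calls} calls, ${self.cost_usd:.4f} spent"
            )
        self.calls += 1
        self.cost_usd += cost_usd


guard = BudgetGuard(max_calls=10, max_cost_usd=0.50)
guard.charge(0.12)  # first call: within budget
guard.charge(0.20)  # second call: within budget
```

Calling `charge` before each LLM request turns a silent cost overrun into an explicit, catchable error.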
📊 Open-Source Agent Framework Comparison
1. OpenHands (formerly OpenDevin)
Repository: https://github.com/All-Hands-AI/OpenHands
Highlights:
✅ Strengths:
- Focused on code-centric tasks (a good match for evaluation scenarios)
- Built-in sandbox environment (Docker-based)
- Supports multiple LLM backends
- The agent can run bash commands, read/write files, and execute Python
- Powerful tool system
❌ Weaknesses:
- Fairly heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- Likely over-engineered for an evaluation scenario
Fit: ⭐⭐⭐☆☆ (3.5/5)
- Good if you need complex code generation and execution
- Good if you need an isolated sandbox
- But probably too heavy for a pure evaluation task
2. LangGraph (LangChain ecosystem)
Repository: https://github.com/langchain-ai/langgraph
Highlights:
✅ Strengths:
- State-graph design keeps the workflow explicit
- Part of the LangChain ecosystem, so tools are plentiful
- Lightweight and easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling capabilities
✅ Especially suited to:
- Agents with complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking
Example architecture:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define the nodes (each node is a tool call or a decision point)
def run_program(state):
    result = run_shinka_eval(state["program_path"])
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.set_entry_point("run_program")
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)
agent = workflow.compile()
```
Fit: ⭐⭐⭐⭐⭐ (5/5)
- Strongly recommended! A near-perfect match for the evaluation agent's needs
- Lightweight yet feature-complete
- Clear flow, easy to debug
3. CrewAI
Repository: https://github.com/joaomdmoura/crewAI
Highlights:
✅ Strengths:
- Multi-agent collaboration (dedicated evaluator, analyzer roles, etc.)
- Built-in role and task system
- Easy to use
❌ Weaknesses:
- Geared toward multi-agent scenarios
- Over-engineered for a single evaluation agent
Fit: ⭐⭐⭐☆☆ (3/5)
- Worth revisiting if the system later grows into several cooperating evaluation agents
- For the current single-agent scenario it is probably unnecessary
4. Autogen (Microsoft)
Repository: https://github.com/microsoft/autogen
Highlights:
✅ Strengths:
- Backed by Microsoft and well maintained
- Supports multi-agent conversations
- Code-execution environment
- Tool-calling system
❌ Weaknesses:
- Designed primarily around conversational workflows
- Relatively complex
Fit: ⭐⭐⭐☆☆ (3/5)
5. Semantic Kernel (Microsoft)
Repository: https://github.com/microsoft/semantic-kernel
Highlights:
✅ Strengths:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning
❌ Weaknesses:
- Relatively low-level; more has to be implemented yourself
- Documentation is comparatively sparse
Fit: ⭐⭐⭐⭐☆ (4/5)
6. A Self-Built Lightweight Agent Framework
Highlights:
✅ Strengths:
- Full control
- Lightweight
- Purpose-built
- No extra dependencies
❌ Weaknesses:
- The tool system must be implemented from scratch
- No ready-made patterns to lean on
Fit: ⭐⭐⭐☆☆ (3/5)
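To make the self-built trade-off concrete, here is a rough stdlib sketch of the kind of tool system a hand-rolled framework has to implement itself (all names are illustrative):

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """A minimal hand-rolled tool system: register functions by name,
    dispatch calls, and keep a trace of every invocation.

    Everything a framework gives for free (schemas, validation,
    retries, streaming) would have to be layered on top of this.
    """

    def __init__(self):
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.trace: list = []

    def register(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Decorator: register a function as a tool under its own name."""
        self._tools[fn.__name__] = fn
        return fn

    def call(self, name: str, **kwargs: Any) -> Any:
        """Dispatch a tool call by name and log it for later inspection."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        result = self._tools[name](**kwargs)
        self.trace.append({"tool": name, "args": kwargs, "result": result})
        return result


registry = ToolRegistry()


@registry.register
def add_metric(a: float, b: float) -> float:
    return a + b


total = registry.call("add_metric", a=1.0, b=2.0)  # -> 3.0
```

Even this tiny core shows why "no ready-made patterns" matters: schema generation, argument validation, and LLM-facing descriptions are all still missing.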
🏆 Recommended Options
Option A: LangGraph (strongly recommended) ⭐⭐⭐⭐⭐
Reasons:
- ✅ Perfect fit: the state-graph design maps naturally onto the evaluation flow
- ✅ Rich tool ecosystem: LangChain ships many ready-made tools
- ✅ Lightweight: no heavyweight dependencies such as Docker
- ✅ Highly observable: built-in state tracking and visualization
- ✅ Easy to extend: adding a tool is just defining a new node
- ✅ Cost control: supports token counting and budget limits
Architecture example:
```python
# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import operator
import numpy as np

# ============================================================================
# 1. Define the state
# ============================================================================
class EvaluationState(TypedDict):
    """The agent's state"""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str
    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict
    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls
    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None
    # Metadata
    total_cost: float
    reasoning: List[str]

# ============================================================================
# 2. Define the tools (LangChain format)
# ============================================================================
@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)"""
    from shinka.core import run_shinka_eval
    # Call the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"],
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints"""
    from examples.circle_packing.evaluate import adapted_validate_packing
    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii),
    ))
    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database"""
    from shinka.database import ProgramDatabase
    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)
    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics,
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric"""
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY
    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}
    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details,
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro")
    prompt = f"""
    Generate a Python function for a new circle packing evaluation metric.
    Purpose: {purpose}
    Current context: {context}
    Requirements:
    1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
    2. Use numpy for computations
    3. Return MetricResult with name, value, interpretation, description
    Generate clean, executable code.
    """
    response = llm.query(prompt)
    code = extract_code(response.content)  # helper that pulls the code block out of the reply
    return {"code": code, "cost": response.cost}

# ============================================================================
# 3. Define the nodes (the agent's behavior)
# ============================================================================
def run_program_node(state: EvaluationState) -> dict:
    """Execute the program and collect results"""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}],
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context"""
    if not state.get("db_path"):
        return {"db_context": {}}
    best = query_historical_best.invoke({"db_path": state["db_path"]})
    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}],
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
    You are evaluating a circle packing program.
    Current result:
    - Score: {state["program_result"]["score"]}
    - Number of circles: {len(state["program_result"]["centers"])}
    Historical context:
    - Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}
    Available tools:
    - compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
    - generate_new_metric_code: Generate new evaluation metric if needed
    Decide which metrics to compute and what analysis to perform.
    Output a JSON plan with steps.
    """
    response = llm.query(prompt)
    plan = parse_plan(response.content)  # helper that parses the JSON plan from the reply
    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the planned auxiliary metrics"""
    metrics = {}
    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"],
            })
            metrics[metric_name] = metric_result
    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}],
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
    Generate evaluation feedback for a circle packing program.
    Results:
    - Primary score: {state["program_result"]["score"]}
    - Auxiliary metrics: {state["auxiliary_metrics"]}
    - Historical best: {state["db_context"].get("best_program", {}).get("score")}
    Provide:
    1. Performance summary
    2. Comparison with historical best
    3. Specific actionable recommendations
    """
    response = llm.query(prompt)
    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"],
            },
            "text_feedback": response.content,
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def save_results_node(state: EvaluationState) -> dict:
    """Save results to files (keeps the interface compatible)"""
    from shinka.core.wrap_eval import save_json_results
    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"],
    )
    # Save the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0),
        }, f, indent=2)
    return {}

# ============================================================================
# 4. Build the graph
# ============================================================================
def create_evaluation_agent():
    """Create the evaluation agent workflow"""
    workflow = StateGraph(EvaluationState)
    # Add the nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)
    # Define the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)
    return workflow.compile()

# ============================================================================
# 5. Main entry point (CLI-compatible)
# ============================================================================
def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()
    # Create the agent
    agent = create_evaluation_agent()
    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": [],
    }
    # Run the agent
    final_state = agent.invoke(initial_state)
    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")

if __name__ == "__main__":
    main()
```
Integration with the existing system:
```python
# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # the LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path automatically
        "agent_mode": "adaptive",
    },
)
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)
evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```
Option B: OpenHands (if strong isolation is needed) ⭐⭐⭐☆☆
Use this if you need:
- A fully isolated sandbox environment
- To execute untrusted code
- Complex code generation and testing
Architecture:
```python
from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)

    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace,
        )
        result = await self.execute(task)
        return result
```
Downside: requires a Docker environment and is fairly heavyweight.
Option C: Hybrid (lightweight core + optional heavyweight sandbox) ⭐⭐⭐⭐☆
Strategy:
- Use LangGraph for the core (lightweight)
- Optionally call OpenHands when a sandbox is required
- Best flexibility
```python
# LangGraph for the core
agent = create_langgraph_agent()

# Selected tools can optionally route through OpenHands
@tool
def execute_untrusted_code(code: str) -> dict:
    """Execute potentially untrusted code in OpenHands sandbox"""
    if use_sandbox:
        return openhands_execute(code)
    else:
        return safe_local_execute(code)
```
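`safe_local_execute` is left undefined above; one hedged way to approximate it without Docker is a subprocess with a timeout. Note this only bounds runtime and isolates interpreter state, it is not real sandboxing (the function name and return shape are assumptions, not an existing API):

```python
import os
import subprocess
import sys
import tempfile


def safe_local_execute(code: str, timeout_s: float = 10.0) -> dict:
    """Run code in a separate Python process with a wall-clock timeout.

    NOT a real sandbox: no filesystem or network isolation. For truly
    untrusted code, fall back to a containerized runner such as OpenHands.
    """
    # Write the code to a temp file so the child process can run it
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {
            "ok": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    finally:
        os.unlink(path)


result = safe_local_execute("print(2 + 2)")
```

The `ok`/`stdout`/`stderr` dict shape mirrors what a sandboxed backend could return, so the two implementations stay interchangeable behind the same tool.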
📦 Dependency Installation
LangGraph option

```text
# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0     # if using Claude
langchain-google-genai>=2.0.0  # if using Gemini
langchain-community>=0.3.0
```

```bash
# Install
pip install langgraph langchain-core langchain-google-genai
```
OpenHands option

```bash
# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai
```
🎯 Final Recommendation
Recommendation: LangGraph ✨
Reasons:
- ✅ Perfect fit: the state-graph design maps naturally onto the evaluation flow
- ✅ Rich tooling: the LangChain ecosystem has plenty of ready-made tools
- ✅ Lightweight: no Docker needed, easy to deploy
- ✅ Observable: built-in visualization and debugging tools
- ✅ Flexible: switching LLM backends (Gemini/GPT/Claude) is straightforward
- ✅ Production-ready: maintained by the LangChain team with an active community
Advantages of the tool format:

```python
# LangChain's tool format is the de-facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (used by the LLM to understand the tool)"""
    # implementation
    return result

# JSON Schema is generated automatically
# Argument validation is handled automatically
# The tool plugs into the agent automatically
```
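The "automatic JSON Schema" behind `@tool` comes from reading the function's type hints and docstring. Here is a simplified stdlib sketch of that mechanism (LangChain's actual implementation uses Pydantic and covers many more cases; `tool_schema` and the type map are illustrative):

```python
import inspect
from typing import get_type_hints

# Minimal Python-type -> JSON-Schema-type mapping (illustrative subset)
_PY_TO_JSON = {str: "string", int: "integer", float: "number",
               bool: "boolean", dict: "object", list: "array"}


def tool_schema(fn) -> dict:
    """Derive a JSON-Schema-like tool description from a function's
    signature and docstring — the same information a @tool-style
    decorator feeds to the LLM for function calling."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    sig = inspect.signature(fn)
    properties = {
        name: {"type": _PY_TO_JSON.get(hints.get(name), "string")}
        for name in sig.parameters
    }
    # Parameters without defaults are required
    required = [
        name for name, p in sig.parameters.items()
        if p.default is inspect.Parameter.empty
    ]
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object",
                       "properties": properties,
                       "required": required},
    }


def my_tool(arg1: str, arg2: int = 0) -> dict:
    """Tool description (for the LLM)."""
    return {}


schema = tool_schema(my_tool)
```

This is why well-typed signatures and docstrings matter when writing tools: they are not just documentation, they become the LLM-facing interface.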
Extensibility:
- Adding a new tool only requires defining a new function
- The `@tool` decorator registers it automatically
- Existing LangChain tools (search, API calls, etc.) can be composed in
🚀 Quick Start

```bash
# 1. Install the dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Configure LocalJobConfig to use the new agent
#    (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py
```
📊 Summary Comparison Table

| Framework | Fit | Learning curve | Lightweight | Tool ecosystem | Recommendation |
|---|---|---|---|---|---|
| LangGraph | ⭐⭐⭐⭐⭐ | Medium | ✅ | ⭐⭐⭐⭐⭐ | 🏆 Top pick |
| OpenHands | ⭐⭐⭐☆☆ | High | ❌ | ⭐⭐⭐☆☆ | Niche cases |
| CrewAI | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| Autogen | ⭐⭐⭐☆☆ | High | ⚠️ | ⭐⭐⭐⭐☆ | Conversation scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | Medium | ✅ | ⭐⭐⭐☆☆ | Enterprise scenarios |
| Self-built | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐☆☆☆ | Learning exercises |
With LangGraph, you can:
- ✅ Adopt the industry-standard tool format (the `@tool` decorator)
- ✅ Leverage the rich LangChain tool ecosystem
- ✅ Keep the code concise and maintainable
- ✅ Extend with new features easily
- ✅ Integrate cleanly into the existing system
Start building your Evaluation Agent with LangGraph! 🚀