# Agent Framework Comparison and Selection
## 🎯 Requirements Recap
We need an agent framework to implement the Evaluation Agent. Core requirements:
1. **Tool-calling system**: execute a variety of tools (run programs, query the database, generate code, etc.)
2. **LLM integration**: call multiple LLMs (Gemini, GPT, etc.)
3. **Sandboxed execution**: safely execute LLM-generated code
4. **Extensibility**: easy to add new tools
5. **Interface compatibility**: can be wrapped as a command-line tool that writes JSON output files
6. **Cost control**: can cap API call counts and spend
---
## 📊 Open-Source Agent Framework Comparison
### 1. **OpenHands** (formerly OpenDevin)
**Website**: https://github.com/All-Hands-AI/OpenHands
**Highlights**:

✅ Strengths:
- Focused on code-centric tasks (a good fit for evaluation scenarios)
- Built-in sandbox environment (Docker-based)
- Supports multiple LLM backends
- Agents can run bash commands, read/write files, and execute Python
- Powerful tool system

❌ Weaknesses:
- Relatively heavyweight (requires a Docker environment)
- Designed primarily for software-development tasks
- Likely over-engineered for an evaluation scenario

**Suitability**: ⭐⭐⭐☆☆ (3.5/5)
- Good if you need complex code generation and execution
- Good if you need an isolated sandbox environment
- But likely too heavy for the evaluation task
---
### 2. **LangGraph** (LangChain ecosystem)
**Website**: https://github.com/langchain-ai/langgraph
**Highlights**:

✅ Strengths:
- Graph/state-machine design with a clear control flow
- LangChain ecosystem with a rich tool library
- Lightweight and easy to integrate
- State persistence
- Supports loops and conditional branches
- Strong tool-calling capabilities

✅ Especially suited for:
- Agents that need complex decision flows
- Multi-step reasoning and tool calls
- State management and backtracking
**Example architecture**:
```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define the state
class EvaluationState(TypedDict):
    program_path: str
    results_dir: str
    db_context: dict
    metrics: dict
    reasoning: list

# Define nodes (each node is a tool call or a decision point)
def run_program(state):
    result = run_shinka_eval(state["program_path"])  # existing evaluation entry point
    return {"metrics": result}

def query_database(state):
    context = db.get_historical_context()  # `db` is supplied by the surrounding code
    return {"db_context": context}

def llm_analyze(state):
    plan = llm.plan(state["db_context"], state["metrics"])
    return {"reasoning": plan}

# Build the graph
workflow = StateGraph(EvaluationState)
workflow.add_node("run_program", run_program)
workflow.add_node("query_db", query_database)
workflow.add_node("analyze", llm_analyze)
workflow.set_entry_point("run_program")
workflow.add_edge("run_program", "query_db")
workflow.add_edge("query_db", "analyze")
workflow.add_edge("analyze", END)
agent = workflow.compile()
```
**Suitability**: ⭐⭐⭐⭐⭐ (5/5)
- **Strongly recommended!** A near-perfect fit for the evaluation agent's needs
- Lightweight yet feature-complete
- Clear flow, easy to debug
---
### 3. **CrewAI**
**Website**: https://github.com/joaomdmoura/crewAI
**Highlights**:

✅ Strengths:
- Multi-agent collaboration (e.g., dedicated evaluator and analyzer roles)
- Built-in role and task system
- Easy to use

❌ Weaknesses:
- Designed mainly for multi-agent scenarios
- Likely over-engineered for a single evaluation agent

**Suitability**: ⭐⭐⭐☆☆ (3/5)
- Useful if this later grows into multiple collaborating evaluation agents
- Probably unnecessary for the current single-agent scenario
---
### 4. **Autogen** (Microsoft)
**Website**: https://github.com/microsoft/autogen
**Highlights**:

✅ Strengths:
- Backed by Microsoft, well maintained
- Supports multi-agent conversations
- Code execution environment
- Tool-calling system

❌ Weaknesses:
- Designed mainly for conversational scenarios
- Relatively complex

**Suitability**: ⭐⭐⭐☆☆ (3/5)
---
### 5. **Semantic Kernel** (Microsoft)
**Website**: https://github.com/microsoft/semantic-kernel
**Highlights**:

✅ Strengths:
- Lightweight plugin system
- Multi-language support (Python, C#, Java)
- Enterprise-grade design
- Function calling and planning

❌ Weaknesses:
- Relatively low-level; more must be built yourself
- Comparatively sparse documentation

**Suitability**: ⭐⭐⭐⭐☆ (4/5)
---
### 6. **Building a Lightweight Agent Framework Ourselves**
**Highlights**:

✅ Strengths:
- Full control
- Lightweight
- Purpose-built
- No extra dependencies

❌ Weaknesses:
- The tool system must be implemented from scratch
- No established patterns to lean on

**Suitability**: ⭐⭐⭐☆☆ (3/5)
---
## 🏆 Recommended Approaches
### **Option A: LangGraph (strongly recommended) ⭐⭐⭐⭐⭐**
**Rationale**:
1. ✅ **Great fit**: the graph/state-machine design maps naturally onto the evaluation flow
2. ✅ **Rich tool ecosystem**: LangChain offers many ready-made tools
3. ✅ **Lightweight**: no heavyweight dependencies such as Docker
4. ✅ **Observable**: built-in state tracking and visualization
5. ✅ **Easy to extend**: adding a tool is just defining a new node
6. ✅ **Cost control**: supports token counting, on top of which a budget cap can be built
**Architecture example**:
```python
# evaluation_agent_langgraph.py
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langchain_core.tools import tool
import operator
import numpy as np

# ============================================================================
# 1. Define the state
# ============================================================================
class EvaluationState(TypedDict):
    """The agent's state"""
    # Inputs
    program_path: str
    results_dir: str
    db_path: str
    # Intermediate state
    program_result: dict
    db_context: dict
    auxiliary_metrics: dict
    # Agent reasoning
    plan: List[str]
    tool_calls: Annotated[list, operator.add]  # accumulated tool calls
    # Outputs
    final_metrics: dict
    correct: bool
    error: str | None
    # Metadata
    total_cost: float
    reasoning: List[str]
# ============================================================================
# 2. Define tools (LangChain format)
# ============================================================================
@tool
def run_program(program_path: str) -> dict:
    """Run the program and get raw results (centers, radii, score)"""
    from shinka.core import run_shinka_eval
    # Call the existing evaluation logic
    result = run_shinka_eval(
        program_path=program_path,
        results_dir="temp",
        experiment_fn_name="run_packing",
        num_runs=1,
        # ...
    )
    return {
        "centers": result["centers"].tolist(),
        "radii": result["radii"].tolist(),
        "score": result["score"],
    }

@tool
def validate_packing(centers: list, radii: list) -> dict:
    """Validate if packing satisfies all constraints"""
    from examples.circle_packing.evaluate import adapted_validate_packing
    is_valid, error = adapted_validate_packing((
        np.array(centers),
        np.array(radii),
        sum(radii),
    ))
    return {"valid": is_valid, "error": error}

@tool
def query_historical_best(db_path: str, metric: str = "combined_score") -> dict:
    """Query the best historical program from database"""
    from shinka.database import ProgramDatabase
    db = ProgramDatabase.load(db_path)
    best_program = db.get_best_program(metric=metric)
    return {
        "id": best_program.id,
        "score": best_program.combined_score,
        "generation": best_program.generation,
        "metrics": best_program.public_metrics,
    }

@tool
def compute_auxiliary_metric(metric_name: str, centers: list, radii: list) -> dict:
    """Compute a predefined auxiliary metric"""
    from examples.circle_packing.auxiliary_eval import METRIC_REGISTRY
    metric_func = METRIC_REGISTRY.get(metric_name)
    if not metric_func:
        return {"error": f"Metric {metric_name} not found"}
    result = metric_func(np.array(centers), np.array(radii))
    return {
        "name": result.name,
        "value": result.value,
        "interpretation": result.interpretation,
        "details": result.details,
    }

@tool
def generate_new_metric_code(purpose: str, context: dict) -> dict:
    """Use LLM to generate code for a new evaluation metric"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro")
    prompt = f"""
Generate a Python function for a new circle packing evaluation metric.
Purpose: {purpose}
Current context: {context}
Requirements:
1. Function signature: def metric_name(centers: np.ndarray, radii: np.ndarray) -> MetricResult
2. Use numpy for computations
3. Return MetricResult with name, value, interpretation, description
Generate clean, executable code.
"""
    response = llm.query(prompt)
    code = extract_code(response.content)  # extract_code: helper that pulls the code block out of the reply
    return {"code": code, "cost": response.cost}
# ============================================================================
# 3. Define nodes (the agent's behavior)
# ============================================================================
def run_program_node(state: EvaluationState) -> dict:
    """Run the program and collect its results"""
    result = run_program.invoke({"program_path": state["program_path"]})
    return {
        "program_result": result,
        "tool_calls": [{"tool": "run_program", "result": result}],
    }

def query_database_node(state: EvaluationState) -> dict:
    """Query the database for historical context"""
    if not state.get("db_path"):
        return {"db_context": {}}
    best = query_historical_best.invoke({"db_path": state["db_path"]})
    return {
        "db_context": {"best_program": best},
        "tool_calls": [{"tool": "query_historical_best", "result": best}],
    }

def llm_planning_node(state: EvaluationState) -> dict:
    """Have the LLM plan the next evaluation steps"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
You are evaluating a circle packing program.
Current result:
- Score: {state["program_result"]["score"]}
- Number of circles: {len(state["program_result"]["centers"])}
Historical context:
- Best score: {state["db_context"].get("best_program", {}).get("score", "unknown")}
Available tools:
- compute_auxiliary_metric: Compute metrics like packing_efficiency, gap_analysis, etc.
- generate_new_metric_code: Generate new evaluation metric if needed
Decide which metrics to compute and what analysis to perform.
Output a JSON plan with steps.
"""
    response = llm.query(prompt)
    plan = parse_plan(response.content)  # parse_plan: helper that parses the JSON plan out of the reply
    return {
        "plan": plan,
        "reasoning": [f"LLM planning: {response.content}"],
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def execute_metrics_node(state: EvaluationState) -> dict:
    """Compute the planned auxiliary metrics"""
    metrics = {}
    for metric_name in state.get("plan", []):
        if metric_name.startswith("compute_"):
            metric_result = compute_auxiliary_metric.invoke({
                "metric_name": metric_name.replace("compute_", ""),
                "centers": state["program_result"]["centers"],
                "radii": state["program_result"]["radii"],
            })
            metrics[metric_name] = metric_result
    return {
        "auxiliary_metrics": metrics,
        "tool_calls": [{"tool": "compute_auxiliary_metric", "results": metrics}],
    }

def generate_feedback_node(state: EvaluationState) -> dict:
    """Generate the final feedback"""
    from shinka.llm import LLM
    llm = LLM("native-gemini-2.5-pro", temperature=0.7)
    prompt = f"""
Generate evaluation feedback for a circle packing program.
Results:
- Primary score: {state["program_result"]["score"]}
- Auxiliary metrics: {state["auxiliary_metrics"]}
- Historical best: {state["db_context"].get("best_program", {}).get("score")}
Provide:
1. Performance summary
2. Comparison with historical best
3. Specific actionable recommendations
"""
    response = llm.query(prompt)
    return {
        "final_metrics": {
            "combined_score": state["program_result"]["score"],
            "public": {
                "num_circles": len(state["program_result"]["centers"]),
                **state["auxiliary_metrics"],
            },
            "text_feedback": response.content,
        },
        "correct": True,
        "error": None,
        "total_cost": state.get("total_cost", 0) + response.cost,
    }

def save_results_node(state: EvaluationState) -> dict:
    """Save results to disk (keeps the interface compatible)"""
    from shinka.core.wrap_eval import save_json_results
    save_json_results(
        results_dir=state["results_dir"],
        metrics=state["final_metrics"],
        correct=state["correct"],
        error=state["error"],
    )
    # Save the agent's reasoning trace
    import json
    with open(f"{state['results_dir']}/agent_reasoning.json", "w") as f:
        json.dump({
            "plan": state.get("plan", []),
            "tool_calls": state.get("tool_calls", []),
            "reasoning": state.get("reasoning", []),
            "total_cost": state.get("total_cost", 0),
        }, f, indent=2)
    return {}
# ============================================================================
# 4. Build the graph
# ============================================================================
def create_evaluation_agent():
    """Create the evaluation agent workflow"""
    workflow = StateGraph(EvaluationState)
    # Add the nodes
    workflow.add_node("run_program", run_program_node)
    workflow.add_node("query_database", query_database_node)
    workflow.add_node("llm_planning", llm_planning_node)
    workflow.add_node("execute_metrics", execute_metrics_node)
    workflow.add_node("generate_feedback", generate_feedback_node)
    workflow.add_node("save_results", save_results_node)
    # Wire up the flow
    workflow.set_entry_point("run_program")
    workflow.add_edge("run_program", "query_database")
    workflow.add_edge("query_database", "llm_planning")
    workflow.add_edge("llm_planning", "execute_metrics")
    workflow.add_edge("execute_metrics", "generate_feedback")
    workflow.add_edge("generate_feedback", "save_results")
    workflow.add_edge("save_results", END)
    return workflow.compile()

# ============================================================================
# 5. Main entry point (command-line compatible)
# ============================================================================
def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--program_path", required=True)
    parser.add_argument("--results_dir", required=True)
    parser.add_argument("--db_path", default=None)
    parser.add_argument("--agent_mode", default="adaptive")
    args = parser.parse_args()
    # Create the agent
    agent = create_evaluation_agent()
    # Initial state
    initial_state = {
        "program_path": args.program_path,
        "results_dir": args.results_dir,
        "db_path": args.db_path,
        "tool_calls": [],
        "total_cost": 0.0,
        "reasoning": [],
    }
    # Run the agent
    final_state = agent.invoke(initial_state)
    print("✅ Evaluation completed!")
    print(f"Score: {final_state['final_metrics']['combined_score']}")
    print(f"Total cost: ${final_state['total_cost']:.4f}")
    print(f"Results saved to: {args.results_dir}")

if __name__ == "__main__":
    main()
```
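The `total_cost` field in the sketch only records spend; to actually enforce requirement 6 (cost control), a small budget guard can be threaded through the LLM-calling nodes. This is a stdlib-only sketch under the assumption that each LLM response exposes a `cost` figure, as in the code above; the class and exception names are made up for illustration:

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated LLM spend passes the configured cap."""


class CostBudget:
    """Accumulates per-call costs and fails fast once a cap is reached (hypothetical helper)."""

    def __init__(self, max_cost_usd: float):
        self.max_cost_usd = max_cost_usd
        self.spent = 0.0

    def charge(self, cost: float) -> float:
        """Record one LLM call's cost; raise if the budget is exhausted."""
        self.spent += cost
        if self.spent > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.4f} of a ${self.max_cost_usd:.2f} budget"
            )
        return self.spent
```

Each node that calls the LLM would then run `budget.charge(response.cost)` right after `llm.query(...)`, and the graph could route to `save_results` early when `BudgetExceeded` is caught, so partial results are still written.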
**Integration with the existing system**:
```python
# my/run_circle_packing_WITH_agent.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.database import DatabaseConfig
from shinka.launch import LocalJobConfig

# Use the LangGraph agent as the evaluator
job_config = LocalJobConfig(
    eval_program_path="evaluation_agent_langgraph.py",  # LangGraph agent
    extra_cmd_args={
        "db_path": "auto",  # pass the database path through automatically
        "agent_mode": "adaptive",
    },
)
db_config = DatabaseConfig(
    db_path="evolution_db.sqlite",
    # ... other settings
)
evo_config = EvolutionConfig(
    use_text_feedback=True,
    # ... other settings
)
runner = EvolutionRunner(
    job_config=job_config,
    db_config=db_config,
    evo_config=evo_config,
)
runner.run()
```
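The planning node in the agent sketch calls a `parse_plan` helper that is left undefined there. One stdlib-only way to implement it, under the assumption that the LLM was asked to output a JSON plan and may or may not wrap it in a code fence:

```python
import json
import re


def parse_plan(reply: str) -> list:
    """Extract a JSON plan (a list of step names) from an LLM reply.

    Accepts either a bare JSON array/object or one wrapped in a ``` fence.
    Returns an empty list when nothing parseable is found.
    """
    fenced = re.search(r"```(?:json)?\s*(.*?)```", reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return []
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return data.get("steps", [])
    return []
```

Falling back to an empty plan (rather than raising) keeps the graph moving even when the model's output is malformed; `execute_metrics_node` then simply computes nothing.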
---
### **Option B: OpenHands (if strong isolation is needed)** ⭐⭐⭐☆☆
**When to use**: you need:
- A fully isolated sandbox environment
- To execute untrusted code
- Complex code generation and testing
**Architecture**:
```python
from openhands.core import Agent, Task

class EvaluationAgent(Agent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)

    async def evaluate(self, program_path, results_dir):
        # Run the evaluation inside the sandbox
        task = Task(
            instruction=f"Evaluate the program at {program_path}",
            workspace=self.workspace,
        )
        result = await self.execute(task)
        return result
```
**Downside**: requires a Docker environment; relatively heavyweight.
---
### **Option C: Hybrid (lightweight core + optional heavyweight sandbox)** ⭐⭐⭐⭐☆
**Strategy**:
- Use **LangGraph** for the core (lightweight)
- Optionally call out to **OpenHands** when a sandbox is needed
- Best flexibility
```python
# The core is LangGraph
agent = create_langgraph_agent()

# Individual tools can opt into the OpenHands sandbox
@tool
def execute_untrusted_code(code: str) -> dict:
    """Execute potentially untrusted code in OpenHands sandbox"""
    if use_sandbox:  # config flag; openhands_execute / safe_local_execute are helpers to supply
        return openhands_execute(code)
    else:
        return safe_local_execute(code)
```
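For the non-sandboxed branch, a `safe_local_execute` along these lines is one possible stdlib-only implementation: run the snippet in a child process with a hard timeout so a hung or runaway script cannot stall the agent. The function name matches the sketch above, but the behavior is an assumption, and a plain subprocess is not a real security boundary against hostile code (that is what the OpenHands branch is for):

```python
import subprocess
import sys


def safe_local_execute(code: str, timeout_s: float = 10.0) -> dict:
    """Run a Python snippet in a child process with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }
```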
---
## 📦 Installing Dependencies
### LangGraph option
```bash
# requirements.txt
langgraph>=0.2.0
langchain-core>=0.3.0
langchain-anthropic>=0.2.0    # if using Claude
langchain-google-genai>=2.0.0 # if using Gemini
langchain-community>=0.3.0

# Install
pip install langgraph langchain-core langchain-google-genai
```
### OpenHands option
```bash
# OpenHands requires Docker
docker pull ghcr.io/all-hands-ai/openhands:latest

# Python client
pip install openhands-ai
```
---
## 🎯 Final Recommendation
### **Recommendation: LangGraph ✨**
**Rationale**:
1. ✅ **Great fit**: the graph/state-machine design maps naturally onto the evaluation flow
2. ✅ **Rich tooling**: the LangChain ecosystem offers many ready-made tools
3. ✅ **Lightweight**: no Docker required, easy to deploy
4. ✅ **Observable**: built-in visualization and debugging tools
5. ✅ **Flexible**: easy to swap LLM backends (Gemini/GPT/Claude)
6. ✅ **Production-ready**: maintained by the LangChain team, with an active community
**Tool-format advantages**:
```python
# LangChain's tool format is a de facto industry standard
@tool
def my_tool(arg1: str, arg2: int) -> dict:
    """Tool description (used by the LLM to understand the tool)"""
    # Implementation
    return result

# A JSON schema is generated automatically
# Argument validation is handled automatically
# The tool integrates into the agent automatically
```
**Extensibility**:
- Adding a new tool is just defining a new function
- The `@tool` decorator registers it automatically
- Existing LangChain tools (search, API calls, etc.) can be composed in
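The registration pattern those bullets describe does not require LangChain to understand. A stdlib sketch of a decorator-driven tool registry (the names and the toy metric here are illustrative, not LangChain's API):

```python
import inspect

TOOL_REGISTRY: dict = {}  # tool name -> {"fn": ..., "doc": ..., "params": ...}


def registered_tool(fn):
    """Register a function as a tool, capturing its docstring and parameter names."""
    TOOL_REGISTRY[fn.__name__] = {
        "fn": fn,
        "doc": inspect.getdoc(fn),
        "params": list(inspect.signature(fn).parameters),
    }
    return fn


@registered_tool
def packing_efficiency(total_radii: float, container_radius: float = 1.0) -> float:
    """Toy metric: ratio of summed radii to the container radius."""
    return total_radii / container_radius
```

LangChain's `@tool` additionally derives a JSON schema from the type hints so the LLM can emit valid arguments; the registry above only captures the pieces needed to see how that mechanism works.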
---
## 🚀 Quick Start
```bash
# 1. Install dependencies
pip install langgraph langchain-google-genai

# 2. Create the agent file
cp template_evaluation_agent_langgraph.py evaluation_agent.py

# 3. Point LocalJobConfig at the new agent
#    (see the integration example above)

# 4. Run
python my/run_circle_packing_WITH_agent.py
```
---
## 📊 Summary Table
| Framework | Fit | Learning curve | Lightweight | Tool ecosystem | Recommendation |
|------|--------|----------|--------|----------|--------|
| **LangGraph** | ⭐⭐⭐⭐⭐ | Medium | ✅ | ⭐⭐⭐⭐⭐ | **🏆 Strongly recommended** |
| OpenHands | ⭐⭐⭐☆☆ | High | ❌ | ⭐⭐⭐☆☆ | Special cases |
| CrewAI | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐⭐☆☆ | Multi-agent scenarios |
| Autogen | ⭐⭐⭐☆☆ | High | ⚠️ | ⭐⭐⭐⭐☆ | Conversational scenarios |
| Semantic Kernel | ⭐⭐⭐⭐☆ | Medium | ✅ | ⭐⭐⭐☆☆ | Enterprise settings |
| Build-your-own | ⭐⭐⭐☆☆ | Low | ✅ | ⭐⭐☆☆☆ | Learning exercise |
---
With **LangGraph**, you can:
1. ✅ Adopt the de facto standard tool format (the `@tool` decorator)
2. ✅ Tap into the rich LangChain tool ecosystem
3. ✅ Keep the code concise and maintainable
4. ✅ Extend functionality easily
5. ✅ Integrate cleanly with the existing system
**Start building your Evaluation Agent with LangGraph!** 🚀