# EV2 Migration Plan: From Wrapper to Standalone

## 🎯 Goal

Migrate all of the logic in `ev2.py` into `ev2_service_standalone.py` to create a complete, self-contained evaluation service.

**Design principles**:
1. ✅ No dependency on `ev2.py` (fully standalone)
2. ✅ Preserve all existing functionality
3. ✅ Prepare for future extensions (MetricUnit, Lifecycle, etc.)
4. ✅ Clearer architecture and state management

---

## 📊 Current vs. Target Architecture

### Current architecture (ev2_service.py)

```
ev2_service.py (HTTP wrapper)
↓ calls
ev2.py (Agent logic)
↓ uses
OpenHands Agent
```

**Problems**:
- Two layers of abstraction, with state scattered across them
- Hard to integrate deeply
- `ev2.py` is designed around the assumption of a single run

### Target architecture (ev2_service_standalone.py)

```
ev2_service_standalone.py
├── FastAPI HTTP Server
├── ServiceState (persistent state management)
├── IntegratedEV2Agent (manages OpenHands directly)
│   ├── Agent instance (persistent)
│   ├── Memory management
│   └── Conversation history
└── MetricRegistry (optional, prepared for future use)
```

**Advantages**:
- Single responsibility, with the logic in one place
- Persistent agent, no need to rebuild it on every call
- Better state management
- Paves the way for advanced features such as MetricUnit

---

## 🔧 Migration Steps

### Phase 1: Core Agent class (highest priority)

**Goal**: create an `IntegratedEV2Agent` class that replaces calls to `evolution_evaluation_agent()`

#### 1.1 Create the agent management class

```python
# ev2_service_standalone.py

import logging
from pathlib import Path
from typing import Any, Dict, List, Optional

# NOTE: OpenHands import paths are kept as written in this plan;
# verify them against the installed OpenHands version.
from openhands.agent import Agent
from openhands.llm import LLM
from openhands.tools import TerminalTool, FileEditorTool, TaskTrackerTool
from openhands.tools.tool import Tool

logger = logging.getLogger(__name__)


class IntegratedEV2Agent:
    """
    Integrated EV2 Agent (not a wrapper)

    Directly manages OpenHands agent lifecycle and state
    """

    def __init__(
        self,
        results_dir: str,
        primary_evaluator_path: str,
        config: Dict[str, Any],
    ):
        self.results_dir = Path(results_dir).resolve()
        self.primary_evaluator_path = Path(primary_evaluator_path).resolve()
        self.config = config

        # Memory directory (persistent)
        self.memory_dir = self.results_dir / "eval_agent_memory"
        self.memory_dir.mkdir(parents=True, exist_ok=True)

        # Initialize OpenHands agent (persistent!)
        self.agent = self._create_agent()

        # Conversation history (accumulates across generations)
        self.conversation_history = []

        logger.info("✅ IntegratedEV2Agent initialized")
        logger.info(f"   Memory dir: {self.memory_dir}")
        logger.info(f"   Primary evaluator: {self.primary_evaluator_path}")

    def _create_agent(self) -> Agent:
        """
        Create OpenHands agent

        Migrated from ev2.py:evolution_evaluation_agent()
        """
        # LLM setup
        llm = LLM(model="anthropic/claude-sonnet-4-20250514")

        # System prompt
        prompt_path = Path(__file__).parent / "ev2_prompt.j2"
        if not prompt_path.exists():
            raise FileNotFoundError(f"Prompt template not found: {prompt_path}")

        # Create agent with tools
        agent = Agent(
            llm=llm,
            tools=[
                Tool(name=TerminalTool.name),
                Tool(name=FileEditorTool.name),
                Tool(name=TaskTrackerTool.name),
            ],
            system_prompt_filename=str(prompt_path),
        )

        return agent

    async def analyze_generation(self, generation: int) -> Dict[str, Any]:
        """
        Analyze a generation

        This is the main entry point, replacing evolution_evaluation_agent()
        """
        logger.info(f"🧠 Analyzing generation {generation}...")

        # Build task message
        task = self._build_task_message(generation)

        # Run agent (the agent writes its output into the memory directory)
        result = await self._run_agent(task)

        # Extract results from the files the agent produced
        insights = self._extract_insights()
        metrics = self._extract_metrics()

        return {
            "success": True,
            "insights": insights,
            "auxiliary_metrics": metrics,
            "generation": generation,
        }

    def _build_task_message(self, generation: int) -> str:
        """
        Build task message for agent

        Migrated from ev2.py:_build_default_task()
        """
        # Read primary evaluator code
        primary_code = ""
        if self.primary_evaluator_path.exists():
            primary_code = self.primary_evaluator_path.read_text()

        # Check for generation directory
        gen_dir = self._find_generation_dir(generation)

        task = f"""# Evolution Evaluation Task - Generation {generation}

## Your Mission

You are analyzing the evolution process for a code optimization task. Your workspace is:
`{self.memory_dir}`

## Current Generation

Generation: {generation}
Results directory: {gen_dir if gen_dir else 'Not found'}

## Primary Evaluator (Fixed, DO NOT MODIFY)

The ground truth evaluation is defined in:
`{self.primary_evaluator_path}`

**CRITICAL**: You MUST NOT modify this file. Read it to understand the primary objective.

## Your Tasks

1. **READ** the primary evaluator to understand the ground truth objective
2. **ANALYZE** the current generation's performance and strategy
3. **CREATE** auxiliary evaluation metrics that provide insights beyond the primary score
4. **UPDATE** EVAL_AGENTS.md with your findings and recommendations

## Workspace Structure

Your workspace (`{self.memory_dir}`) should contain:
- `EVAL_AGENTS.md`: Your accumulated insights and analysis
- `auxiliary_metrics.py`: Python code for auxiliary metrics
- Any other analysis files you create

## Constraints

- Primary metric is FIXED - you cannot change it
- Auxiliary metrics should complement, not replace, the primary metric
- Focus on actionable insights that can guide the evolution process

## Output

Update EVAL_AGENTS.md with:
- Analysis of generation {generation}
- Auxiliary metric definitions and values
- Insights and recommendations for future generations

Begin your analysis!
"""

        return task

    def _find_generation_dir(self, generation: int) -> Optional[Path]:
        """Find the generation directory"""
        # Try common locations
        candidates = [
            self.results_dir / f"gen_{generation}",
            self.results_dir.parent / f"gen_{generation}",
        ]

        for candidate in candidates:
            if candidate.exists():
                return candidate

        return None

    async def _run_agent(self, task: str) -> Dict[str, Any]:
        """
        Run the agent with a task

        This is where we'd integrate async execution if needed
        """
        # For now, call synchronously (OpenHands is sync)
        # Could wrap in asyncio.to_thread() for true async

        # NOTE: This is simplified - actual OpenHands integration
        # would involve message passing, observation handling, etc.
        # We'll keep it simple for migration

        logger.info(f"📝 Task length: {len(task)} chars")

        # In ev2.py, the agent is run via Agent's API
        # We'll need to properly integrate this

        return {"status": "completed"}

    def _extract_insights(self) -> List[str]:
        """Extract insights from EVAL_AGENTS.md"""
        eval_agents_md = self.memory_dir / "EVAL_AGENTS.md"

        if not eval_agents_md.exists():
            return []

        insights = []
        content = eval_agents_md.read_text()

        # Simple extraction - look for bullet points
        for line in content.split('\n'):
            if line.strip().startswith(('*', '-')):
                insights.append(line.strip())

        return insights[-10:]  # Last 10 insights

    def _extract_metrics(self) -> Dict[str, Any]:
        """Extract auxiliary metrics"""
        auxiliary_py = self.memory_dir / "auxiliary_metrics.py"

        if not auxiliary_py.exists():
            return {}

        # Could dynamically import and execute
        # For now, just check existence
        return {
            "auxiliary_metrics_file_exists": True,
            "file_path": str(auxiliary_py),
        }
```
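
The `_extract_metrics` method above only checks that `auxiliary_metrics.py` exists, and its comment notes that the file could be imported and executed. A minimal sketch of that step follows. It assumes a hypothetical convention in which the agent writes each metric as a zero-argument function whose name starts with `metric_`; the helper name `load_auxiliary_metrics` and that convention are illustrative assumptions, not something taken from `ev2.py`.

```python
import importlib.util
from pathlib import Path
from typing import Any, Dict


def load_auxiliary_metrics(path: Path) -> Dict[str, Any]:
    """Import a metrics file by path and evaluate its metric functions.

    Assumption (hypothetical convention): every module-level callable whose
    name starts with "metric_" takes no arguments and returns a value.
    """
    if not path.exists():
        return {}

    spec = importlib.util.spec_from_file_location("auxiliary_metrics", path)
    if spec is None or spec.loader is None:
        return {}
    module = importlib.util.module_from_spec(spec)

    try:
        spec.loader.exec_module(module)  # executes agent-written code
    except Exception as exc:
        return {"load_error": repr(exc)}

    results: Dict[str, Any] = {}
    for name in sorted(vars(module)):
        func = getattr(module, name)
        if name.startswith("metric_") and callable(func):
            try:
                results[name] = func()
            except Exception as exc:  # one failing metric should not hide the rest
                results[name] = {"error": repr(exc)}
    return results
```

Because this runs agent-generated code inside the service process, running it in a separate subprocess may be the safer design if the service must stay up regardless of what the agent writes.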

#### 1.2 Integrate into the service

```python
class EV2ServiceStandalone:
    """
    Standalone EV2 Service (no dependency on ev2.py)
    """

    def __init__(self, config: ServiceConfig):
        self.config = config
        self.state = ServiceState(config)

        # Create integrated agent (PERSISTENT)
        self.agent = IntegratedEV2Agent(
            results_dir=config.results_dir,
            primary_evaluator_path=config.primary_evaluator_path,
            config=config.__dict__,
        )

    async def handle_generation_notification(self, request: GenerationCompleteRequest):
        """Handle generation notification"""
        # Decision logic (same as before)
        should_trigger, reason = self.state.should_trigger_agent(...)

        if should_trigger:
            # Call integrated agent (not ev2.py!)
            result = await self.agent.analyze_generation(request.generation)
            return result

        return {"status": "skipped"}
```
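
The arguments to `should_trigger_agent(...)` are elided above, and this plan does not spell out the trigger policy. Purely as an illustration, the sketch below assumes a policy that fires every `interval` generations, or when the best score (higher is better) has not improved over the last `patience` generations. `TriggerPolicy`, `interval`, `patience`, and the method signature are all hypothetical, not taken from `ev2_service.py`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TriggerPolicy:
    """Hypothetical trigger policy: periodic, or on score stagnation."""

    interval: int = 5   # trigger every `interval` generations
    patience: int = 3   # trigger after this many generations without improvement
    best_scores: List[float] = field(default_factory=list)

    def should_trigger_agent(self, generation: int, best_score: float) -> Tuple[bool, str]:
        self.best_scores.append(best_score)

        if generation > 0 and generation % self.interval == 0:
            return True, f"periodic trigger (every {self.interval} generations)"

        # Stagnation: the last `patience` scores are no better than anything seen before them.
        if len(self.best_scores) > self.patience:
            recent = self.best_scores[-self.patience:]
            earlier = self.best_scores[:-self.patience]
            if max(recent) <= max(earlier):
                return True, f"no improvement for {self.patience} generations"

        return False, "no trigger condition met"
```

Returning the reason alongside the decision mirrors the `(should_trigger, reason)` tuple unpacked in the handler, so the reason can be logged or returned with a skipped response.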

---

### Phase 2: Complete the agent integration (medium priority)

**Goal**: fully implement the interaction logic with the OpenHands agent

#### 2.1 Message handling

Migrate the agent run logic from `ev2.py`:

```python
async def _run_agent(self, task: str) -> Dict[str, Any]:
    """
    Run agent with proper message handling

    Migrated from ev2.py (simplified for now)
    """
    # This is where ev2.py uses Agent API
    # We need to properly integrate:
    # 1. Send task as message
    # 2. Handle agent observations
    # 3. Collect agent actions
    # 4. Wait for completion

    # For MVP, we can use the same approach as ev2.py
    # but with the persistent agent instance

    pass  # TODO: Implement based on OpenHands API
```
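
Phase 1 notes that the OpenHands call is synchronous and could be wrapped in `asyncio.to_thread()`. The sketch below shows that pattern in isolation, with a timeout added. `run_blocking_agent` is a hypothetical stand-in for whatever synchronous call `ev2.py` actually makes, and `run_agent_async` and `timeout_s` are illustrative names. It assumes Python 3.9 or later, where `asyncio.to_thread` is available.

```python
import asyncio
import time
from typing import Any, Dict


def run_blocking_agent(task: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the synchronous OpenHands run."""
    time.sleep(0.05)  # simulate blocking work
    return {"status": "completed", "task_chars": len(task)}


async def run_agent_async(task: str, timeout_s: float = 600.0) -> Dict[str, Any]:
    """Run the blocking call in a worker thread so the event loop stays responsive."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(run_blocking_agent, task),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        # The worker thread itself is not interrupted; only the await gives up.
        return {"status": "timeout"}
```

With this shape, the FastAPI event loop can keep answering status requests while an analysis is in progress. A timed-out run keeps occupying its thread until the blocking call returns, so the timeout bounds the response time, not the work.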

#### 2.2 Workspace management

```python
def _setup_workspace(self):
    """Setup agent workspace"""
    # Ensure directories exist
    self.memory_dir.mkdir(parents=True, exist_ok=True)

    # Initialize EVAL_AGENTS.md if needed
    eval_md = self.memory_dir / "EVAL_AGENTS.md"
    if not eval_md.exists():
        eval_md.write_text("""# Evaluation Agent Memory

This document tracks insights and metrics across generations.
""")
```

---

### Phase 3: State management enhancements (low priority)

**Goal**: prepare for future features such as MetricUnit

#### 3.1 MetricRegistry (skeleton)

```python
class MetricRegistry:
    """
    Registry for managing metrics

    Prepared for future MetricUnit integration
    """

    def __init__(self, memory_dir: Path):
        self.memory_dir = memory_dir
        self.metrics = {}  # id -> metadata

    def register_metric(self, metric_id: str, metadata: Dict[str, Any]):
        """Register a metric"""
        self.metrics[metric_id] = metadata

    def list_metrics(self) -> List[Dict[str, Any]]:
        """List all metrics"""
        return list(self.metrics.values())
```
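
The skeleton above stores `memory_dir` but keeps the metrics only in memory, so they would be lost on a service restart. One possible way to persist them is sketched below as a small JSON file in the memory directory. The class name `PersistentMetricRegistry`, the file name `metric_registry.json`, and the write-then-rename approach are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class PersistentMetricRegistry:
    """Metric registry that mirrors its contents to a JSON file (illustrative)."""

    FILENAME = "metric_registry.json"  # hypothetical file name

    def __init__(self, memory_dir: Path):
        self.memory_dir = Path(memory_dir)
        self.metrics: Dict[str, Dict[str, Any]] = {}
        self._load()

    @property
    def _path(self) -> Path:
        return self.memory_dir / self.FILENAME

    def _load(self) -> None:
        # Restore metrics written by a previous service run, if any.
        if self._path.exists():
            self.metrics = json.loads(self._path.read_text(encoding="utf-8"))

    def _save(self) -> None:
        # Write to a temporary file first, then rename, so a crash
        # mid-write does not leave a truncated registry behind.
        self.memory_dir.mkdir(parents=True, exist_ok=True)
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.metrics, indent=2), encoding="utf-8")
        tmp.replace(self._path)

    def register_metric(self, metric_id: str, metadata: Dict[str, Any]) -> None:
        self.metrics[metric_id] = metadata
        self._save()

    def list_metrics(self) -> List[Dict[str, Any]]:
        return list(self.metrics.values())
```

This only works for JSON-serializable metadata; richer metric objects would need their own serialization.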

---

## 📁 File Structure

```
eval_agent/
├── ev2_service_standalone.py   # NEW: complete standalone service
├── ev2_service.py              # OLD: kept for reference
├── ev2.py                      # OLD: kept as a standalone tool
├── ev2_prompt.j2               # SHARED: system prompt
├── ev2_service_config.yaml     # SHARED: configuration file
└── test_ev2_service.py         # SHARED: test script
```

**After migration**:
- `ev2_service_standalone.py`: used in production
- `ev2.py`: kept as a standalone command-line tool (optional)
- `ev2_service.py`: deleted, or renamed to `ev2_service_wrapper.py` (archived)

---

## 🚀 Implementation Timeline

### Day 1: Core migration (4-6 hours)

**Morning**:
- [ ] Create the basic structure of `ev2_service_standalone.py`
- [ ] Implement `IntegratedEV2Agent.__init__` and `_create_agent`
- [ ] Implement `_build_task_message`

**Afternoon**:
- [ ] Implement the `analyze_generation` method
- [ ] Integrate into the FastAPI service
- [ ] Fix import path issues

**Acceptance**: the service starts, receives notifications, and can call the agent (even in simplified form)

---

### Day 2: Refinement and testing (4-6 hours)

**Morning**:
- [ ] Complete the `_run_agent` method (if needed)
- [ ] Implement result extraction (`_extract_insights`, `_extract_metrics`)
- [ ] Add error handling

**Afternoon**:
- [ ] Run the full test suite (using `test_ev2_service.py`)
- [ ] Fix the issues found
- [ ] Optimize performance

**Acceptance**: a complete evolution simulation runs end to end, and the agent produces its output correctly

---

### Day 3: Cleanup and documentation (2-4 hours)

**Morning**:
- [ ] Clean up and refactor the code
- [ ] Add detailed comments
- [ ] Update the configuration file

**Afternoon**:
- [ ] Update the documentation
- [ ] Create usage examples
- [ ] Prepare for integration into ShinkaEvolve

**Acceptance**: high code quality, complete documentation, ready for production

---

## 📋 Migration Checklist

### Migrated from ev2.py

- [ ] **Agent creation logic**
  - [x] LLM configuration
  - [x] Tools configuration
  - [x] System prompt loading
  - [ ] Agent initialization parameters

- [ ] **Task construction logic**
  - [x] Primary evaluator path handling
  - [x] Generation information
  - [ ] Additional context (if needed)

- [ ] **Agent run logic**
  - [ ] Sending messages
  - [ ] Handling observations
  - [ ] Waiting for results

- [ ] **Result extraction logic**
  - [x] Parsing EVAL_AGENTS.md
  - [x] Detecting auxiliary_metrics.py
  - [ ] More sophisticated result parsing (optional)

- [ ] **Workspace management**
  - [x] Memory directory creation
  - [ ] Initial file creation
  - [ ] Cleanup logic (optional)

### New functionality

- [x] **HTTP API**
  - [x] Generation notification endpoint
  - [x] Status endpoint
  - [x] Manual trigger endpoint

- [x] **State management**
  - [x] Generation history
  - [x] Trigger decision logic
  - [x] Persistence

- [ ] **Agent persistence**
  - [ ] Agent instance reuse
  - [ ] Conversation history accumulation
  - [ ] Memory shared across generations

### Configuration and deployment

- [x] **Configuration file**
  - [x] Service configuration
  - [x] Trigger strategy configuration
  - [ ] Agent parameter configuration

- [ ] **Testing**
  - [x] Basic functional tests
  - [ ] Integration tests
  - [ ] Performance tests

- [ ] **Documentation**
  - [x] API documentation
  - [ ] Migration documentation
  - [ ] Usage guide

---

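## 🎯 Key Migration Challenges

### Challenge 1: OpenHands agent interaction

**Problem**: `ev2.py` uses OpenHands-specific APIs, so we need to understand how they work

**Solution**:
- Start with a simplified version (call the agent, wait for completion)
- Refine it step by step (if finer-grained control is needed)
- Use the implementation in `ev2.py` as a reference

### Challenge 2: Agent state persistence

**Problem**: should the agent's context be kept between calls?

**Solution**:
- **Short-term**: create a new agent on every call (as `ev2.py` does)
- **Long-term**: reuse the agent instance and accumulate conversation history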
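
The short-term and long-term options differ only in whether the agent object is cached between calls, so the switch can sit behind a single flag. The sketch below illustrates that idea with a generic provider. `AgentProvider`, `factory`, and `reuse` are hypothetical names, and the factory stands in for something like `_create_agent`.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class AgentProvider(Generic[T]):
    """Hands out agents: a fresh one per call, or one cached instance (illustrative)."""

    def __init__(self, factory: Callable[[], T], reuse: bool = False):
        self._factory = factory
        self._reuse = reuse
        self._cached: Optional[T] = None

    def get(self) -> T:
        if not self._reuse:
            # Short-term behavior: build a new agent every time.
            return self._factory()
        if self._cached is None:
            # Long-term behavior: build once, then keep returning the same instance.
            self._cached = self._factory()
        return self._cached

    def reset(self) -> None:
        """Drop the cached agent, for example after a failed run."""
        self._cached = None
```

Starting with `reuse=False` keeps the behavior identical to `ev2.py`, and flipping the flag later does not require touching the call sites.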

### Challenge 3: Error handling

**Problem**: the agent can fail, so how do we handle that gracefully?

**Solution**:
- Wrap agent calls in try/except
- Log detailed error information
- Return meaningful error messages
- Keep the service running (no crash)
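
The four points above can be combined in one small wrapper around the agent call. The sketch below is one way to do it. `safe_analyze` and the shape of the error dictionary are assumptions, chosen to resemble the success dictionary returned by `analyze_generation`.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger("ev2_service")


async def safe_analyze(
    analyze: Callable[[int], Awaitable[Dict[str, Any]]],
    generation: int,
) -> Dict[str, Any]:
    """Call the agent and turn any failure into a structured error response."""
    try:
        return await analyze(generation)
    except Exception as exc:
        # logger.exception records the full traceback for later debugging.
        logger.exception("Agent failed on generation %d", generation)
        return {
            "success": False,
            "generation": generation,
            "error_type": type(exc).__name__,
            "error": str(exc),
        }
```

In the service this would be called as `await safe_analyze(self.agent.analyze_generation, request.generation)`. Catching `Exception` (not `BaseException`) lets task cancellation and interpreter shutdown propagate normally.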
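
---

## 💡 Simplification Strategy

To finish the migration quickly, an **incremental strategy** is recommended:

### MVP version (minimum viable)

**Goal**: get the service working with the fewest possible changes

**Simplifications**:
1. **Agent run**: call it directly, without aiming for optimal performance
2. **Result extraction**: simple parsing (as it is now)
3. **State management**: a basic version is enough

**Time**: 1 day

### Enhanced version (production-ready)

**Goal**: improve performance and user experience

**Enhancements**:
1. **Agent persistence**: reuse the agent instance
2. **Better result parsing**: extract more information
3. **Error recovery**: robust error handling

**Time**: +1 day

### Full version (future extensions)

**Goal**: prepare for advanced features

**Extensions**:
1. **MetricUnit integration**
2. **Lifecycle management**
3. **Asynchronous meta-cognition**

**Time**: +1-2 weeks (as needed)

---
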
## 📊 Comparison: Before vs. After Migration

| Aspect | Before (wrapper) | After (standalone) |
|--------|------------------|--------------------|
| **Dependencies** | Depends on ev2.py | Fully standalone |
| **Architecture** | Two layers | Single layer |
| **State** | Scattered | Centralized |
| **Agent** | Created on every call | Can be persistent |
| **Extensibility** | Limited | High |
| **Maintainability** | Medium | High |
| **Performance** | Has overhead | Optimized |
| **Lines of code** | ~700 | ~800-1000 |

---

## ✅ Acceptance Criteria

The migration is complete when:

1. **Functional completeness**
   - [ ] All `ev2.py` functionality is preserved
   - [ ] The HTTP API works correctly
   - [ ] State persistence works correctly
   - [ ] The agent runs correctly

2. **Tests pass**
   - [ ] `test_ev2_service.py` passes in full
   - [ ] A simulated 25-generation evolution run succeeds
   - [ ] The agent generates EVAL_AGENTS.md and auxiliary_metrics.py

3. **Code quality**
   - [ ] No linter errors
   - [ ] Adequate comments
   - [ ] Clear structure

4. **Complete documentation**
   - [ ] API documentation updated
   - [ ] Usage guide updated
   - [ ] Clear migration notes

---

## 🚀 Getting Started

### Step 1 (today, 30 minutes)

1. Create the `ev2_service_standalone.py` skeleton
2. Copy the HTTP part of `ev2_service.py`
3. Create the `IntegratedEV2Agent` class skeleton

### Step 2 (tomorrow morning, 2-3 hours)

1. Migrate the core logic from `ev2.py` into `IntegratedEV2Agent`
2. Implement `_create_agent` and `_build_task_message`
3. Implement a simplified `analyze_generation`

### Step 3 (tomorrow afternoon, 2-3 hours)

1. Run the full tests
2. Fix issues
3. Update the documentation