
EV2 Migration Plan: From Wrapper to Standalone

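🎯 Goal

Migrate all of the logic in ev2.py into ev2_service_standalone.py, creating a standalone, complete evaluation service.

Design principles

  1. ✅ No dependency on ev2.py (fully standalone)
  2. ✅ Preserve all existing functionality
  3. ✅ Prepare for future extensions (MetricUnit, Lifecycle, etc.)
  4. ✅ Clearer architecture and state management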

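📊 Current architecture vs. target architecture

Current architecture (ev2_service.py)

ev2_service.py (HTTP wrapper)
    ↓ calls
ev2.py (Agent logic)
    ↓ uses
OpenHands Agent

Problems

  • Two layers of abstraction, with state scattered across them
  • Hard to integrate deeply
  • ev2.py is designed around the assumption of a single run

Target architecture (ev2_service_standalone.py)

ev2_service_standalone.py
├── FastAPI HTTP Server
├── ServiceState (persistent state management)
├── IntegratedEV2Agent (manages OpenHands directly)
│   ├── Agent instance (persistent)
│   ├── Memory management
│   └── Conversation history
└── MetricRegistry (optional, groundwork for the future)

Advantages

  • Single responsibility, with the logic in one place
  • The agent is persistent, so it does not have to be rebuilt on every call
  • Better state management
  • Paves the way for advanced features such as MetricUnit

🔧 Migration steps

Phase 1: Core Agent class (highest priority)

Goal: create an IntegratedEV2Agent class to replace calls to evolution_evaluation_agent()

1.1 Create the Agent management class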

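# ev2_service_standalone.py

import logging
from pathlib import Path
from typing import Any, Dict, List, Optional

# NOTE: verify these import paths against the OpenHands version used by ev2.py
from openhands.agent import Agent
from openhands.llm import LLM
from openhands.tools import TerminalTool, FileEditorTool, TaskTrackerTool
from openhands.tools.tool import Tool

logger = logging.getLogger(__name__)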

class IntegratedEV2Agent:
    """
    Integrated EV2 Agent (not a wrapper)
    
    Directly manages OpenHands agent lifecycle and state
    """
    
    def __init__(self, 
                 results_dir: str,
                 primary_evaluator_path: str,
                 config: Dict[str, Any]):
        
        self.results_dir = Path(results_dir).resolve()
        self.primary_evaluator_path = Path(primary_evaluator_path).resolve()
        self.config = config
        
        # Memory directory (persistent)
        self.memory_dir = self.results_dir / "eval_agent_memory"
        self.memory_dir.mkdir(parents=True, exist_ok=True)
        
        # Initialize OpenHands agent (persistent!)
        self.agent = self._create_agent()
        
        # Conversation history (accumulates across generations)
        self.conversation_history = []
        
        logger.info("✅ IntegratedEV2Agent initialized")
        logger.info(f"   Memory dir: {self.memory_dir}")
        logger.info(f"   Primary evaluator: {self.primary_evaluator_path}")
    
    def _create_agent(self) -> Agent:
        """
        Create OpenHands agent
        
        Migrated from ev2.py:evolution_evaluation_agent()
        """
        # LLM setup
        llm = LLM(model="anthropic/claude-sonnet-4-20250514")
        
        # System prompt
        prompt_path = Path(__file__).parent / "ev2_prompt.j2"
        if not prompt_path.exists():
            raise FileNotFoundError(f"Prompt template not found: {prompt_path}")
        
        # Create agent with tools
        agent = Agent(
            llm=llm,
            tools=[
                Tool(name=TerminalTool.name),
                Tool(name=FileEditorTool.name),
                Tool(name=TaskTrackerTool.name),
            ],
            system_prompt_filename=str(prompt_path),
        )
        
        return agent
    
    async def analyze_generation(self, generation: int) -> Dict[str, Any]:
        """
        Analyze a generation
        
        This is the main entry point, replacing evolution_evaluation_agent()
        """
        logger.info(f"🧠 Analyzing generation {generation}...")
        
        # Build task message
        task = self._build_task_message(generation)
        
        # Run agent
        result = await self._run_agent(task)
        
        # Extract results
        insights = self._extract_insights()
        metrics = self._extract_metrics()
        
        return {
            "success": True,
            "insights": insights,
            "auxiliary_metrics": metrics,
            "generation": generation
        }
    
    def _build_task_message(self, generation: int) -> str:
        """
        Build task message for agent
        
        Migrated from ev2.py:_build_default_task()
        """
        # Read primary evaluator code
        primary_code = ""
        if self.primary_evaluator_path.exists():
            primary_code = self.primary_evaluator_path.read_text()
        
        # Check for generation directory
        gen_dir = self._find_generation_dir(generation)
        
        task = f"""# Evolution Evaluation Task - Generation {generation}

## Your Mission

You are analyzing the evolution process for a code optimization task. Your workspace is:
`{self.memory_dir}`

## Current Generation

Generation: {generation}
Results directory: {gen_dir if gen_dir else 'Not found'}

## Primary Evaluator (Fixed, DO NOT MODIFY)

The ground truth evaluation is defined in:
`{self.primary_evaluator_path}`

**CRITICAL**: You MUST NOT modify this file. Read it to understand the primary objective.

## Your Tasks

1. **READ** the primary evaluator to understand the ground truth objective
2. **ANALYZE** the current generation's performance and strategy
3. **CREATE** auxiliary evaluation metrics that provide insights beyond the primary score
4. **UPDATE** EVAL_AGENTS.md with your findings and recommendations

## Workspace Structure

Your workspace (`{self.memory_dir}`) should contain:
- `EVAL_AGENTS.md`: Your accumulated insights and analysis
- `auxiliary_metrics.py`: Python code for auxiliary metrics
- Any other analysis files you create

## Constraints

- Primary metric is FIXED - you cannot change it
- Auxiliary metrics should complement, not replace, the primary metric
- Focus on actionable insights that can guide the evolution process

## Output

Update EVAL_AGENTS.md with:
- Analysis of generation {generation}
- Auxiliary metric definitions and values
- Insights and recommendations for future generations

Begin your analysis!
"""
        
        return task
    
    def _find_generation_dir(self, generation: int) -> Optional[Path]:
        """Find the generation directory"""
        # Try common patterns
        patterns = [
            self.results_dir / f"gen_{generation}",
            self.results_dir.parent / f"gen_{generation}",
        ]
        
        for pattern in patterns:
            if pattern.exists():
                return pattern
        
        return None
    
    async def _run_agent(self, task: str) -> Dict[str, Any]:
        """
        Run the agent with a task
        
        This is where we'd integrate async execution if needed
        """
        # For now, call synchronously (OpenHands is sync)
        # Could wrap in asyncio.to_thread() for true async
        
        # NOTE: This is simplified - actual OpenHands integration
        # would involve message passing, observation handling, etc.
        # We'll keep it simple for migration
        
        logger.info(f"📝 Task length: {len(task)} chars")
        
        # In ev2.py, the agent is run via Agent's API
        # We'll need to properly integrate this
        
        return {"status": "completed"}
    
    def _extract_insights(self) -> List[str]:
        """Extract insights from EVAL_AGENTS.md"""
        eval_agents_md = self.memory_dir / "EVAL_AGENTS.md"
        
        if not eval_agents_md.exists():
            return []
        
        insights = []
        content = eval_agents_md.read_text()
        
        # Simple extraction - look for bullet points
        for line in content.split('\n'):
            if line.strip().startswith('*') or line.strip().startswith('-'):
                insights.append(line.strip())
        
        return insights[-10:]  # Last 10 insights
    
    def _extract_metrics(self) -> Dict[str, Any]:
        """Extract auxiliary metrics"""
        auxiliary_py = self.memory_dir / "auxiliary_metrics.py"
        
        if not auxiliary_py.exists():
            return {}
        
        # Could dynamically import and execute
        # For now, just check existence
        return {
            "auxiliary_metrics_file_exists": True,
            "file_path": str(auxiliary_py)
        }
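
The comment in _extract_metrics notes that the metrics file "could" be imported and executed dynamically. A minimal sketch of that step, using only the standard library, might look like the following. The `metric_` naming convention and the `load_metric_functions` helper are assumptions made for illustration; the plan does not define how auxiliary metrics are named or called.

```python
import importlib.util
from pathlib import Path
from typing import Any, Callable, Dict


def load_metric_functions(path: Path, prefix: str = "metric_") -> Dict[str, Callable[..., Any]]:
    """Import a Python file by path and collect its metric functions.

    ASSUMPTION: metric functions are top-level callables whose names start
    with `prefix`. This convention is hypothetical, not part of the plan.
    """
    if not path.exists():
        return {}
    spec = importlib.util.spec_from_file_location("auxiliary_metrics", path)
    if spec is None or spec.loader is None:
        return {}
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the agent-written file
    return {
        name: obj
        for name, obj in vars(module).items()
        if name.startswith(prefix) and callable(obj)
    }
```

Because exec_module runs the file with the service's own privileges, a production version would probably want to run agent-written metrics in a subprocess or sandbox instead.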

1.2 Integrate into the Service

class EV2ServiceStandalone:
    """
    Standalone EV2 Service (no dependency on ev2.py)
    """
    
    def __init__(self, config: ServiceConfig):
        self.config = config
        self.state = ServiceState(config)
        
        # Create integrated agent (PERSISTENT)
        self.agent = IntegratedEV2Agent(
            results_dir=config.results_dir,
            primary_evaluator_path=config.primary_evaluator_path,
            config=config.__dict__
        )
    
    async def handle_generation_notification(self, request: GenerationCompleteRequest):
        """Handle generation notification"""
        # Decision logic (same as before)
        should_trigger, reason = self.state.should_trigger_agent(...)
        
        if should_trigger:
            # Call integrated agent (not ev2.py!)
            result = await self.agent.analyze_generation(request.generation)
            return result
        
        return {"status": "skipped"}
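
The service above relies on ServiceState.should_trigger_agent, which this plan does not spell out. The sketch below shows one way the trigger decision and its persistence could work. The class name `TriggerState`, the "every N generations" policy, and the JSON state file are all assumptions for illustration, not the existing ServiceState implementation.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Tuple


@dataclass
class TriggerState:
    """Hypothetical stand-in for the trigger logic inside ServiceState."""

    state_file: Path
    trigger_interval: int = 5  # assumed policy: analyze every N generations
    history: List[int] = field(default_factory=list)

    def should_trigger_agent(self, generation: int) -> Tuple[bool, str]:
        """Record the generation, persist it, and decide whether to run the agent."""
        self.history.append(generation)
        self.save()
        if generation > 0 and generation % self.trigger_interval == 0:
            return True, f"generation {generation} is a multiple of {self.trigger_interval}"
        return False, f"waiting for the next multiple of {self.trigger_interval}"

    def save(self) -> None:
        """Write the generation history to disk as JSON."""
        self.state_file.write_text(json.dumps({"history": self.history}))

    @classmethod
    def load(cls, state_file: Path, trigger_interval: int = 5) -> "TriggerState":
        """Restore the history from disk if a state file exists."""
        history: List[int] = []
        if state_file.exists():
            history = json.loads(state_file.read_text()).get("history", [])
        return cls(state_file=state_file, trigger_interval=trigger_interval, history=history)
```

Returning a (decision, reason) pair matches the `should_trigger, reason = ...` call in the service code and makes skipped generations easy to explain in logs.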

Phase 2: Complete the Agent integration (medium priority)

Goal: fully implement the interaction logic for the OpenHands agent

2.1 Message handling

Migrate the agent run logic from ev2.py:

async def _run_agent(self, task: str) -> Dict[str, Any]:
    """
    Run agent with proper message handling
    
    Migrated from ev2.py (simplified for now)
    """
    # This is where ev2.py uses Agent API
    # We need to properly integrate:
    # 1. Send task as message
    # 2. Handle agent observations
    # 3. Collect agent actions
    # 4. Wait for completion
    
    # For MVP, we can use the same approach as ev2.py
    # but with the persistent agent instance
    
    pass  # TODO: Implement based on OpenHands API
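
Until the real OpenHands calls are wired in, the "wait for completion" step can be sketched with asyncio.to_thread (Python 3.9+), as the earlier comment in _run_agent suggests. In the sketch below, `run_fn` is a placeholder for whatever blocking entry point ev2.py uses; it is not an OpenHands API, and `fake_agent_run` and the timeout value exist only for demonstration.

```python
import asyncio
import time
from typing import Any, Callable, Dict


async def run_blocking_agent(run_fn: Callable[[str], Dict[str, Any]],
                             task: str,
                             timeout_s: float = 600.0) -> Dict[str, Any]:
    """Run a blocking agent call in a worker thread, with a timeout.

    ASSUMPTION: `run_fn` stands in for the synchronous agent run used by
    ev2.py; its real name and signature are not given in this plan.
    """
    try:
        return await asyncio.wait_for(asyncio.to_thread(run_fn, task), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The worker thread is not killed; it keeps running in the background.
        return {"status": "timeout", "timeout_s": timeout_s}


def fake_agent_run(task: str) -> Dict[str, Any]:
    """Demonstration-only replacement for the real blocking call."""
    time.sleep(0.05)
    return {"status": "completed", "task_chars": len(task)}
```

Running the call in a thread keeps the FastAPI event loop responsive while the agent works, which matters if status requests arrive during a long analysis.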

2.2 Workspace management

def _setup_workspace(self):
    """Setup agent workspace"""
    # Ensure directories exist
    self.memory_dir.mkdir(parents=True, exist_ok=True)
    
    # Initialize EVAL_AGENTS.md if needed
    eval_md = self.memory_dir / "EVAL_AGENTS.md"
    if not eval_md.exists():
        eval_md.write_text("""# Evaluation Agent Memory

This document tracks insights and metrics across generations.
""")

Phase 3: State management enhancements (low priority)

Goal: prepare for future features such as MetricUnit

3.1 MetricRegistry (skeleton)

class MetricRegistry:
    """
    Registry for managing metrics
    
    Prepared for future MetricUnit integration
    """
    
    def __init__(self, memory_dir: Path):
        self.memory_dir = memory_dir
        self.metrics = {}  # id -> metadata
    
    def register_metric(self, metric_id: str, metadata: Dict[str, Any]):
        """Register a metric"""
        self.metrics[metric_id] = metadata
    
    def list_metrics(self) -> List[Dict[str, Any]]:
        """List all metrics"""
        return list(self.metrics.values())
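
The skeleton above accepts memory_dir but keeps metrics only in memory. If the registry should survive service restarts, one possible extension is to mirror it to a JSON file, as sketched here. The class name `PersistentMetricRegistry` and the file name `metric_registry.json` are assumptions, and metadata is assumed to be JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


class PersistentMetricRegistry:
    """Illustrative registry that reloads its contents from disk on startup."""

    def __init__(self, memory_dir: Path, filename: str = "metric_registry.json"):
        self.path = Path(memory_dir) / filename  # file name is an assumption
        self.metrics: Dict[str, Dict[str, Any]] = {}
        if self.path.exists():
            self.metrics = json.loads(self.path.read_text())

    def register_metric(self, metric_id: str, metadata: Dict[str, Any]) -> None:
        """Register a metric and write the whole registry back to disk."""
        self.metrics[metric_id] = metadata
        self.path.write_text(json.dumps(self.metrics, indent=2))

    def list_metrics(self) -> List[Dict[str, Any]]:
        """List the metadata of all registered metrics."""
        return list(self.metrics.values())
```

Rewriting the full file on every registration is simple and adequate for a handful of metrics; a larger registry might call for a different storage format.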

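📁 File structure

eval_agent/
├── ev2_service_standalone.py    # NEW: complete standalone service
├── ev2_service.py               # OLD: kept for reference
├── ev2.py                       # OLD: kept as a standalone tool
├── ev2_prompt.j2                # SHARED: system prompt
├── ev2_service_config.yaml      # SHARED: configuration file
└── test_ev2_service.py          # SHARED: test script

After migration

  • ev2_service_standalone.py: used in production
  • ev2.py: kept as a standalone command-line tool (optional)
  • ev2_service.py: deleted, or renamed to ev2_service_wrapper.py (archived)

🚀 Implementation timeline

Day 1: Core migration (4-6 hours)

Morning

  • Create the basic structure of ev2_service_standalone.py
  • Implement IntegratedEV2Agent.__init__ and _create_agent
  • Implement _build_task_message

Afternoon

  • Implement the analyze_generation method
  • Integrate into the FastAPI service
  • Fix import path issues

Acceptance: the service starts, receives notifications, and can call the agent (even in simplified form)


Day 2: Refinement and testing (4-6 hours)

Morning

  • Complete the _run_agent method (if needed)
  • Implement result extraction (_extract_insights, _extract_metrics)
  • Add error handling

Afternoon

  • Full test run (using test_ev2_service.py)
  • Fix any issues found
  • Performance tuning

Acceptance: one full evolution simulation runs end to end and the agent produces correct output


Day 3: Cleanup and documentation (2-4 hours)

Morning

  • Code cleanup and refactoring
  • Add detailed comments
  • Update the configuration file

Afternoon

  • Update the documentation
  • Create usage examples
  • Prepare for integration into ShinkaEvolve

Acceptance: high code quality, complete documentation, ready for production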


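📋 Migration checklist

Migrated from ev2.py

  • Agent creation logic

    • LLM configuration
    • Tools configuration
    • System prompt loading
    • Agent initialization parameters
  • Task construction logic

    • Primary evaluator path handling
    • Generation information
    • Additional context (if needed)
  • Agent run logic

    • Sending messages
    • Handling observations
    • Waiting for results
  • Result extraction logic

    • Parsing EVAL_AGENTS.md
    • Detecting auxiliary_metrics.py
    • More sophisticated result parsing (optional)
  • Workspace management

    • Memory directory creation
    • Initial file creation
    • Cleanup logic (optional)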
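
For the optional "more sophisticated result parsing" item in the checklist above, one possible direction is to group bullet points by the heading they appear under instead of collecting every line that starts with `*` or `-`. The sketch below assumes EVAL_AGENTS.md uses Markdown headings such as `## Generation 3`; that layout and the `bullets_by_section` helper are hypothetical.

```python
import re
from typing import Dict, List

_HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")
_BULLET = re.compile(r"^\s*[-*+]\s+(.*\S)\s*$")


def bullets_by_section(markdown: str) -> Dict[str, List[str]]:
    """Group Markdown bullet items under the most recent heading.

    Requiring whitespace after the marker skips lines such as `---`
    (horizontal rules) and `**bold**`, which a plain startswith check
    would treat as bullets.
    """
    sections: Dict[str, List[str]] = {}
    current = ""  # bullets before any heading go under the empty key
    for line in markdown.splitlines():
        heading = _HEADING.match(line)
        if heading:
            current = heading.group(1)
            continue
        bullet = _BULLET.match(line)
        if bullet:
            sections.setdefault(current, []).append(bullet.group(1))
    return sections
```

This sketch does not skip fenced code blocks, so a `#` comment line inside a code sample in EVAL_AGENTS.md would be read as a heading.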

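New features

  • HTTP API

    • Generation notification endpoint
    • Status endpoint
    • Manual trigger endpoint
  • State management

    • Generation history
    • Trigger decision logic
    • Persistence
  • Agent persistence

    • Agent instance reuse
    • Conversation history accumulation
    • Memory shared across generations

Configuration and deployment

  • Configuration files

    • Service configuration
    • Trigger strategy configuration
    • Agent parameter configuration
  • Testing

    • Basic functional tests
    • Integration tests
    • Performance tests
  • Documentation

    • API documentation
    • Migration documentation
    • Usage guide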

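🎯 Key migration challenges

Challenge 1: OpenHands Agent interaction

Problem: ev2.py uses OpenHands-specific APIs, and we need to understand how they work

Solution

  • Start with a simplified version (call the agent, wait for completion)
  • Refine step by step (if finer-grained control is needed)
  • Use the ev2.py implementation as a reference

Challenge 2: Agent state persistence

Problem: does the agent's context need to be kept between calls?

Solution

  • Short-term: create a new agent on every call (as ev2.py does)
  • Long-term: reuse the agent instance and accumulate conversation history

Challenge 3: Error handling

Problem: the agent can fail; how do we handle that gracefully?

Solution

  • Wrap agent calls in try/except
  • Log detailed error information
  • Return meaningful error messages
  • Keep the service running (no crash)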
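
The four error-handling points above can be sketched as a small wrapper around the analysis call. `safe_analyze` and the shape of the error payload are assumptions for illustration; the `analyze` argument stands for a coroutine function such as IntegratedEV2Agent.analyze_generation.

```python
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger("ev2_service")


async def safe_analyze(analyze: Callable[[int], Awaitable[Dict[str, Any]]],
                       generation: int) -> Dict[str, Any]:
    """Run one analysis and turn any failure into a structured result."""
    try:
        return await analyze(generation)
    except Exception as exc:  # broad on purpose, so the service keeps running
        # logger.exception records the full traceback for later debugging.
        logger.exception("Agent failed on generation %d", generation)
        return {
            "success": False,
            "generation": generation,
            "error_type": type(exc).__name__,
            "error": str(exc),
        }
```

Catching Exception rather than BaseException leaves task cancellation and interpreter shutdown untouched, so the service can still be stopped cleanly.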

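💡 Simplification strategy

To finish the migration quickly, an incremental strategy is recommended.

MVP version (minimum viable)

Goal: get the service working with the fewest changes

Simplifications

  1. Agent run: call it directly, without chasing optimal performance
  2. Result extraction: simple parsing (as it is now)
  3. State management: a basic version is enough

Time: 1 day

Enhanced version (production-ready)

Goal: improve performance and user experience

Enhancements

  1. Agent persistence: reuse the agent instance
  2. Better result parsing: extract more information
  3. Error recovery: robust error handling

Time: +1 day

Full version (future extensions)

Goal: prepare for advanced features

Extension points

  1. MetricUnit integration
  2. Lifecycle management
  3. Asynchronous meta-cognition

Time: +1-2 weeks (as needed)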


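📊 Comparison: before vs. after migration

| Aspect | Before (wrapper) | After (standalone) |
|---|---|---|
| Dependencies | Depends on ev2.py | Fully standalone |
| Architecture | Two layers | Single layer |
| State | Scattered | Centralized |
| Agent | Created on every call | Can be persistent |
| Extensibility | Limited | |
| Maintainability | Moderate | |
| Performance | Has overhead | Optimized |
| Lines of code | ~700 | ~800-1000 |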

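✅ Acceptance criteria

The migration is complete when:

  1. Functional completeness

    • All ev2.py functionality is preserved
    • The HTTP API works correctly
    • State persistence works correctly
    • The agent runs correctly
  2. Tests pass

    • test_ev2_service.py passes in full
    • A simulated 25-generation evolution run succeeds
    • The agent generates EVAL_AGENTS.md and auxiliary_metrics.py
  3. Code quality

    • No linter errors
    • Adequate comments
    • Clear structure
  4. Complete documentation

    • API documentation updated
    • Usage guide updated
    • Migration notes are clear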

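🚀 Get started now

Step 1 (today, 30 minutes)

  1. Create the ev2_service_standalone.py skeleton
  2. Copy the HTTP part of ev2_service.py
  3. Create the IntegratedEV2Agent class skeleton

Step 2 (tomorrow morning, 2-3 hours)

  1. Migrate the core logic from ev2.py into IntegratedEV2Agent
  2. Implement _create_agent and _build_task_message
  3. Write a simplified analyze_generation

Step 3 (tomorrow afternoon, 2-3 hours)

  1. Full test run
  2. Fix issues
  3. Update the documentation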

Ready to start? I can help you create the skeleton for ev2_service_standalone.py! 🚀