# EV2 Migration Plan: From Wrapper to Standalone
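## 🎯 Goal
Migrate all of the logic in `ev2.py` into `ev2_service_standalone.py`, creating a standalone, complete evaluation service.

**Design principles**:
1. ✅ No dependency on `ev2.py` (fully standalone)
2. ✅ Preserve all existing functionality
3. ✅ Prepare for future extensions (MetricUnit, Lifecycle, etc.)
4. ✅ Clearer architecture and state management
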
---
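## 📊 Current Architecture vs. Target Architecture
### Current architecture (ev2_service.py)
```
ev2_service.py (HTTP wrapper)
    ↓ calls
ev2.py (Agent logic)
    ↓ uses
OpenHands Agent
```
**Problems**:
- Two layers of abstraction, with state scattered across them
- Hard to integrate deeply
- `ev2.py` is designed around the assumption of a single run

### Target architecture (ev2_service_standalone.py)
```
ev2_service_standalone.py
├── FastAPI HTTP Server
├── ServiceState (persistent state management)
├── IntegratedEV2Agent (manages OpenHands directly)
│   ├── Agent instance (persistent)
│   ├── Memory management
│   └── Conversation history
└── MetricRegistry (optional, groundwork for future features)
```
**Advantages**:
- Single responsibility, with the logic in one place
- Persistent agent, no need to rebuild it on every call
- Better state management
- Paves the way for advanced features such as MetricUnit

---
## 🔧 Migration Steps
### Phase 1: Core Agent class (highest priority)
**Goal**: Create an `IntegratedEV2Agent` class to replace the calls to `evolution_evaluation_agent()`.
#### 1.1 Create the Agent management class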
```python
# ev2_service_standalone.py
import logging
from pathlib import Path
from typing import Any, Dict, List, Optional

# NOTE: OpenHands import paths are kept as written in this draft; verify them
# against the installed OpenHands version.
from openhands.agent import Agent
from openhands.llm import LLM
from openhands.tools import TerminalTool, FileEditorTool, TaskTrackerTool
from openhands.tools.tool import Tool

logger = logging.getLogger(__name__)


class IntegratedEV2Agent:
    """
    Integrated EV2 Agent (not a wrapper)

    Directly manages OpenHands agent lifecycle and state
    """

    def __init__(self,
                 results_dir: str,
                 primary_evaluator_path: str,
                 config: Dict[str, Any]):
        self.results_dir = Path(results_dir).resolve()
        self.primary_evaluator_path = Path(primary_evaluator_path).resolve()
        self.config = config

        # Memory directory (persistent)
        self.memory_dir = self.results_dir / "eval_agent_memory"
        self.memory_dir.mkdir(parents=True, exist_ok=True)

        # Initialize OpenHands agent (persistent!)
        self.agent = self._create_agent()

        # Conversation history (accumulates across generations)
        self.conversation_history = []

        logger.info("✅ IntegratedEV2Agent initialized")
        logger.info(f"   Memory dir: {self.memory_dir}")
        logger.info(f"   Primary evaluator: {self.primary_evaluator_path}")

    def _create_agent(self) -> Agent:
        """
        Create OpenHands agent

        Migrated from ev2.py:evolution_evaluation_agent()
        """
        # LLM setup
        llm = LLM(model="anthropic/claude-sonnet-4-20250514")

        # System prompt
        prompt_path = Path(__file__).parent / "ev2_prompt.j2"
        if not prompt_path.exists():
            raise FileNotFoundError(f"Prompt template not found: {prompt_path}")

        # Create agent with tools
        agent = Agent(
            llm=llm,
            tools=[
                Tool(name=TerminalTool.name),
                Tool(name=FileEditorTool.name),
                Tool(name=TaskTrackerTool.name),
            ],
            system_prompt_filename=str(prompt_path),
        )
        return agent

    async def analyze_generation(self, generation: int) -> Dict[str, Any]:
        """
        Analyze a generation

        This is the main entry point, replacing evolution_evaluation_agent()
        """
        logger.info(f"🧠 Analyzing generation {generation}...")

        # Build task message
        task = self._build_task_message(generation)

        # Run agent (the return value is a placeholder until Phase 2)
        result = await self._run_agent(task)
        logger.debug(f"Agent run result: {result}")

        # Extract results from the files the agent wrote to its workspace
        insights = self._extract_insights()
        metrics = self._extract_metrics()

        return {
            "success": True,
            "insights": insights,
            "auxiliary_metrics": metrics,
            "generation": generation
        }

    def _build_task_message(self, generation: int) -> str:
        """
        Build task message for agent

        Migrated from ev2.py:_build_default_task()
        """
        # Read primary evaluator code
        # (not yet embedded in the task; the agent is told to read the file)
        primary_code = ""
        if self.primary_evaluator_path.exists():
            primary_code = self.primary_evaluator_path.read_text()

        # Check for generation directory
        gen_dir = self._find_generation_dir(generation)

        task = f"""# Evolution Evaluation Task - Generation {generation}

## Your Mission
You are analyzing the evolution process for a code optimization task. Your workspace is:
`{self.memory_dir}`

## Current Generation
Generation: {generation}
Results directory: {gen_dir if gen_dir else 'Not found'}

## Primary Evaluator (Fixed, DO NOT MODIFY)
The ground truth evaluation is defined in:
`{self.primary_evaluator_path}`
**CRITICAL**: You MUST NOT modify this file. Read it to understand the primary objective.

## Your Tasks
1. **READ** the primary evaluator to understand the ground truth objective
2. **ANALYZE** the current generation's performance and strategy
3. **CREATE** auxiliary evaluation metrics that provide insights beyond the primary score
4. **UPDATE** EVAL_AGENTS.md with your findings and recommendations

## Workspace Structure
Your workspace (`{self.memory_dir}`) should contain:
- `EVAL_AGENTS.md`: Your accumulated insights and analysis
- `auxiliary_metrics.py`: Python code for auxiliary metrics
- Any other analysis files you create

## Constraints
- Primary metric is FIXED - you cannot change it
- Auxiliary metrics should complement, not replace, the primary metric
- Focus on actionable insights that can guide the evolution process

## Output
Update EVAL_AGENTS.md with:
- Analysis of generation {generation}
- Auxiliary metric definitions and values
- Insights and recommendations for future generations

Begin your analysis!
"""
        return task

    def _find_generation_dir(self, generation: int) -> Optional[Path]:
        """Find the generation directory"""
        # Try common locations
        candidates = [
            self.results_dir / f"gen_{generation}",
            self.results_dir.parent / f"gen_{generation}",
        ]
        for candidate in candidates:
            if candidate.exists():
                return candidate
        return None

    async def _run_agent(self, task: str) -> Dict[str, Any]:
        """
        Run the agent with a task

        This is where we'd integrate async execution if needed
        """
        # For now, call synchronously (OpenHands is sync)
        # Could wrap in asyncio.to_thread() for true async

        # NOTE: This is simplified - actual OpenHands integration
        # would involve message passing, observation handling, etc.
        # We'll keep it simple for migration
        logger.info(f"📝 Task length: {len(task)} chars")

        # In ev2.py, the agent is run via Agent's API
        # We'll need to properly integrate this
        return {"status": "completed"}

    def _extract_insights(self) -> List[str]:
        """Extract insights from EVAL_AGENTS.md"""
        eval_agents_md = self.memory_dir / "EVAL_AGENTS.md"
        if not eval_agents_md.exists():
            return []

        insights = []
        content = eval_agents_md.read_text()

        # Simple extraction - look for bullet points. The trailing space keeps
        # `---` rules and `**bold**` lines from being counted as bullets.
        for line in content.split('\n'):
            stripped = line.strip()
            if stripped.startswith(('* ', '- ')):
                insights.append(stripped)

        return insights[-10:]  # Last 10 insights

    def _extract_metrics(self) -> Dict[str, Any]:
        """Extract auxiliary metrics"""
        auxiliary_py = self.memory_dir / "auxiliary_metrics.py"
        if not auxiliary_py.exists():
            return {}

        # Could dynamically import and execute
        # For now, just check existence
        return {
            "auxiliary_metrics_file_exists": True,
            "file_path": str(auxiliary_py)
        }
```
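
The comment in `_extract_metrics` notes that the metrics file could be imported dynamically. A minimal sketch of that idea follows, using only the standard library. The function name `load_auxiliary_metrics` and the assumption that `auxiliary_metrics.py` exposes its metrics as public top-level functions are illustrative and do not come from `ev2.py`.

```python
import importlib.util
import inspect
from pathlib import Path
from typing import Any, Callable, Dict


def load_auxiliary_metrics(path: Path) -> Dict[str, Callable[..., Any]]:
    """Return the public functions defined in a metrics file (assumed layout)."""
    if not path.exists():
        return {}

    spec = importlib.util.spec_from_file_location("auxiliary_metrics", path)
    if spec is None or spec.loader is None:
        return {}

    module = importlib.util.module_from_spec(spec)
    # This executes agent-written code in-process; consider a subprocess or
    # sandbox if the file cannot be trusted.
    spec.loader.exec_module(module)

    # Keep only functions defined in the file itself, skipping private helpers
    # and anything the file merely imported.
    return {
        name: obj
        for name, obj in vars(module).items()
        if inspect.isfunction(obj)
        and not name.startswith("_")
        and obj.__module__ == module.__name__
    }
```

With something like this, `_extract_metrics` could report the names of the available metrics instead of only checking that the file exists. Calling the loaded functions would still need an agreed signature, which this plan does not define.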
#### 1.2 Integrate into the Service
```python
class EV2ServiceStandalone:
    """
    Standalone EV2 Service (no dependency on ev2.py)
    """

    def __init__(self, config: ServiceConfig):
        self.config = config
        self.state = ServiceState(config)

        # Create integrated agent (PERSISTENT)
        self.agent = IntegratedEV2Agent(
            results_dir=config.results_dir,
            primary_evaluator_path=config.primary_evaluator_path,
            config=config.__dict__
        )

    async def handle_generation_notification(self, request: GenerationCompleteRequest):
        """Handle generation notification"""
        # Decision logic (same as before)
        should_trigger, reason = self.state.should_trigger_agent(...)

        if should_trigger:
            # Call integrated agent (not ev2.py!)
            result = await self.agent.analyze_generation(request.generation)
            return result

        return {"status": "skipped"}
```
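
The service above relies on `should_trigger_agent` returning a `(bool, reason)` pair, but this plan does not spell out the decision logic. As an illustration only, the sketch below shows one possible policy: run the agent every `interval` generations and never twice for the same generation. The class name `IntervalTriggerPolicy`, the `interval` parameter, and the `record` method are hypothetical and are not taken from `ev2_service.py`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IntervalTriggerPolicy:
    """Hypothetical policy: trigger the agent every `interval` generations."""

    interval: int = 5
    analyzed: List[int] = field(default_factory=list)

    def should_trigger_agent(self, generation: int) -> Tuple[bool, str]:
        # Returns the same (decision, reason) shape the service expects.
        if generation in self.analyzed:
            return False, f"generation {generation} already analyzed"
        if generation % self.interval != 0:
            return False, f"generation {generation} is not a multiple of {self.interval}"
        return True, f"interval of {self.interval} generations reached"

    def record(self, generation: int) -> None:
        # Call this after a successful analysis so repeats are skipped.
        self.analyzed.append(generation)
```

Returning the reason alongside the decision makes it straightforward to include it in the `skipped` response or in the logs.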
---
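### Phase 2: Complete the Agent integration (medium priority)
**Goal**: Fully implement the interaction logic with the OpenHands agent.
#### 2.1 Message handling
Migrate the agent run logic from `ev2.py`: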
```python
async def _run_agent(self, task: str) -> Dict[str, Any]:
    """
    Run agent with proper message handling

    Migrated from ev2.py (simplified for now)
    """
    # This is where ev2.py uses Agent API
    # We need to properly integrate:
    # 1. Send task as message
    # 2. Handle agent observations
    # 3. Collect agent actions
    # 4. Wait for completion

    # For MVP, we can use the same approach as ev2.py
    # but with the persistent agent instance
    pass  # TODO: Implement based on OpenHands API
```
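
Phase 1 mentions that the synchronous agent run could be wrapped in `asyncio.to_thread()`. A minimal sketch of that wrapping is shown below. `run_agent_blocking` is a hypothetical stand-in for whatever synchronous call `ev2.py` makes and is not an OpenHands API; the `timeout_s` parameter is likewise an assumption.

```python
import asyncio
import time
from typing import Any, Dict


def run_agent_blocking(task: str) -> Dict[str, Any]:
    """Stand-in for a synchronous agent run (hypothetical, not the OpenHands API)."""
    time.sleep(0.05)  # simulates the blocking work
    return {"status": "completed", "task_chars": len(task)}


async def run_agent_async(task: str, timeout_s: float = 600.0) -> Dict[str, Any]:
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker thread,
    # so the FastAPI event loop can keep serving other requests meanwhile.
    # On timeout the awaiting side gets asyncio.TimeoutError, but the worker
    # thread itself is not interrupted and keeps running until it returns.
    return await asyncio.wait_for(
        asyncio.to_thread(run_agent_blocking, task),
        timeout=timeout_s,
    )
```

Because the thread cannot be cancelled, a real integration would probably also need a way to stop the agent itself when a run exceeds its budget.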
#### 2.2 Workspace management
```python
def _setup_workspace(self):
    """Setup agent workspace"""
    # Ensure directories exist
    self.memory_dir.mkdir(parents=True, exist_ok=True)

    # Initialize EVAL_AGENTS.md if needed
    eval_md = self.memory_dir / "EVAL_AGENTS.md"
    if not eval_md.exists():
        eval_md.write_text("""# Evaluation Agent Memory

This document tracks insights and metrics across generations.
""")
```
---
### Phase 3: State management enhancements (low priority)
**Goal**: Prepare for future features such as MetricUnit.
#### 3.1 MetricRegistry (skeleton)
```python
class MetricRegistry:
    """
    Registry for managing metrics

    Prepared for future MetricUnit integration
    """

    def __init__(self, memory_dir: Path):
        self.memory_dir = memory_dir
        self.metrics = {}  # id -> metadata

    def register_metric(self, metric_id: str, metadata: Dict[str, Any]):
        """Register a metric"""
        self.metrics[metric_id] = metadata

    def list_metrics(self) -> List[Dict[str, Any]]:
        """List all metrics"""
        return list(self.metrics.values())
```
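
The skeleton keeps metrics only in memory, even though it already receives `memory_dir`. If the registry should survive service restarts, one option is to store it as JSON in that directory. The sketch below assumes this; the file name `metric_registry.json` and the helper names `save_registry` and `load_registry` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict

REGISTRY_FILENAME = "metric_registry.json"  # hypothetical file name


def save_registry(metrics: Dict[str, Dict[str, Any]], memory_dir: Path) -> Path:
    """Write the id -> metadata mapping to disk."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    target = memory_dir / REGISTRY_FILENAME
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(metrics, indent=2, sort_keys=True), encoding="utf-8")
    # Renaming over the target avoids leaving a half-written registry behind
    # if the process dies mid-write.
    tmp.replace(target)
    return target


def load_registry(memory_dir: Path) -> Dict[str, Dict[str, Any]]:
    """Read the mapping back, or return an empty registry if none was saved."""
    target = memory_dir / REGISTRY_FILENAME
    if not target.exists():
        return {}
    return json.loads(target.read_text(encoding="utf-8"))
```

`MetricRegistry.__init__` could then call `load_registry(memory_dir)` and `register_metric` could call `save_registry` after each update. This only works while the metadata stays JSON-serializable.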
---
## 📁 File Structure
```
eval_agent/
├── ev2_service_standalone.py   # NEW: complete standalone service
├── ev2_service.py              # OLD: kept for reference
├── ev2.py                      # OLD: kept as a standalone tool
├── ev2_prompt.j2               # SHARED: system prompt
├── ev2_service_config.yaml     # SHARED: configuration file
└── test_ev2_service.py         # SHARED: test script
```
**After the migration**:
- `ev2_service_standalone.py`: production use
- `ev2.py`: kept as a standalone command-line tool (optional)
- `ev2_service.py`: delete, or rename to `ev2_service_wrapper.py` (archived)

---
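## 🚀 Implementation Timeline
### Day 1: Core migration (4-6 hours)
**Morning**:
- [ ] Create the basic structure of `ev2_service_standalone.py`
- [ ] Implement `IntegratedEV2Agent.__init__` and `_create_agent`
- [ ] Implement `_build_task_message`

**Afternoon**:
- [ ] Implement the `analyze_generation` method
- [ ] Integrate into the FastAPI service
- [ ] Fix import path issues

**Acceptance**: the service starts, receives notifications, and can invoke the agent (even in simplified form).

---
### Day 2: Refinement and testing (4-6 hours)
**Morning**:
- [ ] Complete the `_run_agent` method (if needed)
- [ ] Implement result extraction (`_extract_insights`, `_extract_metrics`)
- [ ] Add error handling

**Afternoon**:
- [ ] Full test run (using `test_ev2_service.py`)
- [ ] Fix the issues found
- [ ] Performance optimization

**Acceptance**: a complete evolution simulation runs end to end, and the agent produces the expected output.

---
### Day 3: Cleanup and documentation (2-4 hours)
**Morning**:
- [ ] Code cleanup and refactoring
- [ ] Add detailed comments
- [ ] Update the configuration file

**Afternoon**:
- [ ] Update the documentation
- [ ] Create usage examples
- [ ] Prepare for integration into ShinkaEvolve

**Acceptance**: high code quality, complete documentation, ready for production.
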
---
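## 📋 Migration Checklist
### Migrated from ev2.py
- [ ] **Agent creation logic**
  - [x] LLM configuration
  - [x] Tools configuration
  - [x] System prompt loading
  - [ ] Agent initialization parameters
- [ ] **Task construction logic**
  - [x] Primary evaluator path handling
  - [x] Generation information
  - [ ] Additional context (if needed)
- [ ] **Agent run logic**
  - [ ] Sending messages
  - [ ] Handling observations
  - [ ] Waiting for results
- [ ] **Result extraction logic**
  - [x] EVAL_AGENTS.md parsing
  - [x] auxiliary_metrics.py detection
  - [ ] More sophisticated result parsing (optional)
- [ ] **Workspace management**
  - [x] Memory directory creation
  - [ ] Initial file creation
  - [ ] Cleanup logic (optional)

### New functionality
- [x] **HTTP API**
  - [x] Generation notification endpoint
  - [x] Status endpoint
  - [x] Manual trigger endpoint
- [x] **State management**
  - [x] Generation history
  - [x] Trigger decision logic
  - [x] Persistence
- [ ] **Agent persistence**
  - [ ] Agent instance reuse
  - [ ] Conversation history accumulation
  - [ ] Memory shared across generations

### Configuration and deployment
- [x] **Configuration file**
  - [x] Service configuration
  - [x] Trigger policy configuration
  - [ ] Agent parameter configuration
- [ ] **Testing**
  - [x] Basic functional tests
  - [ ] Integration tests
  - [ ] Performance tests
- [ ] **Documentation**
  - [x] API documentation
  - [ ] Migration documentation
  - [ ] Usage guide
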
---
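## 🎯 Key Migration Challenges
### Challenge 1: OpenHands Agent interaction
**Problem**: `ev2.py` uses specific OpenHands APIs, and we need to understand how they work.

**Solution**:
- Start with a simplified version (invoke the agent, wait for completion)
- Refine it step by step (if finer-grained control is needed)
- Use the `ev2.py` implementation as a reference

### Challenge 2: Agent state persistence
**Problem**: Does the agent's context need to be preserved between calls?

**Solution**:
- **Short-term**: create a new agent on every call (as `ev2.py` does)
- **Long-term**: reuse the agent instance and accumulate conversation history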
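
The short-term and long-term options in Challenge 2 can sit behind one switch, so that moving from one to the other does not touch the calling code. The sketch below is illustrative only: `AgentProvider` and its `reuse` flag are hypothetical names, and `factory` stands for whatever function builds the agent (for example `_create_agent`).

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class AgentProvider(Generic[T]):
    """Hypothetical helper that hands out a fresh or a cached agent."""

    def __init__(self, factory: Callable[[], T], reuse: bool = False):
        self._factory = factory
        self._reuse = reuse
        self._cached: Optional[T] = None

    def get(self) -> T:
        if not self._reuse:
            # Short-term behaviour: a new agent for every call.
            return self._factory()
        if self._cached is None:
            # Long-term behaviour: build once, then keep returning it.
            self._cached = self._factory()
        return self._cached
```

Whether reusing the instance also carries the conversation history forward depends on how OpenHands stores it, which needs to be checked against `ev2.py` before the flag is switched on.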
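
### Challenge 3: Error handling
**Problem**: The agent may fail. How do we handle that gracefully?

**Solution**:
- Wrap the agent call in try/except
- Log detailed error information
- Return a meaningful error message
- Keep the service running (no crash)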
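
The four points above can be sketched as a small wrapper around the analysis call. This is a minimal illustration, not the planned implementation: the name `safe_analyze`, the logger name, and the shape of the error response are assumptions, chosen to mirror the `success` and `generation` keys returned by `analyze_generation`.

```python
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger("ev2_service")


async def safe_analyze(
    analyze: Callable[[int], Awaitable[Dict[str, Any]]],
    generation: int,
) -> Dict[str, Any]:
    """Run one analysis and turn any failure into an error response."""
    try:
        return await analyze(generation)
    except Exception as exc:
        # Deliberately broad: a single failed run must not stop the service.
        # logger.exception records the full traceback for later debugging.
        logger.exception("Agent analysis failed for generation %d", generation)
        return {
            "success": False,
            "generation": generation,
            "error": f"{type(exc).__name__}: {exc}",
        }
```

In the service, `handle_generation_notification` could call `await safe_analyze(self.agent.analyze_generation, request.generation)`. Task cancellation still propagates, because `asyncio.CancelledError` does not derive from `Exception` on Python 3.8 and later.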
---
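## 💡 Simplification Strategy
To finish the migration quickly, an **incremental strategy** is recommended.
### MVP version (minimum viable)
**Goal**: Get the service working with the fewest possible changes.

**Simplifications**:
1. **Agent run**: call it directly, without aiming for optimal performance
2. **Result extraction**: simple parsing (as it is now)
3. **State management**: the basic version is enough

**Time**: 1 day

### Enhanced version (production ready)
**Goal**: Improve performance and user experience.

**Enhancements**:
1. **Agent persistence**: reuse the agent instance
2. **Better result parsing**: extract more information
3. **Error recovery**: robust error handling

**Time**: +1 day

### Full version (future extensions)
**Goal**: Prepare for advanced features.

**Extensions**:
1. **MetricUnit integration**
2. **Lifecycle management**
3. **Asynchronous meta-cognition**

**Time**: +1-2 weeks (as needed)
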
---
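## 📊 Comparison: Before vs. After Migration
| Aspect | Before (wrapper) | After (standalone) |
|------|-----------------|---------------------|
| **Dependencies** | Depends on ev2.py | Fully standalone |
| **Architecture** | Two layers | Single layer |
| **State** | Scattered | Centralized |
| **Agent** | Created on every call | Can be persisted |
| **Extensibility** | Limited | High |
| **Maintainability** | Medium | High |
| **Performance** | Has overhead | Optimized |
| **Lines of code** | ~700 | ~800-1000 |
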
---
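## ✅ Acceptance Criteria
The migration is complete when:
1. **Functional completeness**
   - [ ] All `ev2.py` functionality is preserved
   - [ ] The HTTP API works correctly
   - [ ] State persistence works correctly
   - [ ] The agent runs correctly
2. **Tests pass**
   - [ ] `test_ev2_service.py` passes in full
   - [ ] A simulated 25-generation evolution run succeeds
   - [ ] The agent generates EVAL_AGENTS.md and auxiliary_metrics.py
3. **Code quality**
   - [ ] No linter errors
   - [ ] Adequate comments
   - [ ] Clear structure
4. **Complete documentation**
   - [ ] API documentation updated
   - [ ] Usage guide updated
   - [ ] Clear migration notes
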
---
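## 🚀 Getting Started
### Step 1 (today, 30 minutes)
1. Create the `ev2_service_standalone.py` skeleton
2. Copy the HTTP part of `ev2_service.py`
3. Create the `IntegratedEV2Agent` class skeleton

### Step 2 (tomorrow morning, 2-3 hours)
1. Migrate the core logic from `ev2.py` into `IntegratedEV2Agent`
2. Implement `_create_agent` and `_build_task_message`
3. Write a simplified `analyze_generation`

### Step 3 (tomorrow afternoon, 2-3 hours)
1. Full test run
2. Fix issues
3. Update documentation

---
**Ready to start?** I can help you create the skeleton of `ev2_service_standalone.py`! 🚀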