# EV2 Migration Verification
## ✅ Migration Complete!
Successfully migrated from `ev2_service.py` (wrapper) to `ev2_service_standalone.py` (integrated).
### 📊 Migration Summary
| Component | ev2.py Location | ev2_service_standalone.py Location | Status |
|-----------|----------------|-----------------------------------|--------|
| **LLM Creation** | Lines 54-58 | `IntegratedEV2Agent._create_llm()` | ✅ Exact replica |
| **Agent Creation** | Lines 60-73 | `IntegratedEV2Agent._create_agent()` | ✅ Exact replica |
| **Task Building** | Lines 104-204 | `IntegratedEV2Agent._build_task_message()` | ✅ Exact replica |
| **Conversation** | Line 76 | `analyze_generation()` | ✅ Same API usage |
| **Send/Run** | Lines 85-91 | `analyze_generation()` | ✅ Same API usage |
| **Workspace** | Line 41 | `__init__()` | ✅ Same path logic |
| **Error Handling** | Lines 130-136 | `_build_task_message()` | ✅ Same try-except |
| **Print Logs** | Lines 44-100 | Converted to `logging` | ✅ More professional |
### 🔍 Key Differences (Improvements)
1. **Agent Lifecycle**: Agent instance can be reused (no recreation each time)
2. **State Management**: Integrated with service state
3. **Logging**: Uses Python logging instead of print
4. **Error Handling**: More robust, service doesn't crash
5. **Configuration**: Unified config system
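The first, third, and fourth points can be illustrated together. This is a minimal sketch, not the code in `ev2_service_standalone.py`: `ReusableAgentHolder`, `run_safely`, and the `factory`/`analyze` callables are hypothetical names, and it assumes the agent is built lazily on first use and that failures are logged rather than propagated.

```python
import logging

logger = logging.getLogger("ev2_service_sketch")  # hypothetical logger name


class ReusableAgentHolder:
    """Hypothetical holder: builds the agent once and reuses it afterwards."""

    def __init__(self, factory):
        self._factory = factory  # callable that creates the agent
        self._agent = None

    def get(self):
        # Create the agent on first use only; later calls return the same one.
        if self._agent is None:
            self._agent = self._factory()
        return self._agent


def run_safely(holder, generation, analyze):
    """Run one analysis; log any failure instead of letting it escape."""
    try:
        return analyze(holder.get(), generation)
    except Exception:
        # logging.exception records the traceback via the logging system.
        logger.exception("Agent run failed for generation %d", generation)
        return None
```

With this shape, a failed run returns `None` to the caller, so the surrounding service loop can continue with the next notification.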
### 🎯 What Was Preserved (100% Compatibility)
1. βœ… **Exact same LLM configuration** (model, api_key, base_url from env vars)
2. βœ… **Exact same tools** (Terminal, FileEditor, TaskTracker)
3. βœ… **Exact same prompt template** (ev2_prompt.j2)
4. βœ… **Exact same task message format** (all text, structure preserved)
5. βœ… **Exact same workspace path** (results_dir/eval_agent_memory)
6. βœ… **Exact same file generation** (EVAL_AGENTS.md, auxiliary_metrics.py)
7. βœ… **Exact same Conversation API usage**
### πŸ§ͺ Testing Checklist
- [ ] Service starts without errors
- [ ] Agent initialization successful
- [ ] Generation notifications work
- [ ] Agent triggers at correct intervals
- [ ] Agent generates EVAL_AGENTS.md
- [ ] Agent generates auxiliary_metrics.py
- [ ] Service state persists correctly
- [ ] Manual trigger works
- [ ] Error handling works (graceful failures)
---
## 🚀 Testing Instructions
### Step 1: Start the Standalone Service
```bash
cd /home/tengxiao/pj/ShinkaEvolve
# Make sure old service is stopped
pkill -f "ev2_service"
# Start new standalone service
python eval_agent/ev2_service_standalone.py \
--config eval_agent/ev2_service_config.yaml
```
**Expected output**:
```
================================================================================
✅ IntegratedEV2Agent Initialized
================================================================================
Results Dir: /path/to/results
Workspace: /path/to/results/eval_agent_memory
Primary Evaluator: /path/to/evaluate_ori.py
================================================================================
🤖 Creating LLM: vertex_ai/gemini-2.5-flash
📋 Loading prompt: /path/to/ev2_prompt.j2
✅ Agent created
✅ Integrated EV2 Agent ready
================================================================================
✅ Service Started
Experiment: circle_packing_NO_vision
Results dir: ...
Trigger mode: periodic
Trigger interval: 10
================================================================================
INFO: Uvicorn running on http://0.0.0.0:8765
```
### Step 2: Test Service Status
```bash
# In another terminal
cd /home/tengxiao/pj/ShinkaEvolve
python eval_agent/test_ev2_service.py \
--results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
--test-mode status
```
**Expected output**:
```
🔍 Testing service status...
✅ Service is running!
Uptime: X.Xs
Trigger mode: periodic
Trigger interval: 10
```
### Step 3: Simulate Evolution (Small Test)
```bash
# Test with just 12 generations (will trigger once at gen 10)
python eval_agent/test_ev2_service.py \
--results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
--test-mode simulate \
--num-gens 12
```
**Expected behavior**:
```
Gen 0-9: → SKIP (fast, ~0.1s each)
Gen 10: → TRIGGER (slow, ~60-240s, agent runs)
Gen 11: → SKIP (fast)
```
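The pattern above is what a simple modulo rule would produce. The following is a minimal sketch, assuming the periodic mode fires whenever the generation number is a positive multiple of the trigger interval; `should_trigger` is a hypothetical name, not the service's actual function.

```python
def should_trigger(generation: int, interval: int = 10) -> bool:
    """Assumed periodic rule: fire on positive multiples of the interval."""
    return generation > 0 and generation % interval == 0


# Simulate the 12 notifications sent in this step (generations 0 through 11).
triggered = [gen for gen in range(12) if should_trigger(gen)]
print(triggered)  # [10]
```

Under this assumption, a longer run with the same interval would trigger at generations 10, 20, 30, and so on.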
**Check outputs**:
```bash
# Check service state
cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/service_state.json
# Should show:
# - total_notifications: 12
# - total_agent_runs: 1
# - last_agent_trigger_gen: 10
# Check agent outputs
ls -la examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/
# Should have:
# - EVAL_AGENTS.md (updated)
# - auxiliary_metrics.py (created/updated)
# - service_state.json (new)
```
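The state check can also be automated. This is a sketch under the assumption that `service_state.json` is a flat JSON object with the three keys listed above at the top level; `find_state_mismatches` and `check_state_file` are hypothetical helper names.

```python
import json
from pathlib import Path

# Values expected after the 12-generation simulation in this step.
EXPECTED = {
    "total_notifications": 12,
    "total_agent_runs": 1,
    "last_agent_trigger_gen": 10,
}


def find_state_mismatches(state: dict, expected: dict = EXPECTED) -> list[str]:
    """Return one message per key whose value differs from the expectation."""
    return [
        f"{key}: expected {want!r}, got {state.get(key)!r}"
        for key, want in expected.items()
        if state.get(key) != want
    ]


def check_state_file(path) -> list[str]:
    """Load the JSON state file at `path` and compare it with EXPECTED."""
    return find_state_mismatches(json.loads(Path(path).read_text()))
```

An empty list from `check_state_file(...)` means all three counters match.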
### Step 4: Verify Agent Output Quality
```bash
# Check that EVAL_AGENTS.md has new content
tail -50 examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/EVAL_AGENTS.md
# Check that auxiliary_metrics.py is valid Python
python -m py_compile examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/auxiliary_metrics.py
```
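If the check needs to run inside a test script rather than the shell, a similar syntax-only check can be done with the standard library. This is a minimal sketch; `python_syntax_error` is a hypothetical helper name, and like `py_compile` it says nothing about whether the metrics are meaningful.

```python
import ast


def python_syntax_error(source: str):
    """Return a short description of the first syntax error, or None if valid."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None
```

It could be applied to the generated file by reading `auxiliary_metrics.py` as text and passing the contents to the function.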
### Step 5: Test Manual Trigger
```bash
curl -X POST "http://localhost:8765/api/v1/trigger/manual?generation=5"
```
This should trigger the agent for generation 5, provided that generation exists in the history.
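The same request can be built from Python with the standard library, which is convenient inside a test script. This is a sketch: the URL and query parameter come from the `curl` command above, while `build_manual_trigger_request` is a hypothetical helper and the timeout value is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_manual_trigger_request(
    generation: int, base_url: str = "http://localhost:8765"
) -> Request:
    """Build a POST request equivalent to the curl command for one generation."""
    query = urlencode({"generation": generation})
    return Request(f"{base_url}/api/v1/trigger/manual?{query}", method="POST")


request = build_manual_trigger_request(5)
print(request.get_method(), request.full_url)
# POST http://localhost:8765/api/v1/trigger/manual?generation=5

# To actually send it (requires the service to be running; the long timeout
# is an assumption, chosen because agent runs are slow):
# with urlopen(request, timeout=300) as response:
#     print(response.status, response.read().decode())
```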
---
## 🐛 Troubleshooting
### Issue: "Agent not initialized"
**Symptom**: Service starts but agent triggers fail
**Check**:
```bash
# Look for this in startup logs:
# ❌ Failed to initialize agent: ...
```
**Common causes**:
1. Primary evaluator path wrong → Check `primary_evaluator` in config
2. LLM config wrong → Check env vars: `LLM_MODEL`, `LLM_API_KEY`
3. ev2_prompt.j2 missing → Check file exists in eval_agent/
**Fix**:
```bash
# Verify primary evaluator exists
ls -la examples/circle_packing/evaluate_ori.py
# Verify prompt exists
ls -la eval_agent/ev2_prompt.j2
# Check LLM env vars
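echo "LLM_MODEL=${LLM_MODEL:-<unset>}"
# Report only whether the key is set, to avoid printing the secret
[ -n "$LLM_API_KEY" ] && echo "LLM_API_KEY is set" || echo "LLM_API_KEY is unset"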
```
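These checks can be bundled into one preflight step that runs before the service starts. This is a minimal sketch: `preflight` is a hypothetical helper, and the file paths and variable names are the ones listed above.

```python
import os
from pathlib import Path


def preflight(required_files, required_env, environ=os.environ):
    """Return a list of problems: missing files and unset or empty env vars."""
    problems = []
    for file_path in required_files:
        if not Path(file_path).is_file():
            problems.append(f"missing file: {file_path}")
    for name in required_env:
        # Only presence is checked, so secret values are never printed.
        if not environ.get(name):
            problems.append(f"unset env var: {name}")
    return problems


if __name__ == "__main__":
    # Run from the repository root so the relative paths resolve.
    for problem in preflight(
        ["examples/circle_packing/evaluate_ori.py", "eval_agent/ev2_prompt.j2"],
        ["LLM_MODEL", "LLM_API_KEY"],
    ):
        print(problem)
```

No output means every file was found and every variable is set.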
### Issue: Agent runs but produces no output
**Symptom**: Agent completes but EVAL_AGENTS.md is empty or not updated
**Check**:
1. Workspace permissions
2. Agent logs (look for errors during run)
3. LLM API connectivity
### Issue: Service crashes on agent trigger
**Symptom**: Service stops when trying to run agent
**Check**:
1. Look at full error traceback
2. Check if OpenHands SDK version is compatible
3. Verify all dependencies installed
---
## ✅ Success Criteria
The migration is successful if:
1. ✅ Service starts without errors
2. ✅ Agent initializes (no "Agent not initialized" errors)
3. ✅ Agent triggers at correct generations (10, 20, 30...)
4. ✅ Agent generates EVAL_AGENTS.md with meaningful content
5. ✅ Agent generates auxiliary_metrics.py with valid Python code
6. ✅ Service state persists across notifications
7. ✅ No crashes or fatal errors during agent runs
---
## 📝 Next Steps After Verification
Once all tests pass:
1. **Update documentation** to point to standalone version
2. **Archive old version**: Rename `ev2_service.py` to `ev2_service_wrapper_old.py`
3. **Update test scripts** to use standalone by default
4. **Integrate with ShinkaEvolve**: Add notification code to EvolutionRunner
5. **Production deployment**: Add systemd service, monitoring, etc.
---
## 🎉 Migration Benefits
### Performance
- ✅ Agent can be reused (no recreation overhead)
- ✅ Faster startup (agent pre-initialized)
### Maintainability
- ✅ Single codebase (no wrapper layer)
- ✅ Clearer architecture
- ✅ Easier to debug
### Extensibility
- ✅ Ready for MetricUnit integration
- ✅ Ready for Lifecycle management
- ✅ Ready for async meta-cognition
### Reliability
- ✅ Better error handling
- ✅ Doesn't depend on subprocess calls
- ✅ Unified state management