	# EV2 Migration Verification

	## ✅ Migration Complete!

	Successfully migrated from `ev2_service.py` (wrapper) to `ev2_service_standalone.py` (integrated).

	### 📊 Migration Summary

	| Component | ev2.py Location | ev2_service_standalone.py Location | Status |
	|-----------|-----------------|------------------------------------|--------|
	| LLM Creation | Lines 54-58 | `IntegratedEV2Agent._create_llm()` | ✅ Exact replica |
	| Agent Creation | Lines 60-73 | `IntegratedEV2Agent._create_agent()` | ✅ Exact replica |
	| Task Building | Lines 104-204 | `IntegratedEV2Agent._build_task_message()` | ✅ Exact replica |
	| Conversation | Line 76 | `analyze_generation()` | ✅ Same API usage |
	| Send/Run | Lines 85-91 | `analyze_generation()` | ✅ Same API usage |
	| Workspace | Line 41 | `__init__()` | ✅ Same path logic |
	| Error Handling | Lines 130-136 | `_build_task_message()` | ✅ Same try-except |
	| Print Logs | Lines 44-100 | Converted to `logging` | ✅ More professional |

	### 🔍 Key Differences (Improvements)

	1. Agent Lifecycle: The agent instance can be reused across triggers instead of being recreated each time.
	2. State Management: Agent state is integrated with the service state.
	3. Logging: Uses Python's `logging` module instead of `print`.
	4. Error Handling: More robust; an agent failure does not crash the service.
	5. Configuration: Uses a unified config system.
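
The lifecycle, logging, and error-handling differences above can be illustrated with a minimal sketch. `ReusableAgentHolder`, `factory`, and `run_safely` are hypothetical names chosen for this example; they are not taken from `ev2_service_standalone.py`, whose actual structure may differ.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("ev2_service_sketch")


class ReusableAgentHolder:
    """Builds an agent once and reuses it for every trigger (hypothetical helper)."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory  # assumed: a callable that builds the agent
        self._agent: Optional[Any] = None
        self.build_count = 0

    def get(self) -> Any:
        # Create the agent on first use only; later calls return the same object.
        if self._agent is None:
            self._agent = self._factory()
            self.build_count += 1
            logger.info("Agent created")
        return self._agent

    def run_safely(self, job: Callable[[Any], Any]) -> bool:
        """Run one job against the agent; log failures instead of raising."""
        try:
            job(self.get())
            return True
        except Exception:
            # logger.exception records the traceback at ERROR level.
            logger.exception("Agent run failed; service keeps running")
            return False
```

Catching the exception at the boundary of a single run is what lets a long-lived service survive one bad agent invocation, while the traceback still lands in the logs.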

	### 🎯 What Was Preserved (100% Compatibility)

	1. ✅ Exact same LLM configuration (model, api_key, base_url from env vars)
	2. ✅ Exact same tools (Terminal, FileEditor, TaskTracker)
	3. ✅ Exact same prompt template (ev2_prompt.j2)
	4. ✅ Exact same task message format (all text, structure preserved)
	5. ✅ Exact same workspace path (results_dir/eval_agent_memory)
	6. ✅ Exact same file generation (EVAL_AGENTS.md, auxiliary_metrics.py)
	7. ✅ Exact same Conversation API usage

	### 🧪 Testing Checklist

	- [ ] Service starts without errors
	- [ ] Agent initialization successful
	- [ ] Generation notifications work
	- [ ] Agent triggers at correct intervals
	- [ ] Agent generates EVAL_AGENTS.md
	- [ ] Agent generates auxiliary_metrics.py
	- [ ] Service state persists correctly
	- [ ] Manual trigger works
	- [ ] Error handling works (graceful failures)

	---

	## 🚀 Testing Instructions

	### Step 1: Start the Standalone Service

	```bash
	cd /home/tengxiao/pj/ShinkaEvolve

	# Make sure old service is stopped
	pkill -f "ev2_service"

	# Start new standalone service
	python eval_agent/ev2_service_standalone.py \
	    --config eval_agent/ev2_service_config.yaml
	```

	Expected output:
	```
	================================================================================
	✅ IntegratedEV2Agent Initialized
	================================================================================
	Results Dir: /path/to/results
	Workspace: /path/to/results/eval_agent_memory
	Primary Evaluator: /path/to/evaluate_ori.py
	================================================================================
	🤖 Creating LLM: vertex_ai/gemini-2.5-flash
	📋 Loading prompt: /path/to/ev2_prompt.j2
	✅ Agent created
	✅ Integrated EV2 Agent ready
	================================================================================
	✅ Service Started
	Experiment: circle_packing_NO_vision
	Results dir: ...
	Trigger mode: periodic
	Trigger interval: 10
	================================================================================
	INFO: Uvicorn running on http://0.0.0.0:8765
	```

	### Step 2: Test Service Status

	```bash
	# In another terminal
	cd /home/tengxiao/pj/ShinkaEvolve

	python eval_agent/test_ev2_service.py \
	    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
	    --test-mode status
	```

	Expected output:
	```
	🔍 Testing service status...
	✅ Service is running!
	Uptime: X.Xs
	Trigger mode: periodic
	Trigger interval: 10
	```

	### Step 3: Simulate Evolution (Small Test)

	```bash
	# Test with just 12 generations (will trigger once at gen 10)
	python eval_agent/test_ev2_service.py \
	    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
	    --test-mode simulate \
	    --num-gens 12
	```

	Expected behavior:
	```
	Gen 0-9: → SKIP (fast, ~0.1s each)
	Gen 10: → TRIGGER (slow, ~60-240s, agent runs)
	Gen 11: → SKIP (fast)
	```
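
The skip/trigger pattern above is consistent with a simple modulo rule. The sketch below is an assumption about how a periodic trigger could be decided, not the service's actual code; `should_trigger` and `simulate` are hypothetical names.

```python
from typing import List


def should_trigger(generation: int, interval: int = 10) -> bool:
    """Return True when a periodic trigger would fire for this generation.

    Assumption: generation 0 never triggers and the agent fires on every
    positive multiple of `interval`, matching the 10, 20, 30... pattern.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return generation > 0 and generation % interval == 0


def simulate(num_gens: int, interval: int = 10) -> List[int]:
    """List the generations in range(num_gens) that would trigger the agent."""
    return [g for g in range(num_gens) if should_trigger(g, interval)]
```

Under this rule, `simulate(12)` returns `[10]`, which matches the single trigger expected from a 12-generation run.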

	Check outputs:
	```bash
	# Check service state
	cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/service_state.json

	# Should show:
	# - total_notifications: 12
	# - total_agent_runs: 1
	# - last_agent_trigger_gen: 10

	# Check agent outputs
	ls -la examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/

	# Should have:
	# - EVAL_AGENTS.md (updated)
	# - auxiliary_metrics.py (created/updated)
	# - service_state.json (new)
	```
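
To avoid checking the state file by eye, the expected values can be compared programmatically. This is a minimal sketch that assumes `service_state.json` is a flat JSON object with the three keys listed above at the top level; `check_service_state` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Dict, List


def check_service_state(state_path: Path, expected: Dict[str, int]) -> List[str]:
    """Compare selected fields of service_state.json against expected values.

    Returns a list of human-readable mismatches; an empty list means all matched.
    Assumption: the expected keys live at the top level of the JSON object.
    """
    state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    problems = []
    for key, want in expected.items():
        got = state.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems
```

For the 12-generation simulation, the expected mapping would be `{"total_notifications": 12, "total_agent_runs": 1, "last_agent_trigger_gen": 10}`.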

	### Step 4: Verify Agent Output Quality

	```bash
	# Check that EVAL_AGENTS.md has new content
	tail -50 examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/EVAL_AGENTS.md

	# Check that auxiliary_metrics.py is valid Python
	python -m py_compile examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/auxiliary_metrics.py
	```
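
Both checks can also be combined into one Python helper. The sketch below only verifies that `EVAL_AGENTS.md` is non-empty and that `auxiliary_metrics.py` parses; it does not execute the generated code or judge content quality. `check_agent_outputs` is a hypothetical name.

```python
import ast
from pathlib import Path
from typing import List


def check_agent_outputs(workspace: Path) -> List[str]:
    """Return a list of problems found in the agent's output files."""
    problems = []
    notes = workspace / "EVAL_AGENTS.md"
    metrics = workspace / "auxiliary_metrics.py"

    # The notes file must exist and contain something other than whitespace.
    if not notes.is_file() or not notes.read_text(encoding="utf-8").strip():
        problems.append("EVAL_AGENTS.md is missing or empty")

    if not metrics.is_file():
        problems.append("auxiliary_metrics.py is missing")
    else:
        try:
            # Parsing checks syntax only; nothing in the file is executed.
            ast.parse(metrics.read_text(encoding="utf-8"), filename=str(metrics))
        except SyntaxError as exc:
            problems.append(
                f"auxiliary_metrics.py has a syntax error: {exc.msg} (line {exc.lineno})"
            )
    return problems
```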

	### Step 5: Test Manual Trigger

	```bash
	curl -X POST "http://localhost:8765/api/v1/trigger/manual?generation=5"
	```

	This should trigger the agent for generation 5, provided that generation exists in the history.
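
The same request can be sent from Python with the standard library. This sketch reuses the URL from the `curl` command; the response format is not documented here, so the body is returned as raw text. The function names and the 300-second timeout are assumptions (the timeout is chosen only because an agent run may take minutes, and whether this endpoint blocks until the run finishes is not confirmed).

```python
import urllib.parse
import urllib.request


def build_manual_trigger_request(
    generation: int, base_url: str = "http://localhost:8765"
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    query = urllib.parse.urlencode({"generation": generation})
    url = f"{base_url}/api/v1/trigger/manual?{query}"
    return urllib.request.Request(url, method="POST")


def trigger_manually(generation: int, timeout: float = 300.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_manual_trigger_request(generation)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```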

	---

	## 🐛 Troubleshooting

	### Issue: "Agent not initialized"

	Symptom: Service starts but agent triggers fail

	Check:
	```bash
	# Look for this in startup logs:
	# ❌ Failed to initialize agent: ...
	```

	Common causes:
	1. Primary evaluator path wrong → Check `primary_evaluator` in config
	2. LLM config wrong → Check env vars: `LLM_MODEL`, `LLM_API_KEY`
	3. ev2_prompt.j2 missing → Check file exists in eval_agent/

	Fix:
	```bash
	# Verify primary evaluator exists
	ls -la examples/circle_packing/evaluate_ori.py

	# Verify prompt exists
	ls -la eval_agent/ev2_prompt.j2

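	# Check LLM env vars (report whether the API key is set without printing it)
	echo "$LLM_MODEL"
	[ -n "$LLM_API_KEY" ] && echo "LLM_API_KEY is set" || echo "LLM_API_KEY is NOT set"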
	```
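
These checks can be bundled into a small preflight script that runs before the service starts. The sketch below is illustrative only: `preflight` is a hypothetical helper, and it reports whether each variable is set without ever printing its value.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping


def preflight(
    required_files: Iterable[str],
    required_env: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of missing files and unset (or empty) environment variables."""
    problems = []
    for path in required_files:
        if not Path(path).is_file():
            problems.append(f"missing file: {path}")
    for name in required_env:
        # Only presence is checked; the value itself is never reported.
        if not environ.get(name):
            problems.append(f"environment variable not set: {name}")
    return problems
```

A call matching the checks above would be `preflight(["examples/circle_packing/evaluate_ori.py", "eval_agent/ev2_prompt.j2"], ["LLM_MODEL", "LLM_API_KEY"])`, run from the repository root.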

	### Issue: Agent runs but produces no output

	Symptom: Agent completes but EVAL_AGENTS.md is empty or not updated

	Check:
	1. Workspace permissions
	2. Agent logs (look for errors during run)
	3. LLM API connectivity
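
For the first item, one way to test workspace permissions is to try creating a temporary file in the directory. This is a minimal sketch; `workspace_is_writable` is a hypothetical helper, and it must run as the same user that runs the service for the result to be meaningful.

```python
import tempfile
from pathlib import Path


def workspace_is_writable(workspace: Path) -> bool:
    """Return True if a temporary file can be created inside `workspace`."""
    try:
        # The file is deleted automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=workspace):
            pass
        return True
    except OSError:
        # Covers a missing directory as well as permission errors.
        return False
```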

	### Issue: Service crashes on agent trigger

	Symptom: Service stops when trying to run agent

	Check:
	1. Look at full error traceback
	2. Check if OpenHands SDK version is compatible
	3. Verify all dependencies installed
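
Installed versions can be listed with `importlib.metadata` from the standard library. The distribution name of the OpenHands SDK is not given in this document, so the sketch takes the names as input; substitute whatever names your environment actually uses. `installed_versions` is a hypothetical helper.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Any entry mapped to `None` points to a dependency that is not installed in the current environment.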

	---

	## ✅ Success Criteria

	The migration is successful if:

	1. ✅ Service starts without errors
	2. ✅ Agent initializes (no "Agent not initialized" errors)
	3. ✅ Agent triggers at correct generations (10, 20, 30...)
	4. ✅ Agent generates EVAL_AGENTS.md with meaningful content
	5. ✅ Agent generates auxiliary_metrics.py with valid Python code
	6. ✅ Service state persists across notifications
	7. ✅ No crashes or fatal errors during agent runs

	---

	## 📝 Next Steps After Verification

	Once all tests pass:

	1. Update documentation to point to standalone version
	2. Archive old version: Rename `ev2_service.py` to `ev2_service_wrapper_old.py`
	3. Update test scripts to use standalone by default
	4. Integrate with ShinkaEvolve: Add notification code to EvolutionRunner
	5. Production deployment: Add systemd service, monitoring, etc.

	---

	## 🎉 Migration Benefits

	### Performance
	- ✅ Agent can be reused (no recreation overhead)
	- ✅ Faster agent triggers (agent is pre-initialized at service startup)

	### Maintainability
	- ✅ Single codebase (no wrapper layer)
	- ✅ Clearer architecture
	- ✅ Easier to debug

	### Extensibility
	- ✅ Ready for MetricUnit integration
	- ✅ Ready for Lifecycle management
	- ✅ Ready for async meta-cognition

	### Reliability
	- ✅ Better error handling
	- ✅ Doesn't depend on subprocess calls
	- ✅ Unified state management