EV2 Migration Verification

✅ Migration Complete!

Successfully migrated from ev2_service.py (wrapper) to ev2_service_standalone.py (integrated).

📊 Migration Summary

| Component | `ev2.py` Location | `ev2_service_standalone.py` Location | Status |
|---|---|---|---|
| LLM Creation | Lines 54-58 | `IntegratedEV2Agent._create_llm()` | ✅ Exact replica |
| Agent Creation | Lines 60-73 | `IntegratedEV2Agent._create_agent()` | ✅ Exact replica |
| Task Building | Lines 104-204 | `IntegratedEV2Agent._build_task_message()` | ✅ Exact replica |
| Conversation | Line 76 | `analyze_generation()` | ✅ Same API usage |
| Send/Run | Lines 85-91 | `analyze_generation()` | ✅ Same API usage |
| Workspace | Line 41 | `__init__()` | ✅ Same path logic |
| Error Handling | Lines 130-136 | `_build_task_message()` | ✅ Same try-except |
| Print Logs | Lines 44-100 | Converted to `logging` | ✅ More professional |

🔍 Key Differences (Improvements)

Agent Lifecycle: Agent instance can be reused (no recreation each time)
State Management: Integrated with service state
Logging: Uses Python logging instead of print
Error Handling: Agent failures are handled gracefully, so the service keeps running instead of crashing
Configuration: Unified config system
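
The lifecycle, logging, and error-handling differences can be illustrated with a minimal sketch. `AgentHost`, `agent_factory`, and `run_once` are hypothetical names chosen for illustration and are not taken from `ev2_service_standalone.py`. The sketch assumes the agent is a plain callable that is built once, reused for every run, and whose failures are logged rather than propagated.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("ev2_sketch")  # hypothetical logger name

Agent = Callable[[int], str]


class AgentHost:
    """Hypothetical holder that builds an agent once and reuses it."""

    def __init__(self, agent_factory: Callable[[], Agent]):
        self._agent_factory = agent_factory
        self._agent: Optional[Agent] = None

    def initialize(self) -> bool:
        """Build the agent once at startup; report failure without raising."""
        try:
            self._agent = self._agent_factory()
            return True
        except Exception:
            logger.exception("Failed to initialize agent")
            return False

    def run_once(self, generation: int) -> Optional[str]:
        """Run the already-built agent; return None if the run fails."""
        if self._agent is None:
            logger.error("Agent not initialized")
            return None
        try:
            return self._agent(generation)
        except Exception:
            # Log the traceback and keep the host (the service) alive.
            logger.exception("Agent run failed for generation %d", generation)
            return None
```

In this sketch the factory runs exactly once, and a failing run for one generation does not prevent later generations from being analyzed.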

🎯 What Was Preserved (100% Compatibility)

✅ Exact same LLM configuration (model, api_key, base_url from env vars)
✅ Exact same tools (Terminal, FileEditor, TaskTracker)
✅ Exact same prompt template (ev2_prompt.j2)
✅ Exact same task message format (all text, structure preserved)
✅ Exact same workspace path (results_dir/eval_agent_memory)
✅ Exact same file generation (EVAL_AGENTS.md, auxiliary_metrics.py)
✅ Exact same Conversation API usage

🧪 Testing Checklist

Service starts without errors
Agent initialization successful
Generation notifications work
Agent triggers at correct intervals
Agent generates EVAL_AGENTS.md
Agent generates auxiliary_metrics.py
Service state persists correctly
Manual trigger works
Error handling works (graceful failures)

🚀 Testing Instructions

### Step 1: Start the Standalone Service

```bash
cd /home/tengxiao/pj/ShinkaEvolve

# Make sure old service is stopped
pkill -f "ev2_service"

# Start new standalone service
python eval_agent/ev2_service_standalone.py \
    --config eval_agent/ev2_service_config.yaml
```

Expected output:

```
✅ IntegratedEV2Agent Initialized

Results Dir: /path/to/results
Workspace: /path/to/results/eval_agent_memory
Primary Evaluator: /path/to/evaluate_ori.py

🤖 Creating LLM: vertex_ai/gemini-2.5-flash
📋 Loading prompt: /path/to/ev2_prompt.j2
✅ Agent created
✅ Integrated EV2 Agent ready

✅ Service Started
Experiment: circle_packing_NO_vision
Results dir: ...
Trigger mode: periodic
Trigger interval: 10

INFO: Uvicorn running on http://0.0.0.0:8765
```

### Step 2: Test Service Status

```bash
# In another terminal
cd /home/tengxiao/pj/ShinkaEvolve

python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode status
```
Expected output:

```
🔍 Testing service status...
✅ Service is running!
   Uptime: X.Xs
   Trigger mode: periodic
   Trigger interval: 10
```

### Step 3: Simulate Evolution (Small Test)

```bash
# Test with just 12 generations (will trigger once at gen 10)
python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode simulate \
    --num-gens 12
```

Expected behavior:

```
Gen 0-9:  → SKIP (fast, ~0.1s each)
Gen 10:   → TRIGGER (slow, ~60-240s, agent runs)
Gen 11:   → SKIP (fast)
```
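
The skip/trigger pattern above is consistent with a simple modulo rule. The sketch below is an assumption inferred from that expected behavior, not the service's actual implementation; `should_trigger` and `simulate` are hypothetical names.

```python
from typing import List


def should_trigger(generation: int, interval: int) -> bool:
    """Return True when a periodic trigger is due (assumed rule)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Generation 0 is skipped, matching the expected behavior listed above.
    return generation > 0 and generation % interval == 0


def simulate(num_gens: int, interval: int) -> List[int]:
    """Return the generations that would trigger in a run of num_gens."""
    return [g for g in range(num_gens) if should_trigger(g, interval)]
```

Under this rule, `simulate(12, 10)` returns `[10]`, which agrees with a single agent run for the 12-generation test.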

Check outputs:

```bash
# Check service state
cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/service_state.json

# Should show:
# - total_notifications: 12
# - total_agent_runs: 1
# - last_agent_trigger_gen: 10

# Check agent outputs
ls -la examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/

# Should have:
# - EVAL_AGENTS.md (updated)
# - auxiliary_metrics.py (created/updated)
# - service_state.json (new)
```
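
The state check can also be scripted instead of read by eye. This is a minimal sketch that assumes `service_state.json` is a flat JSON object with the three counters as top-level keys; the real file layout may differ. `check_service_state` and `EXPECTED_AFTER_12_GENS` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Dict, List, Union

# Values expected after the 12-generation simulation described above.
EXPECTED_AFTER_12_GENS: Dict[str, int] = {
    "total_notifications": 12,
    "total_agent_runs": 1,
    "last_agent_trigger_gen": 10,
}


def check_service_state(
    state_path: Union[str, Path], expected: Dict[str, int]
) -> List[str]:
    """Return mismatch descriptions; an empty list means every field matches."""
    # Assumption: the counters are top-level keys of a single JSON object.
    state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    problems = []
    for key, want in expected.items():
        got = state.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems
```

Calling `check_service_state(path, EXPECTED_AFTER_12_GENS)` on the state file returns an empty list when the simulation behaved as expected.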

### Step 4: Verify Agent Output Quality

```bash
# Check that EVAL_AGENTS.md has new content
tail -50 examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/EVAL_AGENTS.md

# Check that auxiliary_metrics.py is valid Python
python -m py_compile examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/auxiliary_metrics.py
```
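
Both output checks can be combined into one small helper. This is a sketch under the assumption that the two files sit directly in the workspace directory. It uses the built-in `compile()` rather than `py_compile` so that no `.pyc` file is written into the workspace. `validate_agent_outputs` is a hypothetical name.

```python
from pathlib import Path
from typing import Dict, Union


def validate_agent_outputs(workspace: Union[str, Path]) -> Dict[str, bool]:
    """Check that the notes file is non-empty and the metrics file parses."""
    workspace = Path(workspace)
    notes = workspace / "EVAL_AGENTS.md"
    metrics = workspace / "auxiliary_metrics.py"

    report = {
        "notes_nonempty": notes.is_file() and notes.stat().st_size > 0,
        "metrics_compiles": False,
    }
    if metrics.is_file():
        try:
            # Syntax check only: the code is compiled but never executed.
            compile(metrics.read_text(encoding="utf-8"), str(metrics), "exec")
            report["metrics_compiles"] = True
        except (SyntaxError, ValueError):
            pass
    return report
```

A syntax check only confirms that the file parses; it says nothing about whether the generated metrics are meaningful, so reviewing `EVAL_AGENTS.md` by hand is still worthwhile.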

### Step 5: Test Manual Trigger

```bash
curl -X POST "http://localhost:8765/api/v1/trigger/manual?generation=5"
```

This should trigger the agent for generation 5, provided that generation exists in the history.
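
The same manual trigger can be sent from Python using only the standard library. The endpoint path and port come from the `curl` command above. The response format is not documented here, so this sketch returns only the HTTP status code. The 300-second timeout assumes the request may block while the agent runs. `build_manual_trigger_request` and `trigger_manual` are hypothetical helper names.

```python
import urllib.parse
import urllib.request


def build_manual_trigger_request(
    generation: int, base_url: str = "http://localhost:8765"
) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    query = urllib.parse.urlencode({"generation": generation})
    url = f"{base_url}/api/v1/trigger/manual?{query}"
    return urllib.request.Request(url, method="POST")


def trigger_manual(
    generation: int,
    base_url: str = "http://localhost:8765",
    timeout: float = 300.0,  # assumption: the call may wait for the agent run
) -> int:
    """Send the request and return the HTTP status code."""
    request = build_manual_trigger_request(generation, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so a caller that wants to inspect error responses needs to catch that exception.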

🐛 Troubleshooting

Issue: "Agent not initialized"

Symptom: Service starts but agent triggers fail

Check:

```
# Look for this in startup logs:
# ❌ Failed to initialize agent: ...
```

Common causes:

Primary evaluator path wrong → Check primary_evaluator in config
LLM config wrong → Check env vars: LLM_MODEL, LLM_API_KEY
ev2_prompt.j2 missing → Check file exists in eval_agent/

Fix:

```bash
# Verify primary evaluator exists
ls -la examples/circle_packing/evaluate_ori.py

# Verify prompt exists
ls -la eval_agent/ev2_prompt.j2

# Check LLM env vars
echo $LLM_MODEL
echo $LLM_API_KEY
```
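
The three checks above can be bundled into a preflight helper that runs before the service starts. This is a sketch, not part of the service: `preflight` and `REQUIRED_ENV_VARS` are hypothetical names, and only the two environment variables named above are checked. Unlike `echo`, it reports whether a variable is set without printing the API key.

```python
import os
from pathlib import Path
from typing import List, Mapping, Union

# Environment variables named in the troubleshooting notes above.
REQUIRED_ENV_VARS = ("LLM_MODEL", "LLM_API_KEY")


def preflight(
    primary_evaluator: Union[str, Path],
    prompt_path: Union[str, Path],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if not Path(primary_evaluator).is_file():
        problems.append(f"primary evaluator not found: {primary_evaluator}")
    if not Path(prompt_path).is_file():
        problems.append(f"prompt template not found: {prompt_path}")
    for name in REQUIRED_ENV_VARS:
        # Treat unset and empty values the same; never print the value.
        if not env.get(name):
            problems.append(f"environment variable not set: {name}")
    return problems
```

For example, `preflight("examples/circle_packing/evaluate_ori.py", "eval_agent/ev2_prompt.j2")` run from the repository root returns an empty list when all three common causes are ruled out.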

Issue: Agent runs but produces no output

Symptom: Agent completes but EVAL_AGENTS.md is empty or not updated

Check:

Workspace permissions
Agent logs (look for errors during run)
LLM API connectivity

Issue: Service crashes on agent trigger

Symptom: Service stops when trying to run agent

Check:

Look at full error traceback
Check if OpenHands SDK version is compatible
Verify all dependencies installed
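
Dependency and version checks can be done with the standard library's `importlib.metadata`. In this sketch the distribution names are supplied by the caller. Names such as `openhands-sdk` or `uvicorn` are assumptions that should be confirmed against the project's own requirements. `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(dist_names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Any entry that maps to `None` points to a missing dependency. The reported versions can then be compared with whatever versions the project pins.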

✅ Success Criteria

The migration is successful if:

✅ Service starts without errors
✅ Agent initializes (no "Agent not initialized" errors)
✅ Agent triggers at correct generations (10, 20, 30...)
✅ Agent generates EVAL_AGENTS.md with meaningful content
✅ Agent generates auxiliary_metrics.py with valid Python code
✅ Service state persists across notifications
✅ No crashes or fatal errors during agent runs

📝 Next Steps After Verification

Once all tests pass:

Update documentation to point to standalone version
Archive old version: Rename ev2_service.py to ev2_service_wrapper_old.py
Update test scripts to use standalone by default
Integrate with ShinkaEvolve: Add notification code to EvolutionRunner
Production deployment: Add systemd service, monitoring, etc.

🎉 Migration Benefits

Performance

✅ Agent can be reused (no recreation overhead)
✅ Faster trigger handling (agent is pre-initialized at service startup)

Maintainability

✅ Single codebase (no wrapper layer)
✅ Clearer architecture
✅ Easier to debug

Extensibility

✅ Ready for MetricUnit integration
✅ Ready for Lifecycle management
✅ Ready for async meta-cognition

Reliability

✅ Better error handling
✅ Doesn't depend on subprocess calls
✅ Unified state management