
EV2 Migration Verification

✅ Migration Complete!

Successfully migrated from ev2_service.py (wrapper) to ev2_service_standalone.py (integrated).

📊 Migration Summary

| Component | ev2.py Location | ev2_service_standalone.py Location | Status |
|---|---|---|---|
| LLM Creation | Lines 54-58 | IntegratedEV2Agent._create_llm() | ✅ Exact replica |
| Agent Creation | Lines 60-73 | IntegratedEV2Agent._create_agent() | ✅ Exact replica |
| Task Building | Lines 104-204 | IntegratedEV2Agent._build_task_message() | ✅ Exact replica |
| Conversation | Line 76 | analyze_generation() | ✅ Same API usage |
| Send/Run | Lines 85-91 | analyze_generation() | ✅ Same API usage |
| Workspace | Line 41 | __init__() | ✅ Same path logic |
| Error Handling | Lines 130-136 | _build_task_message() | ✅ Same try-except |
| Print Logs | Lines 44-100 | Converted to logging | ✅ More professional |

🔍 Key Differences (Improvements)

  1. Agent Lifecycle: Agent instance can be reused (no recreation each time)
  2. State Management: Integrated with service state
  3. Logging: Uses Python logging instead of print
  4. Error Handling: More robust, service doesn't crash
  5. Configuration: Unified config system
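The lifecycle, logging, and error-handling differences above can be illustrated with a minimal sketch. The `AgentHolder` class and its `factory` argument are hypothetical names used only for illustration; they are not taken from ev2_service_standalone.py, and the real service may structure this differently.

```python
import logging

logger = logging.getLogger("ev2_service")


class AgentHolder:
    """Hypothetical holder: builds the agent once, then reuses it."""

    def __init__(self, factory):
        self._factory = factory  # callable that returns a ready agent
        self._agent = None
        self.builds = 0          # how many times the agent was created

    def get(self):
        # Create the agent lazily on first use and cache it afterwards.
        if self._agent is None:
            logger.info("Creating agent")
            self._agent = self._factory()
            self.builds += 1
        return self._agent

    def run(self, generation):
        # Log failures instead of letting them take down the service.
        try:
            return self.get()(generation)
        except Exception:
            logger.exception("Agent run failed for generation %d", generation)
            return None
```

With this shape, repeated calls to `run()` share one agent instance, and a failing run is logged with its traceback and returns `None` rather than raising.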

🎯 What Was Preserved (100% Compatibility)

  1. ✅ Exact same LLM configuration (model, api_key, base_url from env vars)
  2. ✅ Exact same tools (Terminal, FileEditor, TaskTracker)
  3. ✅ Exact same prompt template (ev2_prompt.j2)
  4. ✅ Exact same task message format (all text, structure preserved)
  5. ✅ Exact same workspace path (results_dir/eval_agent_memory)
  6. ✅ Exact same file generation (EVAL_AGENTS.md, auxiliary_metrics.py)
  7. ✅ Exact same Conversation API usage
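As a rough illustration of items 1 and 5, the sketch below reads the LLM settings from environment variables and derives the workspace path. `LLM_MODEL` and `LLM_API_KEY` are the variable names this document mentions; `LLM_BASE_URL` is an assumed name for the base URL variable, and both helper functions are hypothetical.

```python
import os
from pathlib import Path


def load_llm_settings(env=os.environ):
    """Collect LLM settings from environment variables (hypothetical helper)."""
    return {
        "model": env.get("LLM_MODEL"),
        "api_key": env.get("LLM_API_KEY"),
        # Assumption: the variable name for base_url is not given in this document.
        "base_url": env.get("LLM_BASE_URL"),
    }


def workspace_for(results_dir):
    """Workspace path as described above: results_dir/eval_agent_memory."""
    return Path(results_dir) / "eval_agent_memory"
```

Comparing the output of these helpers between the old and new service is one quick way to confirm that configuration and workspace location did not drift during the migration.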

🧪 Testing Checklist

  • Service starts without errors
  • Agent initialization successful
  • Generation notifications work
  • Agent triggers at correct intervals
  • Agent generates EVAL_AGENTS.md
  • Agent generates auxiliary_metrics.py
  • Service state persists correctly
  • Manual trigger works
  • Error handling works (graceful failures)

🚀 Testing Instructions

Step 1: Start the Standalone Service

cd /home/tengxiao/pj/ShinkaEvolve

# Make sure old service is stopped
pkill -f "ev2_service"

# Start new standalone service
python eval_agent/ev2_service_standalone.py \
    --config eval_agent/ev2_service_config.yaml

Expected output:

```
✅ IntegratedEV2Agent Initialized

Results Dir: /path/to/results
Workspace: /path/to/results/eval_agent_memory
Primary Evaluator: /path/to/evaluate_ori.py

🤖 Creating LLM: vertex_ai/gemini-2.5-flash
📋 Loading prompt: /path/to/ev2_prompt.j2
✅ Agent created
✅ Integrated EV2 Agent ready

✅ Service Started
Experiment: circle_packing_NO_vision
Results dir: ...
Trigger mode: periodic
Trigger interval: 10

INFO: Uvicorn running on http://0.0.0.0:8765
```

### Step 2: Test Service Status

```bash
# In another terminal
cd /home/tengxiao/pj/ShinkaEvolve

python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode status
```

Expected output:

🔍 Testing service status...
✅ Service is running!
   Uptime: X.Xs
   Trigger mode: periodic
   Trigger interval: 10

Step 3: Simulate Evolution (Small Test)

# Test with just 12 generations (will trigger once at gen 10)
python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode simulate \
    --num-gens 12

Expected behavior:

Gen 0-9:  → SKIP (fast, ~0.1s each)
Gen 10:   → TRIGGER (slow, ~60-240s, agent runs)
Gen 11:   → SKIP (fast)
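The expected behavior is consistent with a simple periodic rule. The sketch below is an assumption about how such a rule could look, not the service's actual implementation; `should_trigger` is a hypothetical name, and skipping generation 0 is inferred from the table above.

```python
def should_trigger(generation: int, interval: int = 10) -> bool:
    """Return True when a periodic trigger would fire (assumed rule)."""
    # Generation 0 is skipped, matching "Gen 0-9: SKIP" above.
    return generation > 0 and generation % interval == 0


# Reproduce the 12-generation simulation as a generation -> decision map.
decisions = {g: ("TRIGGER" if should_trigger(g) else "SKIP") for g in range(12)}
```

Under this rule, a 12-generation run (generations 0 to 11) fires exactly once, at generation 10.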

Check outputs:

# Check service state
cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/service_state.json

# Should show:
# - total_notifications: 12
# - total_agent_runs: 1
# - last_agent_trigger_gen: 10

# Check agent outputs
ls -la examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/

# Should have:
# - EVAL_AGENTS.md (updated)
# - auxiliary_metrics.py (created/updated)
# - service_state.json (new)
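The manual checks above can also be scripted. This is a minimal sketch that assumes the three counters are top-level keys in service_state.json, which the document implies but does not confirm; `check_state` and `missing_outputs` are hypothetical helper names.

```python
import json
from pathlib import Path

# Values expected after the 12-generation simulation described above.
EXPECTED = {
    "total_notifications": 12,
    "total_agent_runs": 1,
    "last_agent_trigger_gen": 10,
}

OUTPUT_FILES = ["EVAL_AGENTS.md", "auxiliary_metrics.py", "service_state.json"]


def check_state(workspace, expected=EXPECTED):
    """Return {key: (actual, expected)} for every mismatching counter."""
    state = json.loads((Path(workspace) / "service_state.json").read_text())
    # Assumption: counters live at the top level of the JSON object.
    return {k: (state.get(k), v) for k, v in expected.items() if state.get(k) != v}


def missing_outputs(workspace):
    """Return the expected output files that are absent from the workspace."""
    return [name for name in OUTPUT_FILES if not (Path(workspace) / name).is_file()]
```

Both helpers return an empty result when everything matches, so passing the eval_agent_memory directory to each gives a quick pass or fail.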

Step 4: Verify Agent Output Quality

# Check that EVAL_AGENTS.md has new content
tail -50 examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/EVAL_AGENTS.md

# Check that auxiliary_metrics.py is valid Python
python -m py_compile examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/auxiliary_metrics.py
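A syntax check alone does not show whether the generated module defines anything useful. As a complementary sketch, the standard `ast` module can parse the source and list its top-level functions; `top_level_functions` is a hypothetical helper, and the function names to expect in auxiliary_metrics.py are not specified here.

```python
import ast


def top_level_functions(source: str) -> list[str]:
    """Parse Python source and return the names of top-level functions."""
    # ast.parse raises SyntaxError if the source is not valid Python.
    tree = ast.parse(source)
    return [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
```

Reading auxiliary_metrics.py and passing its text to this function either raises `SyntaxError` or returns the function names, which can then be compared against what EVAL_AGENTS.md describes.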

Step 5: Test Manual Trigger

curl -X POST "http://localhost:8765/api/v1/trigger/manual?generation=5"

This should trigger the agent for generation 5, provided that generation exists in the history.
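The same request can be issued from Python using only the standard library. The endpoint and query parameter are taken from the curl command above; the helper names and the 300-second timeout are assumptions (the timeout is chosen to cover the ~60-240s agent run mentioned earlier, in case the endpoint blocks until the run finishes).

```python
import urllib.parse
import urllib.request


def build_manual_trigger(generation, base_url="http://localhost:8765"):
    """Build the POST request used for a manual trigger (hypothetical helper)."""
    query = urllib.parse.urlencode({"generation": generation})
    url = f"{base_url}/api/v1/trigger/manual?{query}"
    return urllib.request.Request(url, method="POST")


def send(request, timeout=300):
    """Send the request; return the HTTP status code and the response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling `send(build_manual_trigger(5))` while the service is running mirrors the curl command; the response format is not documented here, so the body is returned as plain text.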


🐛 Troubleshooting

Issue: "Agent not initialized"

Symptom: Service starts but agent triggers fail

Check:

# Look for this in startup logs:
# ❌ Failed to initialize agent: ...

Common causes:

  1. Primary evaluator path wrong → Check primary_evaluator in config
  2. LLM config wrong → Check env vars: LLM_MODEL, LLM_API_KEY
  3. ev2_prompt.j2 missing → Check file exists in eval_agent/

Fix:

# Verify primary evaluator exists
ls -la examples/circle_packing/evaluate_ori.py

# Verify prompt exists
ls -la eval_agent/ev2_prompt.j2

# Check LLM env vars
echo $LLM_MODEL
echo $LLM_API_KEY
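These checks can be bundled into a small preflight script run before starting the service. The sketch below uses the file paths and environment variable names listed above; the `preflight` function itself is hypothetical, and the evaluator path applies to the circle_packing example only.

```python
import os
from pathlib import Path

# Paths relative to the repository root, as used in the commands above.
REQUIRED_FILES = (
    "examples/circle_packing/evaluate_ori.py",
    "eval_agent/ev2_prompt.j2",
)
REQUIRED_ENV = ("LLM_MODEL", "LLM_API_KEY")


def preflight(repo_root=".", env=os.environ):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for rel in REQUIRED_FILES:
        if not (Path(repo_root) / rel).is_file():
            problems.append(f"missing file: {rel}")
    for var in REQUIRED_ENV:
        # Treat an empty value the same as an unset variable.
        if not env.get(var):
            problems.append(f"unset env var: {var}")
    return problems
```

Printing the result of `preflight()` from the repository root lists every failing cause at once, without echoing the API key to the terminal.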

Issue: Agent runs but produces no output

Symptom: Agent completes but EVAL_AGENTS.md is empty or not updated

Check:

  1. Workspace permissions
  2. Agent logs (look for errors during run)
  3. LLM API connectivity

Issue: Service crashes on agent trigger

Symptom: Service stops when trying to run agent

Check:

  1. Look at full error traceback
  2. Check if OpenHands SDK version is compatible
  3. Verify all dependencies installed

✅ Success Criteria

The migration is successful if:

  1. ✅ Service starts without errors
  2. ✅ Agent initializes (no "Agent not initialized" errors)
  3. ✅ Agent triggers at correct generations (10, 20, 30...)
  4. ✅ Agent generates EVAL_AGENTS.md with meaningful content
  5. ✅ Agent generates auxiliary_metrics.py with valid Python code
  6. ✅ Service state persists across notifications
  7. ✅ No crashes or fatal errors during agent runs

📝 Next Steps After Verification

Once all tests pass:

  1. Update documentation to point to standalone version
  2. Archive old version: Rename ev2_service.py to ev2_service_wrapper_old.py
  3. Update test scripts to use standalone by default
  4. Integrate with ShinkaEvolve: Add notification code to EvolutionRunner
  5. Production deployment: Add systemd service, monitoring, etc.

🎉 Migration Benefits

Performance

  • ✅ Agent can be reused (no recreation overhead)
  • ✅ Faster startup (agent pre-initialized)

Maintainability

  • ✅ Single codebase (no wrapper layer)
  • ✅ Clearer architecture
  • ✅ Easier to debug

Extensibility

  • ✅ Ready for MetricUnit integration
  • ✅ Ready for Lifecycle management
  • ✅ Ready for async meta-cognition

Reliability

  • ✅ Better error handling
  • ✅ Doesn't depend on subprocess calls
  • ✅ Unified state management