# EV2 Service Integration - Testing Guide

## 🎯 Testing Strategy

We use a **progressive testing** strategy so that each step is verified before moving to the next:

```
Phase 1: Basic functionality ✓
  └─ Config loading, method exists

Phase 2: Infrastructure ← current phase
  ├─ Service health check
  ├─ Notification mechanism
  └─ No-side-effect verification

Phase 3: Result consistency
  ├─ Run without service (baseline)
  ├─ Run with service (passive mode)
  └─ Compare results (should be identical)

Phase 4: Full integration
  └─ Enable agent, verify auxiliary metric generation
```

---

## 📋 Phase 1: Basic Functionality Tests ✅

**Goal**: Verify that the code changes are correct and do not break existing functionality.

### Run the tests

```bash
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_basic.py
```

### Expected output

```
============================================================
EV2 Service Integration - Basic Tests
============================================================

Test 1: Backward compatibility (default config)...
✅ Default config: eval_service_url=None

Test 2: Enable eval service...
✅ Config with service: eval_service_url='http://localhost:8765'

Test 3: Set via kwargs...
✅ Kwargs config works correctly

Test 4: _notify_eval_service method exists...
✅ _notify_eval_service method exists
   - Parameters: ['self', 'generation', 'combined_score', 'results_dir']

============================================================
✅ All basic integration tests passed!
============================================================
```

**✅ Done!**

---

## 📋 Phase 2: Infrastructure Tests

**Goal**: Verify that the notification mechanism works without triggering the agent (no side effects).

### Step 1: Start the service (passive mode)

```bash
# Terminal 1
cd /home/tengxiao/pj/ShinkaEvolve

# Use the passive config (never triggers the agent)
uv run eval_agent/ev2_service_standalone.py \
    --config eval_agent/ev2_service_config_passive.yaml
```

**Passive mode characteristics:**
- ✅ Receives notifications
- ✅ Records state
- ❌ Does not trigger the agent (interval=999999)
- ✅ Zero side effects

### Step 2: Run the infrastructure tests

```bash
# Terminal 2
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_step_by_step.py
```

### Expected output

```
======================================================================
🧪 EV2 SERVICE INTEGRATION - STEP BY STEP TESTING
======================================================================

============================================================
TEST 1: Service Health Check
============================================================
✅ Service is running
   Status: ready
   Generations processed: 0

============================================================
TEST 2: Notification Mechanism
============================================================
✅ Notification sent successfully
   Response: {
     "status": "received",
     "generation": 1,
     ...
   }

============================================================
TEST 3: Service State After Notifications
============================================================
✅ Service state retrieved
   Total generations: 1
   Agent triggered: 0 times   ← key point: the agent is not triggered
   Last generation: 1

============================================================
TEST 4: Mini Evolution WITHOUT Service (Baseline)
============================================================
📁 Results dir: /tmp/test_shinka_baseline
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
   - eval_service_url: None
   - results_dir: /tmp/test_shinka_baseline

============================================================
TEST 5: Mini Evolution WITH Service (Should be Identical)
============================================================
📁 Results dir: /tmp/test_shinka_with_service
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
   - eval_service_url: http://localhost:8765
   - results_dir: /tmp/test_shinka_with_service
✅ Service URL correctly configured

======================================================================
📊 TEST SUMMARY
======================================================================
✅ PASS Service Health
✅ PASS Notification Mechanism
✅ PASS Service State Check
✅ PASS Evolution WITHOUT Service
✅ PASS Evolution WITH Service

======================================================================
🎉 All tests passed! Integration is working correctly.
======================================================================
```

### What to verify

- ✅ The service receives notifications
- ✅ `agent_triggered_count = 0` (the agent was never triggered)
- ✅ Initialization succeeds in both modes
- ✅ Configuration is passed through correctly

---

## 📋 Phase 3: Result Consistency Tests

**Goal**: Verify that evolution results are identical with and without the service.

### Step 1: Prepare the test experiment

Pick a **known, reproducible** experiment:

```python
# test_consistency.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig


def run_experiment(with_service=False, run_id="baseline"):
    """Run a small experiment."""
    results_dir = f"/tmp/consistency_test_{run_id}"

    evo_config = EvolutionConfig(
        num_generations=10,  # Small but meaningful
        max_parallel_jobs=2,
        results_dir=results_dir,
        # ... your actual config ...
        eval_service_url="http://localhost:8765" if with_service else None,
    )

    # ... rest of your config (must define job_config and db_config) ...

    runner = EvolutionRunner(evo_config, job_config, db_config)
    runner.run()

    return results_dir


# Run both
baseline_dir = run_experiment(with_service=False, run_id="baseline")
with_service_dir = run_experiment(with_service=True, run_id="with_service")

print(f"Baseline: {baseline_dir}")
print(f"With service: {with_service_dir}")
```

### Step 2: Run the experiments

```bash
# Terminal 1: Service (passive mode)
uv run eval_agent/ev2_service_standalone.py \
    --config eval_agent/ev2_service_config_passive.yaml

# Terminal 2: Run experiments
uv run test_consistency.py
```

### Step 3: Compare the results

```bash
# Compare database
sqlite3 /tmp/consistency_test_baseline/evolution.db \
    "SELECT generation, combined_score FROM programs ORDER BY generation"

sqlite3 /tmp/consistency_test_with_service/evolution.db \
    "SELECT generation, combined_score FROM programs ORDER BY generation"

# Should be IDENTICAL (or very close due to randomness)
```

### Expected results

- ✅ Both experiments show the same `combined_score` trajectory (if the random seed is fixed)
- ✅ Same number of programs
- ✅ Similar run time (difference < 1%)
- ✅ Service logs show notifications received but no agent trigger

---

## 📋 Phase 4: Full Integration Tests

**Goal**: Enable the agent and verify that auxiliary metrics are generated.

### Step 1: Configure the agent trigger

```bash
# Edit eval_agent/ev2_service_config.yaml
# Set a reasonable trigger interval
```

```yaml
trigger_strategy:
  type: "periodic"
  interval: 5  # Trigger once every 5 generations
```

### Step 2: Prepare the primary evaluator

Make sure the path to your primary evaluator is correct:

```yaml
primary_evaluator:
  path: "/home/tengxiao/pj/ShinkaEvolve/examples/circle_packing/evaluate_ori.py"
```

### Step 3: Start the service (active mode)

```bash
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
    --config eval_agent/ev2_service_config.yaml
```

### Step 4: Run the experiment

```bash
# Terminal 2
uv run my/experiment_with_eval_service.py
```

### Expected behavior

**Generation 1-4:**
```
Service: ✅ Generation 1 completed (score: 0.50)
Service: ⏳ Not triggering (interval=5, current=1)
Service: ✅ Generation 2 completed (score: 0.52)
Service: ⏳ Not triggering (interval=5, current=2)
...
```

**Generation 5:**
```
Service: ✅ Generation 5 completed (score: 0.58)
Service: 🎯 Trigger condition met (periodic: interval=5)
Service: 🤖 Launching agent...
Agent: 📊 Analyzing 5 generations...
Agent: 🔍 Reading primary evaluator...
Agent: 💡 Generating auxiliary metrics...
Agent: ✅ Created aux_metrics.py
Service: ✅ Agent completed in 45.2s
Service: 📄 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
```

**Generation 6-9:**
```
Service: ⏳ Not triggering...
```

**Generation 10:**
```
Service: 🎯 Trigger condition met
Service: 🤖 Launching agent...
...
```

### Verify the output

```bash
# Check the agent output
ls -la results_dir/eval_agent_memory/

# You should see:
# - EVAL_AGENTS.md
# - aux_metrics.py
# - workspace/

# Read the analysis report
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# Verify that the auxiliary metrics compile
python -m py_compile results_dir/eval_agent_memory/aux_metrics.py
```

---

## 🧪 Full Test Script (Real Experiment)

### Using the existing Circle Packing experiment

```python
# eval_agent/test_real_integration.py
"""
Real integration test using Circle Packing example.
"""

import argparse
import shutil
from pathlib import Path

# Your existing imports
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig


def run_circle_packing_test(with_eval_service=False):
    """
    Run circle packing with/without eval service.

    Args:
        with_eval_service: Enable eval service integration
    """
    # Results directory
    suffix = "with_service" if with_eval_service else "baseline"
    results_dir = Path(f"/tmp/circle_packing_integration_test_{suffix}")

    # Clean previous run
    if results_dir.exists():
        shutil.rmtree(results_dir)
    results_dir.mkdir(parents=True)

    print("=" * 60)
    print(f"Running Circle Packing {'WITH' if with_eval_service else 'WITHOUT'} Eval Service")
    print(f"Results: {results_dir}")
    print("=" * 60)

    # Configuration
    evolution_config = EvolutionConfig(
        num_generations=10,  # Small for testing
        max_parallel_jobs=2,
        results_dir=str(results_dir),
        init_program_path="examples/circle_packing/initial.py",
        # Eval service (conditional)
        eval_service_url="http://localhost:8765" if with_eval_service else None,
        # ... rest of your config ...
    )

    job_config = LocalJobConfig(
        eval_program_path="examples/circle_packing/evaluate_ori.py",
    )

    db_config = DatabaseConfig()

    # Run
    runner = EvolutionRunner(
        evo_config=evolution_config,
        job_config=job_config,
        db_config=db_config,
        verbose=True,
    )
    runner.run()

    print(f"\n✅ Completed: {results_dir}")
    return results_dir


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=["baseline", "with-service", "both"],
        default="baseline",
        help="Test mode",
    )
    args = parser.parse_args()

    if args.mode in ["baseline", "both"]:
        baseline_dir = run_circle_packing_test(with_eval_service=False)
        print(f"\n📊 Baseline results: {baseline_dir}")

    if args.mode in ["with-service", "both"]:
        service_dir = run_circle_packing_test(with_eval_service=True)
        print(f"\n📊 With-service results: {service_dir}")

        # Check for agent output
        agent_memory = service_dir / "eval_agent_memory"
        if agent_memory.exists():
            print("\n✅ Agent memory found:")
            for f in agent_memory.iterdir():
                print(f"  - {f.name}")
        else:
            print("\n⚠️ No agent memory (agent not triggered yet?)")
```

### Run the full test

```bash
# Terminal 1: Service (active mode, interval=5)
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Terminal 2: Baseline only
uv run eval_agent/test_real_integration.py --mode baseline

# Terminal 2: With service only
uv run eval_agent/test_real_integration.py --mode with-service

# Terminal 2: Both (for comparison)
uv run eval_agent/test_real_integration.py --mode both
```

---

## ✅ Verification Checklist

### Phase 2: Infrastructure
- [ ] Service starts successfully (passive mode)
- [ ] Notification is sent successfully
- [ ] Service receives the notification
- [ ] `agent_triggered_count = 0` (passive mode)
- [ ] Initialization succeeds both with and without the service

### Phase 3: Result consistency
- [ ] Baseline experiment completes
- [ ] With-service experiment completes
- [ ] The `combined_score` trajectories of the two runs are identical or close
- [ ] Run time difference < 1%
- [ ] Service logs show that all notifications were received

### Phase 4: Full integration
- [ ] Service starts (active mode)
- [ ] Agent triggers at the expected generations (gen 5, 10, ...)
- [ ] `EVAL_AGENTS.md` is generated
- [ ] `aux_metrics.py` is generated and syntactically valid
- [ ] Primary metric is not modified
- [ ] Evolution completes normally

---

## 🐛 Troubleshooting

### Service does not receive notifications

**Check:**
```bash
# Is the service running?
curl http://localhost:8765/api/v1/status

# Check the runner.py log
grep "Notified eval service" results_dir/evolution_run.log
grep "Failed to notify eval service" results_dir/evolution_run.log
```

### Notification is sent but there is no response

**Possible causes:**
- The service crashed (check Terminal 1)
- The port is already in use (check `netstat -tuln | grep 8765`)
- Network problems (firewall?)

### Agent does not trigger

**Check:**
1. Service mode: are you using `ev2_service_config.yaml` or `ev2_service_config_passive.yaml`?
2. Interval setting: is it too large (999999)?
3. Number of generations: is it smaller than the interval?

### Results are inconsistent

**Normal:**
- Evolution with randomness: results differ slightly
- LLM calls: output may differ on every call

**Abnormal:**
- Score difference > 10%: check whether the agent modified the primary evaluator
- Run time difference > 5%: check for network latency or timeouts

---

## 📊 Current Progress

```
✅ Phase 1: Basic functionality tests (done)
🔄 Phase 2: Infrastructure tests (in progress)
⏳ Phase 3: Result consistency tests
⏳ Phase 4: Full integration tests
```

**Next step**: Run the Phase 2 tests

```bash
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
    --config eval_agent/ev2_service_config_passive.yaml

# Terminal 2
uv run eval_agent/test_integration_step_by_step.py
```
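
The manual `sqlite3` comparison in Phase 3 could also be automated. The following is a minimal sketch, not part of the project: it assumes the `programs` table and the `generation` and `combined_score` columns used in the Phase 3 queries, while the function names `load_scores` and `compare_scores` and the tolerance `tol` are hypothetical choices made for illustration.

```python
import sqlite3


def load_scores(db_path):
    """Return [(generation, combined_score), ...] from an evolution database.

    Assumes the `programs` table with `generation` and `combined_score`
    columns, as used in the Phase 3 sqlite3 queries. Rows are also ordered
    by score so that several programs in one generation compare stably.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT generation, combined_score FROM programs "
            "ORDER BY generation, combined_score"
        ).fetchall()
    finally:
        conn.close()
    return rows


def compare_scores(baseline_rows, service_rows, tol=1e-9):
    """Return a list of differences; an empty list means the runs match.

    `tol` is a hypothetical absolute tolerance for floating-point scores.
    """
    diffs = []
    if len(baseline_rows) != len(service_rows):
        diffs.append(
            f"program count differs: {len(baseline_rows)} vs {len(service_rows)}"
        )
    for index, (base, svc) in enumerate(zip(baseline_rows, service_rows)):
        gen_a, score_a = base
        gen_b, score_b = svc
        if gen_a != gen_b:
            diffs.append(f"row {index}: generation {gen_a} vs {gen_b}")
        elif score_a is None or score_b is None:
            # A missing score only matches another missing score.
            if score_a != score_b:
                diffs.append(f"generation {gen_a}: score {score_a} vs {score_b}")
        elif abs(score_a - score_b) > tol:
            diffs.append(f"generation {gen_a}: score {score_a} vs {score_b}")
    return diffs
```

To use it, call `compare_scores(load_scores("/tmp/consistency_test_baseline/evolution.db"), load_scores("/tmp/consistency_test_with_service/evolution.db"))` after both Phase 3 runs finish. An empty list suggests the passive service did not affect the results. If the runs are not seeded, a looser `tol` or a manual review of the reported differences is probably more appropriate than expecting an exact match.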