| # EV2 Service Integration - Testing Guide |
|
|
| ## 🎯 Testing Strategy |
|
| We use a **progressive testing** strategy so that each step is verified before the next one starts: |
|
|
| ``` |
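| Phase 1: Basic functionality ✓ |
|   └─ Config loading, method presence |
| |
| Phase 2: Infrastructure ← current phase |
|   ├─ Service health check |
|   ├─ Notification mechanism |
|   └─ No-side-effect verification |
| |
| Phase 3: Result consistency |
|   ├─ No service running (baseline) |
|   ├─ Service running (passive mode) |
|   └─ Compare results (should be identical) |
| |
| Phase 4: Full integration |
|   └─ Enable the agent, verify that auxiliary metrics are generated |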
| ``` |
|
|
| --- |
|
|
| ## 📋 Phase 1: Basic Functionality Tests ✅ |
|
| **Goal**: Verify that the code changes are correct and do not break existing functionality |
|
| ### Run the tests |
|
|
| ```bash |
| cd /home/tengxiao/pj/ShinkaEvolve |
| uv run eval_agent/test_integration_basic.py |
| ``` |
|
|
| ### Expected output |
|
|
| ``` |
| ============================================================ |
| EV2 Service Integration - Basic Tests |
| ============================================================ |
| Test 1: Backward compatibility (default config)... |
| ✅ Default config: eval_service_url=None |
|
|
| Test 2: Enable eval service... |
| ✅ Config with service: eval_service_url='http://localhost:8765' |
|
|
| Test 3: Set via kwargs... |
| ✅ Kwargs config works correctly |
|
|
| Test 4: _notify_eval_service method exists... |
| ✅ _notify_eval_service method exists |
| - Parameters: ['self', 'generation', 'combined_score', 'results_dir'] |
|
|
| ============================================================ |
| ✅ All basic integration tests passed! |
| ============================================================ |
| ``` |
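Test 4 above can be reproduced in a few lines with `inspect`. This is a minimal sketch, not the project's actual test: `FakeRunner` and `notify_method_params` are hypothetical stand-ins (the real check would presumably target `EvolutionRunner`), and only the parameter names are taken from the expected output.

```python
import inspect

# Parameter names as printed by Test 4 in the expected output.
EXPECTED_PARAMS = ["self", "generation", "combined_score", "results_dir"]


class FakeRunner:
    """Hypothetical stand-in for the real runner class."""

    def _notify_eval_service(self, generation, combined_score, results_dir):
        """Placeholder body; the real method would contact the service."""
        return None


def notify_method_params(cls):
    """Return the parameter names of cls._notify_eval_service, or None if absent."""
    method = getattr(cls, "_notify_eval_service", None)
    if method is None:
        return None
    return list(inspect.signature(method).parameters)


if __name__ == "__main__":
    print(f"Parameters: {notify_method_params(FakeRunner)}")
```

Looking the method up on the class rather than on an instance keeps `self` in the reported parameter list, which matches the format shown in the expected output.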
| |
| **✅ Done!** |
| |
| --- |
| |
| ## 📋 Phase 2: Infrastructure Tests |
| |
| **Goal**: Verify that the notification mechanism works without triggering the agent (no side effects) |
| |
| ### Step 1: Start the Service (Passive Mode) |
| |
| ```bash |
| # Terminal 1 |
| cd /home/tengxiao/pj/ShinkaEvolve |
|
|
| # Use the passive config (does not trigger the agent) |
| uv run eval_agent/ev2_service_standalone.py \ |
|   --config eval_agent/ev2_service_config_passive.yaml |
| ``` |
| |
| **Passive mode characteristics:** |
| - ✅ Receives notifications |
| - ✅ Records state |
| - ❌ Does not trigger the agent (interval=999999) |
| - ✅ Zero side effects |
| |
| ### Step 2: Run the infrastructure tests |
| |
| ```bash |
| # Terminal 2 |
| cd /home/tengxiao/pj/ShinkaEvolve |
| uv run eval_agent/test_integration_step_by_step.py |
| ``` |
| |
| ### Expected output |
| |
| ``` |
| ====================================================================== |
| 🧪 EV2 SERVICE INTEGRATION - STEP BY STEP TESTING |
| ====================================================================== |
|
|
| ============================================================ |
| TEST 1: Service Health Check |
| ============================================================ |
| ✅ Service is running |
| Status: ready |
| Generations processed: 0 |
|
|
| ============================================================ |
| TEST 2: Notification Mechanism |
| ============================================================ |
| ✅ Notification sent successfully |
| Response: { |
| "status": "received", |
| "generation": 1, |
| ... |
| } |
| |
| ============================================================ |
| TEST 3: Service State After Notifications |
| ============================================================ |
| ✅ Service state retrieved |
| Total generations: 1 |
| Agent triggered: 0 times ← key point: the agent is not triggered |
| Last generation: 1 |
|
|
| ============================================================ |
| TEST 4: Mini Evolution WITHOUT Service (Baseline) |
| ============================================================ |
| 📁 Results dir: /tmp/test_shinka_baseline |
| 🚀 Starting evolution (3 generations)... |
| ✅ Evolution runner initialized successfully |
| - eval_service_url: None |
| - results_dir: /tmp/test_shinka_baseline |
| |
| ============================================================ |
| TEST 5: Mini Evolution WITH Service (Should be Identical) |
| ============================================================ |
| 📁 Results dir: /tmp/test_shinka_with_service |
| 🚀 Starting evolution (3 generations)... |
| ✅ Evolution runner initialized successfully |
| - eval_service_url: http://localhost:8765 |
| - results_dir: /tmp/test_shinka_with_service |
| ✅ Service URL correctly configured |
|
|
| ====================================================================== |
| 📊 TEST SUMMARY |
| ====================================================================== |
| ✅ PASS Service Health |
| ✅ PASS Notification Mechanism |
| ✅ PASS Service State Check |
| ✅ PASS Evolution WITHOUT Service |
| ✅ PASS Evolution WITH Service |
| ====================================================================== |
| 🎉 All tests passed! Integration is working correctly. |
| ====================================================================== |
| ``` |
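Test 2 above exercises the notification call. The sketch below shows one way a failure-tolerant notifier could look, assuming the service accepts a JSON POST. The endpoint path `/api/v1/notify`, the payload keys, and the helper names are assumptions made for illustration; only the three values (`generation`, `combined_score`, `results_dir`) come from this guide.

```python
import http.client
import json
import urllib.error
import urllib.request


def build_payload(generation, combined_score, results_dir):
    """Assemble the JSON body; key names simply mirror the method parameters."""
    return {
        "generation": int(generation),
        "combined_score": float(combined_score),
        "results_dir": str(results_dir),
    }


def notify_eval_service(base_url, generation, combined_score, results_dir, timeout=2.0):
    """POST one notification; return the decoded reply, or None on any failure.

    Errors are swallowed on purpose so that a missing or broken service
    never interrupts the evolution run.
    """
    if not base_url:
        return None  # Service disabled (eval_service_url=None)
    body = json.dumps(build_payload(generation, combined_score, results_dir))
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/notify",  # hypothetical path
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return None
```

Returning `None` instead of raising is what keeps the baseline and with-service runs comparable: the evolution loop behaves the same whether or not the service answers.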
| |
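| ### What to verify |
| |
| - ✅ The service receives notifications |
| - ✅ `agent_triggered_count = 0` (the agent was never triggered) |
| - ✅ Initialization succeeds in both modes |
| - ✅ Configuration is passed through correctly |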
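These checks can also be asserted programmatically once the status payload has been fetched from `/api/v1/status`. The sketch assumes the payload is a JSON object with `status` and `agent_triggered_count` keys; the key `total_generations` and the helper name `check_passive_status` are hypothetical, so adjust them to whatever the service actually returns.

```python
def check_passive_status(status, expected_generations):
    """Return a list of problems found in a passive-mode status payload.

    An empty list means the payload looks like a healthy passive service.
    """
    problems = []
    if status.get("status") != "ready":
        problems.append(f"service not ready: {status.get('status')!r}")
    # Hypothetical key: number of notifications the service has recorded.
    if status.get("total_generations") != expected_generations:
        problems.append(
            f"expected {expected_generations} generations, "
            f"got {status.get('total_generations')!r}"
        )
    # In passive mode the agent must never have been launched.
    if status.get("agent_triggered_count") != 0:
        problems.append(
            f"agent_triggered_count is {status.get('agent_triggered_count')!r}, expected 0"
        )
    return problems


if __name__ == "__main__":
    sample = {"status": "ready", "total_generations": 1, "agent_triggered_count": 0}
    print(check_passive_status(sample, expected_generations=1))
```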
| |
| --- |
| |
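| ## 📋 Phase 3: Result Consistency Tests |
| |
| **Goal**: Verify that evolution results are identical with and without the service |
| |
| ### Step 1: Prepare the test experiment |
| |
| Pick a **known, reproducible** experiment: |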
| |
| ```python |
| # test_consistency.py |
| """Run the same small experiment with and without the eval service.""" |
| from shinka.core import EvolutionRunner, EvolutionConfig |
| from shinka.launch import LocalJobConfig |
| from shinka.database import DatabaseConfig |
| |
| |
| def run_experiment(with_service=False, run_id="baseline"): |
|     """Run a small experiment and return its results directory.""" |
|     results_dir = f"/tmp/consistency_test_{run_id}" |
| |
|     evo_config = EvolutionConfig( |
|         num_generations=10,  # Small but meaningful |
|         max_parallel_jobs=2, |
|         results_dir=results_dir, |
|         # ... your actual config ... |
|         eval_service_url="http://localhost:8765" if with_service else None, |
|     ) |
| |
|     # ... rest of your config: job_config (a LocalJobConfig) and |
|     # db_config (a DatabaseConfig) must be defined here ... |
| |
|     runner = EvolutionRunner(evo_config, job_config, db_config) |
|     runner.run() |
| |
|     return results_dir |
| |
| |
| if __name__ == "__main__": |
|     # Run both variants back to back |
|     baseline_dir = run_experiment(with_service=False, run_id="baseline") |
|     with_service_dir = run_experiment(with_service=True, run_id="with_service") |
| |
|     print(f"Baseline: {baseline_dir}") |
|     print(f"With service: {with_service_dir}") |
| ``` |
| |
| ### Step 2: Run the experiments |
| |
| ```bash |
| # Terminal 1: Service (passive mode) |
| uv run eval_agent/ev2_service_standalone.py \ |
| --config eval_agent/ev2_service_config_passive.yaml |
|
|
| # Terminal 2: Run experiments |
| uv run test_consistency.py |
| ``` |
| |
| ### Step 3: Compare the results |
| |
| ```bash |
| # Compare database |
| sqlite3 /tmp/consistency_test_baseline/evolution.db \ |
| "SELECT generation, combined_score FROM programs ORDER BY generation" |
|
|
| sqlite3 /tmp/consistency_test_with_service/evolution.db \ |
| "SELECT generation, combined_score FROM programs ORDER BY generation" |
|
|
| # Should be IDENTICAL (or very close due to randomness) |
| ``` |
| |
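| ### Expected results |
| |
| - ✅ The `combined_score` trajectories of the two runs are identical (provided the random seed is fixed) |
| - ✅ Same number of programs |
| - ✅ Similar run time (difference < 1%) |
| - ✅ The service log shows notifications received but no agent trigger |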
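Comparing the two `sqlite3` dumps by eye gets tedious beyond a few generations. Below is a minimal sketch that automates the comparison, assuming the `programs` table and the two columns used in the queries above. The helper names and the extra `combined_score` sort key (added so that rows within one generation line up deterministically) are choices made for this sketch, not part of the project.

```python
import math
import sqlite3

QUERY = (
    "SELECT generation, combined_score FROM programs "
    "ORDER BY generation, combined_score"
)


def load_scores(db_path):
    """Return [(generation, combined_score), ...] from one evolution.db."""
    connection = sqlite3.connect(db_path)
    try:
        return connection.execute(QUERY).fetchall()
    finally:
        connection.close()


def compare_trajectories(baseline, with_service, rel_tol=0.0):
    """Return a list of human-readable differences; empty means consistent."""
    differences = []
    if len(baseline) != len(with_service):
        differences.append(
            f"program count differs: {len(baseline)} vs {len(with_service)}"
        )
    for index, (row_a, row_b) in enumerate(zip(baseline, with_service)):
        (gen_a, score_a), (gen_b, score_b) = row_a, row_b
        if gen_a != gen_b:
            differences.append(f"row {index}: generation {gen_a} vs {gen_b}")
        elif score_a is None or score_b is None:
            # NULL scores only match other NULL scores.
            if score_a != score_b:
                differences.append(f"row {index}: score {score_a} vs {score_b}")
        elif not math.isclose(score_a, score_b, rel_tol=rel_tol, abs_tol=0.0):
            differences.append(f"row {index}: score {score_a} vs {score_b}")
    return differences
```

With a fixed seed, `compare_trajectories(load_scores("/tmp/consistency_test_baseline/evolution.db"), load_scores("/tmp/consistency_test_with_service/evolution.db"))` should return an empty list. For runs with residual randomness, pass a non-zero `rel_tol` instead of expecting exact equality.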
| |
| --- |
| |
| ## 📋 Phase 4: Full Integration Tests |
| |
| **Goal**: Enable the agent and verify that auxiliary metrics are generated |
| |
| ### Step 1: Configure the agent trigger |
| |
| ```bash |
| # Edit eval_agent/ev2_service_config.yaml |
| # and set a reasonable trigger interval |
| ``` |
| |
| ```yaml |
| trigger_strategy: |
|   type: "periodic" |
|   interval: 5  # trigger once every 5 generations |
| ``` |
| |
| ### Step 2: Prepare the Primary Evaluator |
| |
| Make sure the path to your primary evaluator is correct: |
| |
| ```yaml |
| primary_evaluator: |
|   path: "/home/tengxiao/pj/ShinkaEvolve/examples/circle_packing/evaluate_ori.py" |
| ``` |
| |
| ### Step 3: Start the Service (Active Mode) |
| |
| ```bash |
| # Terminal 1 |
| uv run eval_agent/ev2_service_standalone.py \ |
| --config eval_agent/ev2_service_config.yaml |
| ``` |
| |
| ### Step 4: Run the experiment |
| |
| ```bash |
| # Terminal 2 |
| uv run my/experiment_with_eval_service.py |
| ``` |
| |
| ### Expected behavior |
| |
| **Generation 1-4:** |
| ``` |
| Service: ✅ Generation 1 completed (score: 0.50) |
| Service: ⏳ Not triggering (interval=5, current=1) |
| Service: ✅ Generation 2 completed (score: 0.52) |
| Service: ⏳ Not triggering (interval=5, current=2) |
| ... |
| ``` |
| |
| **Generation 5:** |
| ``` |
| Service: ✅ Generation 5 completed (score: 0.58) |
| Service: 🎯 Trigger condition met (periodic: interval=5) |
| Service: 🤖 Launching agent... |
| Agent: 📊 Analyzing 5 generations... |
| Agent: 🔍 Reading primary evaluator... |
| Agent: 💡 Generating auxiliary metrics... |
| Agent: ✅ Created aux_metrics.py |
| Service: ✅ Agent completed in 45.2s |
| Service: 📄 Analysis saved to eval_agent_memory/EVAL_AGENTS.md |
| ``` |
| |
| **Generation 6-9:** |
| ``` |
| Service: ⏳ Not triggering... |
| ``` |
| |
| **Generation 10:** |
| ``` |
| Service: 🎯 Trigger condition met |
| Service: 🤖 Launching agent... |
| ... |
| ``` |
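The log pattern above (no trigger for generations 1 to 4, a trigger at 5 and again at 10) is consistent with a simple modulo rule. The service's actual trigger code is not shown in this guide, so the following is only one plausible reading of the `periodic` strategy; `should_trigger` and `trigger_generations` are hypothetical names.

```python
def should_trigger(generation, interval):
    """Periodic rule: fire when generation is a positive multiple of interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return generation > 0 and generation % interval == 0


def trigger_generations(num_generations, interval):
    """List the generations (1-based) at which the agent would be launched."""
    return [
        generation
        for generation in range(1, num_generations + 1)
        if should_trigger(generation, interval)
    ]


if __name__ == "__main__":
    print(trigger_generations(10, 5))       # active config, interval=5
    print(trigger_generations(10, 999999))  # passive config
```

Under this rule the passive configuration (`interval=999999`) never fires in any realistic run, which is why it can serve as the zero-side-effect mode in Phases 2 and 3.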
| |
| ### Verify the output |
| |
| ```bash |
| # Check the agent output |
| ls -la results_dir/eval_agent_memory/ |
| # You should see: |
| # - EVAL_AGENTS.md |
| # - aux_metrics.py |
| # - workspace/ |
| |
| # Read the analysis report |
| cat results_dir/eval_agent_memory/EVAL_AGENTS.md |
| |
| # Check that the auxiliary metrics file compiles |
| python -m py_compile results_dir/eval_agent_memory/aux_metrics.py |
| ``` |
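The same three checks can be folded into one helper for use in an automated test. This sketch relies only on the file names listed above; `verify_agent_output` is a hypothetical name. It uses the built-in `compile()` for the syntax check rather than `py_compile`, so no `.pyc` file is written into the results directory.

```python
from pathlib import Path

# Entries the guide expects inside eval_agent_memory/.
EXPECTED_ENTRIES = ("EVAL_AGENTS.md", "aux_metrics.py", "workspace")


def verify_agent_output(memory_dir):
    """Return a list of problems with the agent output; empty means all good."""
    memory_dir = Path(memory_dir)
    problems = [
        f"missing: {name}"
        for name in EXPECTED_ENTRIES
        if not (memory_dir / name).exists()
    ]
    aux_metrics = memory_dir / "aux_metrics.py"
    if aux_metrics.is_file():
        try:
            # Syntax check only: the module is parsed but never executed.
            compile(aux_metrics.read_text(encoding="utf-8"), str(aux_metrics), "exec")
        except (SyntaxError, ValueError) as error:
            problems.append(f"aux_metrics.py does not compile: {error}")
    return problems
```

Calling `verify_agent_output("results_dir/eval_agent_memory")` after a Phase 4 run should return an empty list once the agent has triggered at least once.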
| |
| --- |
| |
| ## 🧪 Full Test Script (Real Experiment) |
| |
| ### Using the existing Circle Packing experiment |
| |
| ```python |
| # eval_agent/test_real_integration.py |
| """ |
| Real integration test using Circle Packing example. |
| """ |
| |
| import argparse |
| import shutil |
| from pathlib import Path |
| |
| # Your existing imports |
| from shinka.core import EvolutionRunner, EvolutionConfig |
| from shinka.launch import LocalJobConfig |
| from shinka.database import DatabaseConfig |
| |
| |
| def run_circle_packing_test(with_eval_service=False): |
|     """ |
|     Run circle packing with/without eval service. |
| |
|     Args: |
|         with_eval_service: Enable eval service integration |
|     """ |
|     # Results directory |
|     suffix = "with_service" if with_eval_service else "baseline" |
|     results_dir = Path(f"/tmp/circle_packing_integration_test_{suffix}") |
| |
|     # Clean previous run |
|     if results_dir.exists(): |
|         shutil.rmtree(results_dir) |
|     results_dir.mkdir(parents=True) |
| |
|     print("=" * 60) |
|     print(f"Running Circle Packing {'WITH' if with_eval_service else 'WITHOUT'} Eval Service") |
|     print(f"Results: {results_dir}") |
|     print("=" * 60) |
| |
|     # Configuration |
|     evolution_config = EvolutionConfig( |
|         num_generations=10,  # Small for testing |
|         max_parallel_jobs=2, |
|         results_dir=str(results_dir), |
|         init_program_path="examples/circle_packing/initial.py", |
|         # Eval service (conditional) |
|         eval_service_url="http://localhost:8765" if with_eval_service else None, |
|         # ... rest of your config ... |
|     ) |
| |
|     job_config = LocalJobConfig( |
|         eval_program_path="examples/circle_packing/evaluate_ori.py", |
|     ) |
| |
|     db_config = DatabaseConfig() |
| |
|     # Run |
|     runner = EvolutionRunner( |
|         evo_config=evolution_config, |
|         job_config=job_config, |
|         db_config=db_config, |
|         verbose=True, |
|     ) |
|     runner.run() |
| |
|     print(f"\n✅ Completed: {results_dir}") |
|     return results_dir |
| |
| |
| if __name__ == "__main__": |
|     parser = argparse.ArgumentParser() |
|     parser.add_argument( |
|         "--mode", |
|         choices=["baseline", "with-service", "both"], |
|         default="baseline", |
|         help="Test mode", |
|     ) |
|     args = parser.parse_args() |
| |
|     if args.mode in ["baseline", "both"]: |
|         baseline_dir = run_circle_packing_test(with_eval_service=False) |
|         print(f"\n📊 Baseline results: {baseline_dir}") |
| |
|     if args.mode in ["with-service", "both"]: |
|         service_dir = run_circle_packing_test(with_eval_service=True) |
|         print(f"\n📊 With-service results: {service_dir}") |
| |
|         # Check for agent output (service_dir is already a Path) |
|         agent_memory = service_dir / "eval_agent_memory" |
|         if agent_memory.exists(): |
|             print("\n✅ Agent memory found:") |
|             for f in agent_memory.iterdir(): |
|                 print(f"  - {f.name}") |
|         else: |
|             print("\n⚠️ No agent memory (agent not triggered yet?)") |
| ``` |
| |
| ### Run the full test |
|
|
| ```bash |
| # Terminal 1: Service (active mode, interval=5) |
| uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml |
| |
| # Terminal 2: Baseline only |
| uv run eval_agent/test_real_integration.py --mode baseline |
| |
| # Terminal 2: With service only |
| uv run eval_agent/test_real_integration.py --mode with-service |
| |
| # Terminal 2: Both (for comparison) |
| uv run eval_agent/test_real_integration.py --mode both |
| ``` |
|
|
| --- |
|
|
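| ## ✅ Verification Checklist |
|
| ### Phase 2: Infrastructure |
|
| - [ ] Service starts successfully (passive mode) |
| - [ ] Notification is sent successfully |
| - [ ] Service receives the notification |
| - [ ] `agent_triggered_count = 0` (passive mode) |
| - [ ] Initialization succeeds both with and without the service |
|
| ### Phase 3: Result consistency |
|
| - [ ] Baseline experiment completes |
| - [ ] With-service experiment completes |
| - [ ] `combined_score` trajectories of the two runs are identical or close |
| - [ ] Run-time difference < 1% |
| - [ ] Service log shows that all notifications were received |
|
| ### Phase 4: Full integration |
|
| - [ ] Service starts (active mode) |
| - [ ] Agent triggers at the expected generations (gen 5, 10, ...) |
| - [ ] `EVAL_AGENTS.md` is generated |
| - [ ] `aux_metrics.py` is generated and syntactically valid |
| - [ ] Primary metric is unchanged |
| - [ ] Evolution completes normally |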
|
|
| --- |
|
|
| ## 🐛 Troubleshooting |
|
| ### The service does not receive notifications |
|
| **Check:** |
| ```bash |
| # Is the service running? |
| curl http://localhost:8765/api/v1/status |
| |
| # Check the runner.py log |
| grep "Notified eval service" results_dir/evolution_run.log |
| grep "Failed to notify eval service" results_dir/evolution_run.log |
| ``` |
|
|
| ### Notification sent but no response |
|
| **Possible causes:** |
| - The service crashed (check Terminal 1) |
| - The port is already in use (check with `netstat -tuln | grep 8765`) |
| - Network problems (firewall?) |
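A quick TCP probe helps separate the first cause from the other two: if nothing accepts connections on the port, the service is down. If something does, either the service is up or another process holds the port. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and the host and port are the ones used throughout this guide.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "open" if port_is_open("127.0.0.1", 8765) else "closed"
    print(f"Port 8765 is {state}")
```

An open port combined with a failing `curl http://localhost:8765/api/v1/status` suggests that a different process owns the port.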
|
|
| ### The agent does not trigger |
|
| **Check:** |
| 1. Service mode: is it running `ev2_service_config.yaml` or `ev2_service_config_passive.yaml`? |
| 2. Interval setting: is it too large (999999)? |
| 3. Number of generations: is it smaller than the interval? |
|
|
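| ### Results are inconsistent |
|
| **Expected:** |
| - Evolution with randomness: results differ slightly |
| - LLM calls: output may differ on every call |
|
| **Abnormal:** |
| - Score difference > 10%: check whether the agent modified the primary evaluator |
| - Run-time difference > 5%: check for network latency or timeouts |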
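The two thresholds above can be applied mechanically. The sketch below treats both as relative differences against the baseline run. Which score to compare (for example the final best `combined_score`) is an assumption left to the caller, and the function names are hypothetical.

```python
def relative_difference(baseline, observed):
    """Return |observed - baseline| / |baseline|; baseline must be non-zero."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(observed - baseline) / abs(baseline)


def flag_anomalies(base_score, service_score, base_seconds, service_seconds):
    """Return warnings for differences beyond the guide's 10% / 5% thresholds."""
    warnings = []
    if relative_difference(base_score, service_score) > 0.10:
        warnings.append(
            "score differs by more than 10%: "
            "check whether the agent modified the primary evaluator"
        )
    if relative_difference(base_seconds, service_seconds) > 0.05:
        warnings.append(
            "run time differs by more than 5%: check network latency or timeouts"
        )
    return warnings


if __name__ == "__main__":
    for line in flag_anomalies(0.50, 0.60, 100.0, 110.0):
        print(line)
```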
|
|
| --- |
|
|
| ## 📊 Current Progress |
|
| ``` |
| ✅ Phase 1: Basic functionality tests (done) |
| 🔄 Phase 2: Infrastructure tests (in progress) |
| ⏳ Phase 3: Result consistency tests |
| ⏳ Phase 4: Full integration tests |
| ``` |
|
| **Next step**: Run the Phase 2 tests |
|
|
| ```bash |
| # Terminal 1 |
| uv run eval_agent/ev2_service_standalone.py \ |
| --config eval_agent/ev2_service_config_passive.yaml |
| |
| # Terminal 2 |
| uv run eval_agent/test_integration_step_by_step.py |
| ``` |
|
|