EV2 Service Integration - Testing Guide
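🎯 Testing Strategy
We use an incremental testing strategy so that each step is verified before moving on:
Phase 1: Basic functionality ✓
└─ Config loading, method existence
Phase 2: Infrastructure ← current phase
├─ Service health check
├─ Notification mechanism
└─ No-side-effect verification
Phase 3: Result consistency
├─ Run without the service (baseline)
├─ Run with the service (passive mode)
└─ Compare results (should be identical)
Phase 4: Full integration
└─ Enable the agent, verify auxiliary metric generation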
📋 Phase 1: Basic Functionality Tests ✅
Goal: verify that the code changes are correct and do not break existing functionality
Run the tests
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_basic.py
Expected output
============================================================
EV2 Service Integration - Basic Tests
============================================================
Test 1: Backward compatibility (default config)...
✅ Default config: eval_service_url=None
Test 2: Enable eval service...
✅ Config with service: eval_service_url='http://localhost:8765'
Test 3: Set via kwargs...
✅ Kwargs config works correctly
Test 4: _notify_eval_service method exists...
✅ _notify_eval_service method exists
- Parameters: ['self', 'generation', 'combined_score', 'results_dir']
============================================================
✅ All basic integration tests passed!
============================================================
✅ Done!
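Test 4 above reports the parameter list of `_notify_eval_service`. A minimal sketch of how such a check can be written with `inspect`, assuming a hypothetical stand-in class (`StubRunner`) instead of the real `EvolutionRunner`; only the parameter names are taken from the expected output:

```python
import inspect


class StubRunner:
    """Hypothetical stand-in for EvolutionRunner; only the signature matters here."""

    def _notify_eval_service(self, generation, combined_score, results_dir):
        # A real implementation would send these values to the eval service.
        return None


def method_parameters(cls, name):
    """Return the parameter names of cls.<name>, or None if the method is missing."""
    method = getattr(cls, name, None)
    if method is None:
        return None
    return list(inspect.signature(method).parameters)


params = method_parameters(StubRunner, "_notify_eval_service")
missing = method_parameters(StubRunner, "no_such_method")
print(params)
print(missing)
```

Pointing `method_parameters` at the real runner class should give the same kind of list, provided the method is defined on the class as the expected output suggests.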
📋 Phase 2: Infrastructure Tests
Goal: verify that the notification mechanism works without triggering the agent (no side effects)
Step 1: Start the Service (Passive Mode)
# Terminal 1
cd /home/tengxiao/pj/ShinkaEvolve
# Use the passive config (does not trigger the agent)
uv run eval_agent/ev2_service_standalone.py \
--config eval_agent/ev2_service_config_passive.yaml
Passive mode characteristics:
- ✅ Receives notifications
- ✅ Records state
- ❌ Does not trigger the agent (interval=999999)
- ✅ Zero side effects
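Why interval=999999 behaves as "never trigger" can be seen from a small sketch of a periodic rule. This assumes the service fires whenever the generation number is a multiple of the interval; the function names below are hypothetical and the real service logic may differ:

```python
def should_trigger(generation: int, interval: int) -> bool:
    """Assumed periodic rule: fire on every `interval`-th generation (1-based)."""
    return generation > 0 and generation % interval == 0


def trigger_generations(num_generations: int, interval: int) -> list[int]:
    """List the generations at which the agent would be launched."""
    return [g for g in range(1, num_generations + 1) if should_trigger(g, interval)]


passive = trigger_generations(10, 999999)  # passive config
active = trigger_generations(10, 5)        # active config with interval=5
print(passive)  # []
print(active)   # [5, 10]
```

Under this assumption a 10-generation run never reaches the passive interval, so the agent is not launched and the run has no side effects.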
Step 2: Run the infrastructure tests
# Terminal 2
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_step_by_step.py
Expected output
======================================================================
🧪 EV2 SERVICE INTEGRATION - STEP BY STEP TESTING
======================================================================
============================================================
TEST 1: Service Health Check
============================================================
✅ Service is running
Status: ready
Generations processed: 0
============================================================
TEST 2: Notification Mechanism
============================================================
✅ Notification sent successfully
Response: {
"status": "received",
"generation": 1,
...
}
============================================================
TEST 3: Service State After Notifications
============================================================
✅ Service state retrieved
Total generations: 1
Agent triggered: 0 times ← key: the agent is not triggered
Last generation: 1
============================================================
TEST 4: Mini Evolution WITHOUT Service (Baseline)
============================================================
📁 Results dir: /tmp/test_shinka_baseline
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
- eval_service_url: None
- results_dir: /tmp/test_shinka_baseline
============================================================
TEST 5: Mini Evolution WITH Service (Should be Identical)
============================================================
📁 Results dir: /tmp/test_shinka_with_service
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
- eval_service_url: http://localhost:8765
- results_dir: /tmp/test_shinka_with_service
✅ Service URL correctly configured
======================================================================
📊 TEST SUMMARY
======================================================================
✅ PASS Service Health
✅ PASS Notification Mechanism
✅ PASS Service State Check
✅ PASS Evolution WITHOUT Service
✅ PASS Evolution WITH Service
======================================================================
🎉 All tests passed! Integration is working correctly.
======================================================================
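Verification points
- ✅ Service receives notifications
- ✅ agent_triggered_count = 0 (agent not triggered)
- ✅ Initialization succeeds in both modes
- ✅ Config is passed through correctly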
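The health check, notification, and state check can be rehearsed without the real service. The sketch below starts a throwaway passive stub on a free local port and talks to it with the standard library. The `/api/v1/status` path comes from the troubleshooting section of this guide; the `/api/v1/notify` path and the `generations_processed` field are assumptions made for illustration, not the service's confirmed API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# State kept by the stub (field names are illustrative).
state = {"generations_processed": 0, "agent_triggered_count": 0, "last_generation": None}


class PassiveStub(BaseHTTPRequestHandler):
    def _send_json(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/api/v1/status":
            self._send_json({"status": "ready", **state})
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path != "/api/v1/notify":  # hypothetical endpoint
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length))
        state["generations_processed"] += 1
        state["last_generation"] = data["generation"]
        # Passive mode: record the notification, never launch the agent.
        self._send_json({"status": "received", "generation": data["generation"]})

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


# Bypass any proxy settings so that loopback requests stay local.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def get_json(url):
    with opener.open(url, timeout=5) as resp:
        return json.load(resp)


def post_json(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener.open(req, timeout=5) as resp:
        return json.load(resp)


server = HTTPServer(("127.0.0.1", 0), PassiveStub)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

health = get_json(f"{base}/api/v1/status")
reply = post_json(
    f"{base}/api/v1/notify",
    {"generation": 1, "combined_score": 0.5, "results_dir": "/tmp/demo"},
)
after = get_json(f"{base}/api/v1/status")
server.shutdown()
server.server_close()

print(health["status"], reply, after["agent_triggered_count"])
```

The point to check is the same as in the summary above: the notification is acknowledged and counted, while `agent_triggered_count` stays at 0.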
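📋 Phase 3: Result Consistency Tests
Goal: verify that evolution results are identical with and without the service
Step 1: Prepare the test experiment
Choose a known, reproducible experiment: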
# test_consistency.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig


def run_experiment(with_service=False, run_id="baseline"):
    """Run a small experiment."""
    results_dir = f"/tmp/consistency_test_{run_id}"
    evo_config = EvolutionConfig(
        num_generations=10,  # Small but meaningful
        max_parallel_jobs=2,
        results_dir=results_dir,
        # ... your actual config ...
        eval_service_url="http://localhost:8765" if with_service else None,
    )
    # ... rest of your config (job_config and db_config must be defined here) ...
    runner = EvolutionRunner(evo_config, job_config, db_config)
    runner.run()
    return results_dir


# Run both
baseline_dir = run_experiment(with_service=False, run_id="baseline")
with_service_dir = run_experiment(with_service=True, run_id="with_service")
print(f"Baseline: {baseline_dir}")
print(f"With service: {with_service_dir}")
Step 2: Run the experiments
# Terminal 1: Service (passive mode)
uv run eval_agent/ev2_service_standalone.py \
--config eval_agent/ev2_service_config_passive.yaml
# Terminal 2: Run experiments
uv run test_consistency.py
Step 3: Compare results
# Compare database
sqlite3 /tmp/consistency_test_baseline/evolution.db \
"SELECT generation, combined_score FROM programs ORDER BY generation"
sqlite3 /tmp/consistency_test_with_service/evolution.db \
"SELECT generation, combined_score FROM programs ORDER BY generation"
# Should be IDENTICAL (or very close due to randomness)
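Expected results
- ✅ The combined_score trajectories of the two experiments are identical (if the random seed is fixed)
- ✅ The number of programs is the same
- ✅ Run times are close (difference < 1%)
- ✅ The service log shows notifications received but no agent trigger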
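The manual sqlite3 comparison can also be scripted. A minimal sketch, assuming the `programs` table and the `generation` and `combined_score` columns used in the queries above; the helper names and the demo databases are hypothetical, and the extra `combined_score` sort key is added here only to make row order deterministic:

```python
import math
import sqlite3
import tempfile
from pathlib import Path

QUERY = (
    "SELECT generation, combined_score FROM programs "
    "ORDER BY generation, combined_score"
)


def load_scores(db_path):
    """Return (generation, combined_score) rows from one evolution database."""
    conn = sqlite3.connect(str(db_path))
    try:
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


def same_trajectory(a, b, tol=1e-9):
    """True if both runs have the same rows, with scores equal within tol."""
    if len(a) != len(b):
        return False
    return all(
        ga == gb and math.isclose(sa, sb, abs_tol=tol)
        for (ga, sa), (gb, sb) in zip(a, b)
    )


def make_demo_db(path, rows):
    """Create a tiny stand-in database for the demonstration."""
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE programs (generation INTEGER, combined_score REAL)")
    conn.executemany("INSERT INTO programs VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as tmp:
    rows = [(0, 0.50), (1, 0.52), (2, 0.58)]
    make_demo_db(Path(tmp) / "baseline.db", rows)
    make_demo_db(Path(tmp) / "with_service.db", rows)
    make_demo_db(Path(tmp) / "drifted.db", rows[:-1] + [(2, 0.60)])
    baseline = load_scores(Path(tmp) / "baseline.db")
    identical = same_trajectory(baseline, load_scores(Path(tmp) / "with_service.db"))
    drifted = same_trajectory(baseline, load_scores(Path(tmp) / "drifted.db"))

print(identical, drifted)  # True False
```

For real runs, point `load_scores` at the two `evolution.db` files from Step 3 and loosen `tol` if the experiment is not seeded.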
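📋 Phase 4: Full Integration Tests
Goal: enable the agent and verify auxiliary metric generation
Step 1: Configure the agent trigger
# Edit eval_agent/ev2_service_config.yaml
# Set a reasonable trigger interval
trigger_strategy:
  type: "periodic"
  interval: 5  # trigger every 5 generations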
Step 2: Prepare the primary evaluator
Make sure the path to your primary evaluator is correct:
primary_evaluator:
  path: "/home/tengxiao/pj/ShinkaEvolve/examples/circle_packing/evaluate_ori.py"
Step 3: Start the Service (Active Mode)
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
--config eval_agent/ev2_service_config.yaml
Step 4: Run the experiment
# Terminal 2
uv run my/experiment_with_eval_service.py
Expected behavior
Generation 1-4:
Service: ✅ Generation 1 completed (score: 0.50)
Service: ⏳ Not triggering (interval=5, current=1)
Service: ✅ Generation 2 completed (score: 0.52)
Service: ⏳ Not triggering (interval=5, current=2)
...
Generation 5:
Service: ✅ Generation 5 completed (score: 0.58)
Service: 🎯 Trigger condition met (periodic: interval=5)
Service: 🤖 Launching agent...
Agent: 📊 Analyzing 5 generations...
Agent: 🔍 Reading primary evaluator...
Agent: 💡 Generating auxiliary metrics...
Agent: ✅ Created aux_metrics.py
Service: ✅ Agent completed in 45.2s
Service: 📄 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
Generation 6-9:
Service: ⏳ Not triggering...
Generation 10:
Service: 🎯 Trigger condition met
Service: 🤖 Launching agent...
...
Verify the output
# Check the agent output
ls -la results_dir/eval_agent_memory/
# You should see:
# - EVAL_AGENTS.md
# - aux_metrics.py
# - workspace/
# View the analysis report
cat results_dir/eval_agent_memory/EVAL_AGENTS.md
# Validate the auxiliary metrics
python -m py_compile results_dir/eval_agent_memory/aux_metrics.py
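The same checks can be run from Python, which is handy inside a test script. A minimal sketch, assuming the three entries listed above are the complete expected set; `check_agent_memory` is a hypothetical helper, and it uses the built-in `compile()` rather than `py_compile` so that no bytecode files are written:

```python
import tempfile
from pathlib import Path

EXPECTED = ["EVAL_AGENTS.md", "aux_metrics.py", "workspace"]


def check_agent_memory(results_dir):
    """Report which expected entries exist and whether aux_metrics.py parses."""
    memory = Path(results_dir) / "eval_agent_memory"
    report = {name: (memory / name).exists() for name in EXPECTED}
    aux = memory / "aux_metrics.py"
    syntax_ok = False
    if aux.is_file():
        try:
            # Syntax check only; the module is never executed.
            compile(aux.read_text(encoding="utf-8"), str(aux), "exec")
            syntax_ok = True
        except SyntaxError:
            syntax_ok = False
    report["aux_metrics_syntax_ok"] = syntax_ok
    return report


# Demonstration on a throwaway directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    memory = Path(tmp) / "eval_agent_memory"
    (memory / "workspace").mkdir(parents=True)
    (memory / "EVAL_AGENTS.md").write_text("# demo report\n", encoding="utf-8")
    (memory / "aux_metrics.py").write_text(
        "def demo_metric(x):\n    return x\n", encoding="utf-8"
    )
    good = check_agent_memory(tmp)
    (memory / "aux_metrics.py").write_text("def broken(:\n", encoding="utf-8")
    bad = check_agent_memory(tmp)

print(good)
print(bad)
```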
🧪 Full Test Script (Real Experiment)
Using the existing Circle Packing experiment
# eval_agent/test_real_integration.py
"""
Real integration test using Circle Packing example.
"""
import shutil
from pathlib import Path

# Your existing imports
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig


def run_circle_packing_test(with_eval_service=False):
    """
    Run circle packing with/without eval service.

    Args:
        with_eval_service: Enable eval service integration
    """
    # Results directory
    suffix = "with_service" if with_eval_service else "baseline"
    results_dir = Path(f"/tmp/circle_packing_integration_test_{suffix}")

    # Clean previous run
    if results_dir.exists():
        shutil.rmtree(results_dir)
    results_dir.mkdir(parents=True)

    print("=" * 60)
    print(f"Running Circle Packing {'WITH' if with_eval_service else 'WITHOUT'} Eval Service")
    print(f"Results: {results_dir}")
    print("=" * 60)

    # Configuration
    evolution_config = EvolutionConfig(
        num_generations=10,  # Small for testing
        max_parallel_jobs=2,
        results_dir=str(results_dir),
        init_program_path="examples/circle_packing/initial.py",
        # Eval service (conditional)
        eval_service_url="http://localhost:8765" if with_eval_service else None,
        # ... rest of your config ...
    )
    job_config = LocalJobConfig(
        eval_program_path="examples/circle_packing/evaluate_ori.py",
    )
    db_config = DatabaseConfig()

    # Run
    runner = EvolutionRunner(
        evo_config=evolution_config,
        job_config=job_config,
        db_config=db_config,
        verbose=True,
    )
    runner.run()

    print(f"\n✅ Completed: {results_dir}")
    return results_dir


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=["baseline", "with-service", "both"],
        default="baseline",
        help="Test mode",
    )
    args = parser.parse_args()

    if args.mode in ["baseline", "both"]:
        baseline_dir = run_circle_packing_test(with_eval_service=False)
        print(f"\n📊 Baseline results: {baseline_dir}")

    if args.mode in ["with-service", "both"]:
        service_dir = run_circle_packing_test(with_eval_service=True)
        print(f"\n📊 With-service results: {service_dir}")

        # Check for agent output (only meaningful for the with-service run)
        agent_memory = service_dir / "eval_agent_memory"
        if agent_memory.exists():
            print("\n✅ Agent memory found:")
            for f in agent_memory.iterdir():
                print(f"  - {f.name}")
        else:
            print("\n⚠️ No agent memory (agent not triggered yet?)")
Run the full test
# Terminal 1: Service (active mode, interval=5)
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml
# Terminal 2: Baseline only
uv run eval_agent/test_real_integration.py --mode baseline
# Terminal 2: With service only
uv run eval_agent/test_real_integration.py --mode with-service
# Terminal 2: Both (for comparison)
uv run eval_agent/test_real_integration.py --mode both
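✅ Verification Checklist
Phase 2: Infrastructure
- Service starts successfully (passive mode)
- Notification is sent successfully
- Service receives the notification
- agent_triggered_count = 0 (passive mode)
- Initialization succeeds both with and without the service
Phase 3: Result consistency
- Baseline experiment completes
- With-service experiment completes
- The combined_score trajectories of the two runs are identical or close
- Run time difference < 1%
- Service log shows all notifications received
Phase 4: Full integration
- Service starts (active mode)
- Agent triggers at the expected generations (gen 5, 10, ...)
- EVAL_AGENTS.md is generated
- aux_metrics.py is generated and syntactically valid
- Primary metric is not modified
- Evolution completes normally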
🐛 Troubleshooting
Service does not receive notifications
Check:
# Is the service running?
curl http://localhost:8765/api/v1/status
# Check the runner.py log
grep "Notified eval service" results_dir/evolution_run.log
grep "Failed to notify eval service" results_dir/evolution_run.log
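To get counts rather than raw matches, the two grep checks can be folded into one pass over the log. A minimal sketch: the two marker strings are the ones grepped for above, `count_notifications` is a hypothetical helper, and the demo log lines are made up (real log lines will carry more detail):

```python
import tempfile
from pathlib import Path

OK_MARK = "Notified eval service"
FAIL_MARK = "Failed to notify eval service"


def count_notifications(log_path):
    """Count successful and failed notification lines in a runner log."""
    sent = failed = 0
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if FAIL_MARK in line:
                failed += 1
            elif OK_MARK in line:
                sent += 1
    return sent, failed


# Demonstration with a made-up log file.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "evolution_run.log"
    log.write_text(
        f"{OK_MARK} (demo line)\n"
        f"{FAIL_MARK} (demo line)\n"
        f"{OK_MARK} (demo line)\n"
        "unrelated demo line\n",
        encoding="utf-8",
    )
    sent, failed = count_notifications(log)

print(sent, failed)  # 2 1
```

If `sent` is 0 for a run that had `eval_service_url` set, the runner never reached the service; a non-zero `failed` points at the service side instead.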
Notification sent but no response
Possible causes:
- The service crashed (check Terminal 1)
- The port is already in use (check with netstat -tuln | grep 8765)
- Network problems (firewall?)
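Where netstat is not available, a TCP connect test gives similar information. A minimal sketch using only the `socket` module; `port_is_open` is a hypothetical helper, and the demo uses a throwaway listener on a free port instead of the real service on 8765:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Demonstration: open a temporary listener and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 = pick any free port
listener.listen(1)
demo_port = listener.getsockname()[1]
open_while_listening = port_is_open("127.0.0.1", demo_port)
listener.close()

print(open_while_listening)  # True
```

For the real check, call `port_is_open("localhost", 8765)`. A True result while the service is stopped suggests another process holds the port.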
Agent does not trigger
Check:
- Service mode: ev2_service_config.yaml or ev2_service_config_passive.yaml?
- Interval setting: is it too large (999999)?
- Number of generations: is it smaller than the interval?
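Inconsistent results
Normal:
- Evolution with randomness: results differ slightly
- LLM calls: output may differ on each call
Abnormal:
- Score difference > 10%: check whether the agent modified the primary evaluator
- Run time difference > 5%: check for network latency or timeouts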
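The two thresholds above can be applied mechanically. A minimal sketch, assuming the differences are measured relative to the baseline run; the function names and the sample numbers are hypothetical:

```python
def relative_diff(baseline, other):
    """Absolute difference as a fraction of the baseline value."""
    if baseline == 0:
        return 0.0 if other == 0 else float("inf")
    return abs(other - baseline) / abs(baseline)


def flag_anomalies(base_score, svc_score, base_secs, svc_secs,
                   score_limit=0.10, time_limit=0.05):
    """Return which of the two checks exceed their threshold."""
    flags = []
    if relative_diff(base_score, svc_score) > score_limit:
        flags.append("score")    # check whether the primary evaluator changed
    if relative_diff(base_secs, svc_secs) > time_limit:
        flags.append("runtime")  # check network latency or timeouts
    return flags


ok_run = flag_anomalies(0.58, 0.57, 120.0, 121.0)
bad_run = flag_anomalies(0.58, 0.40, 120.0, 130.0)
print(ok_run)   # []
print(bad_run)  # ['score', 'runtime']
```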
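📊 Current Progress
✅ Phase 1: Basic functionality tests (done)
🔄 Phase 2: Infrastructure tests (in progress)
⏳ Phase 3: Result consistency tests
⏳ Phase 4: Full integration tests
Next step: run the Phase 2 tests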
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
--config eval_agent/ev2_service_config_passive.yaml
# Terminal 2
uv run eval_agent/test_integration_step_by_step.py