	# EV2 Service Integration - Testing Guide

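	## 🎯 Testing Strategy

	We test incrementally and verify each step before moving on to the next:

	```
	Phase 1: Basic functionality ✓
	└─ Config loading, method presence

	Phase 2: Infrastructure ← current phase
	├─ Service health check
	├─ Notification mechanism
	└─ No-side-effect verification

	Phase 3: Result consistency
	├─ Run without the service (baseline)
	├─ Run with the service (passive mode)
	└─ Compare results (should be identical)

	Phase 4: Full integration
	└─ Enable the agent, verify auxiliary metric generation
	```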

	---

	## 📋 Phase 1: Basic Functionality Tests ✅

	Goal: verify that the code changes are correct and do not break existing functionality.

	### Run the tests

	```bash
	cd /home/tengxiao/pj/ShinkaEvolve
	uv run eval_agent/test_integration_basic.py
	```

	### Expected output

	```
	============================================================
	EV2 Service Integration - Basic Tests
	============================================================
	Test 1: Backward compatibility (default config)...
	✅ Default config: eval_service_url=None

	Test 2: Enable eval service...
	✅ Config with service: eval_service_url='http://localhost:8765'

	Test 3: Set via kwargs...
	✅ Kwargs config works correctly

	Test 4: _notify_eval_service method exists...
	✅ _notify_eval_service method exists
	- Parameters: ['self', 'generation', 'combined_score', 'results_dir']

	============================================================
	✅ All basic integration tests passed!
	============================================================
	```

	✅ Done!
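Test 4 amounts to a signature check. The sketch below shows one way to write such a check with the standard `inspect` module. `DummyRunner` is a hypothetical stand-in for `EvolutionRunner`, and `method_parameters` is an illustrative helper; only the parameter names come from the output above.

```python
import inspect


class DummyRunner:
    """Hypothetical stand-in for EvolutionRunner (not the real class)."""

    def _notify_eval_service(self, generation, combined_score, results_dir):
        # The real method would contact the eval service; this stub does nothing.
        return None


def method_parameters(cls, name):
    """Return the parameter names of cls.<name>, or None if the method is missing."""
    method = getattr(cls, name, None)
    if method is None or not callable(method):
        return None
    return list(inspect.signature(method).parameters)


params = method_parameters(DummyRunner, "_notify_eval_service")
print(params)
```

Looking the method up on the class rather than on an instance keeps `self` in the reported parameter list, which matches the format shown in the expected output.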

	---

	## 📋 Phase 2: Infrastructure Tests

	Goal: verify that the notification mechanism works without triggering the agent (no side effects).

	### Step 1: Start the Service (Passive Mode)

	```bash
	# Terminal 1
	cd /home/tengxiao/pj/ShinkaEvolve

	# Use the passive config (never triggers the agent)
	uv run eval_agent/ev2_service_standalone.py \
	--config eval_agent/ev2_service_config_passive.yaml
	```

	Passive mode characteristics:
	- ✅ Receives notifications
	- ✅ Records state
	- ❌ Does not trigger the agent (interval=999999)
	- ✅ Zero side effects

	### Step 2: Run the infrastructure tests

	```bash
	# Terminal 2
	cd /home/tengxiao/pj/ShinkaEvolve
	uv run eval_agent/test_integration_step_by_step.py
	```

	### Expected output

	```
	======================================================================
	🧪 EV2 SERVICE INTEGRATION - STEP BY STEP TESTING
	======================================================================

	============================================================
	TEST 1: Service Health Check
	============================================================
	✅ Service is running
	Status: ready
	Generations processed: 0

	============================================================
	TEST 2: Notification Mechanism
	============================================================
	✅ Notification sent successfully
	Response: {
	"status": "received",
	"generation": 1,
	...
	}

	============================================================
	TEST 3: Service State After Notifications
	============================================================
	✅ Service state retrieved
	Total generations: 1
	Agent triggered: 0 times ← key check: the agent is not triggered
	Last generation: 1

	============================================================
	TEST 4: Mini Evolution WITHOUT Service (Baseline)
	============================================================
	📁 Results dir: /tmp/test_shinka_baseline
	🚀 Starting evolution (3 generations)...
	✅ Evolution runner initialized successfully
	- eval_service_url: None
	- results_dir: /tmp/test_shinka_baseline

	============================================================
	TEST 5: Mini Evolution WITH Service (Should be Identical)
	============================================================
	📁 Results dir: /tmp/test_shinka_with_service
	🚀 Starting evolution (3 generations)...
	✅ Evolution runner initialized successfully
	- eval_service_url: http://localhost:8765
	- results_dir: /tmp/test_shinka_with_service
	✅ Service URL correctly configured

	======================================================================
	📊 TEST SUMMARY
	======================================================================
	✅ PASS Service Health
	✅ PASS Notification Mechanism
	✅ PASS Service State Check
	✅ PASS Evolution WITHOUT Service
	✅ PASS Evolution WITH Service
	======================================================================
	🎉 All tests passed! Integration is working correctly.
	======================================================================
	```
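The "no side effects" requirement implies that a failed notification must never interrupt the evolution loop. The following is a minimal sketch of such a best-effort sender, not the project's actual `_notify_eval_service` implementation. The function name, the payload keys (mirroring the method parameters listed in Phase 1), and the endpoint passed in by the caller are all assumptions.

```python
import http.client
import json
import urllib.request


def notify_eval_service(endpoint, generation, combined_score, results_dir, timeout=2.0):
    """POST one generation summary; return True on HTTP 2xx, False on any failure."""
    payload = json.dumps(
        {
            "generation": generation,
            "combined_score": combined_score,
            "results_dir": str(results_dir),
        }
    ).encode("utf-8")
    try:
        request = urllib.request.Request(
            endpoint,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs. Swallowing them keeps the
        # evolution loop unaffected when the service is down.
        return False
```

A short timeout matters here: the Phase 3 checks expect run times with and without the service to stay close, so a slow or dead service should cost at most a couple of seconds per generation.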

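	### What to verify

	- ✅ The service receives notifications
	- ✅ `agent_triggered_count = 0` (the agent never fired)
	- ✅ Initialization succeeds in both modes
	- ✅ The configuration is passed through correctly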
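These checks can also be scripted against the status endpoint. The sketch below is an illustration under assumptions: the `/api/v1/status` path appears in the troubleshooting section, but the JSON key `total_generations` is a guess based on the test output, and both helper names are hypothetical.

```python
import json
import urllib.request


def fetch_status(base_url="http://localhost:8765", timeout=2.0):
    """GET /api/v1/status and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def passive_mode_problems(status, expected_generations):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Key names are assumptions about the status payload.
    seen = status.get("total_generations")
    if seen != expected_generations:
        problems.append(f"expected {expected_generations} generations, service saw {seen}")
    if status.get("agent_triggered_count", 0) != 0:
        problems.append("agent was triggered in passive mode")
    return problems
```

With the passive service running, `passive_mode_problems(fetch_status(), expected_generations=1)` should return an empty list after the single test notification.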

	---

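	## 📋 Phase 3: Result Consistency Tests

	Goal: verify that evolution results are identical with and without the service.

	### Step 1: Prepare the test experiment

	Pick a known, reproducible experiment: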

	```python
	# test_consistency.py
	from shinka.core import EvolutionRunner, EvolutionConfig
	from shinka.launch import LocalJobConfig
	from shinka.database import DatabaseConfig


	def run_experiment(with_service=False, run_id="baseline"):
	    """Run a small experiment and return its results directory."""
	    results_dir = f"/tmp/consistency_test_{run_id}"

	    evo_config = EvolutionConfig(
	        num_generations=10,  # Small but meaningful
	        max_parallel_jobs=2,
	        results_dir=results_dir,
	        # ... your actual config ...
	        eval_service_url="http://localhost:8765" if with_service else None,
	    )

	    # ... rest of your config (must define job_config and db_config) ...

	    runner = EvolutionRunner(evo_config, job_config, db_config)
	    runner.run()

	    return results_dir


	# Run both
	baseline_dir = run_experiment(with_service=False, run_id="baseline")
	with_service_dir = run_experiment(with_service=True, run_id="with_service")

	print(f"Baseline: {baseline_dir}")
	print(f"With service: {with_service_dir}")
	```

	### Step 2: Run the experiments

	```bash
	# Terminal 1: Service (passive mode)
	uv run eval_agent/ev2_service_standalone.py \
	--config eval_agent/ev2_service_config_passive.yaml

	# Terminal 2: Run experiments
	uv run test_consistency.py
	```

	### Step 3: Compare the results

	```bash
	# Compare database
	sqlite3 /tmp/consistency_test_baseline/evolution.db \
	"SELECT generation, combined_score FROM programs ORDER BY generation"

	sqlite3 /tmp/consistency_test_with_service/evolution.db \
	"SELECT generation, combined_score FROM programs ORDER BY generation"

	# Should be IDENTICAL (or very close due to randomness)
	```

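	### Expected results

	- ✅ The `combined_score` trajectories of the two runs match (given a fixed random seed)
	- ✅ The number of programs is the same
	- ✅ Run times are close (difference < 1%)
	- ✅ The service log shows notifications received but no agent trigger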
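Comparing the two databases by eye becomes tedious beyond a few generations. The sketch below automates the check with the standard `sqlite3` module. The `programs` table and its two columns come from the queries above; the helper names, the extra `combined_score` sort key (added so rows within one generation line up deterministically), and the `tolerance` parameter are assumptions.

```python
import sqlite3

QUERY = (
    "SELECT generation, combined_score FROM programs "
    "ORDER BY generation, combined_score"
)


def load_scores(conn):
    """Return [(generation, combined_score), ...] from an open SQLite connection."""
    return conn.execute(QUERY).fetchall()


def compare_runs(baseline, candidate, tolerance=0.0):
    """Return a list of mismatch descriptions; an empty list means the runs agree."""
    if len(baseline) != len(candidate):
        return [f"program count differs: {len(baseline)} vs {len(candidate)}"]
    mismatches = []
    for (gen_a, score_a), (gen_b, score_b) in zip(baseline, candidate):
        if score_a is None or score_b is None:
            same_score = score_a is None and score_b is None
        else:
            same_score = abs(score_a - score_b) <= tolerance
        if gen_a != gen_b or not same_score:
            mismatches.append(f"gen {gen_a}: {score_a} vs gen {gen_b}: {score_b}")
    return mismatches
```

To use it, open both files with `sqlite3.connect("/tmp/consistency_test_baseline/evolution.db")` and `sqlite3.connect("/tmp/consistency_test_with_service/evolution.db")`, then pass the two `load_scores` results to `compare_runs`. A non-zero `tolerance` is only appropriate when the runs are not seeded.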

	---

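	## 📋 Phase 4: Full Integration Tests

	Goal: enable the agent and verify that auxiliary metrics are generated.

	### Step 1: Configure the agent trigger

	```bash
	# Edit eval_agent/ev2_service_config.yaml
	# and set a reasonable trigger interval
	```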

	```yaml
	trigger_strategy:
	  type: "periodic"
	  interval: 5  # trigger once every 5 generations
	```
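The periodic strategy can be pictured as a modulo test on the generation counter. This is a sketch of the assumed behavior, not the service's actual code; `should_trigger` is a hypothetical name, and the rule is inferred from the expected logs (triggers at generations 5 and 10).

```python
def should_trigger(generation, interval):
    """Periodic strategy: fire on every `interval`-th generation (1-based)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return generation > 0 and generation % interval == 0


# Active config (interval=5) versus passive config (interval=999999)
# over a 10-generation run.
active = [g for g in range(1, 11) if should_trigger(g, 5)]
passive = [g for g in range(1, 11) if should_trigger(g, 999999)]
print(active)   # [5, 10]
print(passive)  # []
```

Under this rule the passive configuration never fires in any realistic run, which is why it is safe to use for the consistency tests.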

	### Step 2: Prepare the Primary Evaluator

	Make sure the path to your primary evaluator is correct:

	```yaml
	primary_evaluator:
	  path: "/home/tengxiao/pj/ShinkaEvolve/examples/circle_packing/evaluate_ori.py"
	```

	### Step 3: Start the Service (Active Mode)

	```bash
	# Terminal 1
	uv run eval_agent/ev2_service_standalone.py \
	--config eval_agent/ev2_service_config.yaml
	```

	### Step 4: Run the experiment

	```bash
	# Terminal 2
	uv run my/experiment_with_eval_service.py
	```

	### Expected behavior

	Generation 1-4:
	```
	Service: ✅ Generation 1 completed (score: 0.50)
	Service: ⏳ Not triggering (interval=5, current=1)
	Service: ✅ Generation 2 completed (score: 0.52)
	Service: ⏳ Not triggering (interval=5, current=2)
	...
	```

	Generation 5:
	```
	Service: ✅ Generation 5 completed (score: 0.58)
	Service: 🎯 Trigger condition met (periodic: interval=5)
	Service: 🤖 Launching agent...
	Agent: 📊 Analyzing 5 generations...
	Agent: 🔍 Reading primary evaluator...
	Agent: 💡 Generating auxiliary metrics...
	Agent: ✅ Created aux_metrics.py
	Service: ✅ Agent completed in 45.2s
	Service: 📄 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
	```

	Generation 6-9:
	```
	Service: ⏳ Not triggering...
	```

	Generation 10:
	```
	Service: 🎯 Trigger condition met
	Service: 🤖 Launching agent...
	...
	```

	### Verify the output

	```bash
	# Inspect the agent output
	ls -la results_dir/eval_agent_memory/
	# You should see:
	# - EVAL_AGENTS.md
	# - aux_metrics.py
	# - workspace/

	# Read the analysis report
	cat results_dir/eval_agent_memory/EVAL_AGENTS.md

	# Check that the auxiliary metrics file compiles
	python -m py_compile results_dir/eval_agent_memory/aux_metrics.py
	```
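The same verification can be done from Python, which is convenient inside a test script. This is a minimal sketch: the three expected names come from the listing above, while `check_agent_outputs` is a hypothetical helper. It uses `py_compile.compile(..., doraise=True)`, the programmatic equivalent of `python -m py_compile`.

```python
import py_compile
from pathlib import Path

EXPECTED = ("EVAL_AGENTS.md", "aux_metrics.py", "workspace")


def check_agent_outputs(results_dir):
    """Return a list of problems found under <results_dir>/eval_agent_memory."""
    memory_dir = Path(results_dir) / "eval_agent_memory"
    problems = [
        f"missing: {name}" for name in EXPECTED if not (memory_dir / name).exists()
    ]
    aux_metrics = memory_dir / "aux_metrics.py"
    if aux_metrics.is_file():
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(str(aux_metrics), doraise=True)
        except py_compile.PyCompileError as exc:
            problems.append(f"aux_metrics.py does not compile: {exc.msg}")
    return problems
```

An empty list means all three artifacts exist and `aux_metrics.py` is syntactically valid. It does not show that the metrics are meaningful, only that the file can be byte-compiled.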

	---

	## 🧪 Full Test Script (Real Experiment)

	### Using the existing Circle Packing experiment

	```python
	# eval_agent/test_real_integration.py
	"""
	Real integration test using the Circle Packing example.
	"""

	import shutil
	from pathlib import Path

	# Your existing imports
	from shinka.core import EvolutionRunner, EvolutionConfig
	from shinka.launch import LocalJobConfig
	from shinka.database import DatabaseConfig


	def run_circle_packing_test(with_eval_service=False):
	    """
	    Run circle packing with or without the eval service.

	    Args:
	        with_eval_service: Enable eval service integration
	    """
	    # Results directory
	    suffix = "with_service" if with_eval_service else "baseline"
	    results_dir = Path(f"/tmp/circle_packing_integration_test_{suffix}")

	    # Clean previous run
	    if results_dir.exists():
	        shutil.rmtree(results_dir)
	    results_dir.mkdir(parents=True)

	    print("=" * 60)
	    print(f"Running Circle Packing {'WITH' if with_eval_service else 'WITHOUT'} Eval Service")
	    print(f"Results: {results_dir}")
	    print("=" * 60)

	    # Configuration
	    evolution_config = EvolutionConfig(
	        num_generations=10,  # Small for testing
	        max_parallel_jobs=2,
	        results_dir=str(results_dir),
	        init_program_path="examples/circle_packing/initial.py",

	        # Eval service (conditional)
	        eval_service_url="http://localhost:8765" if with_eval_service else None,

	        # ... rest of your config ...
	    )

	    job_config = LocalJobConfig(
	        eval_program_path="examples/circle_packing/evaluate_ori.py",
	    )

	    db_config = DatabaseConfig()

	    # Run
	    runner = EvolutionRunner(
	        evo_config=evolution_config,
	        job_config=job_config,
	        db_config=db_config,
	        verbose=True,
	    )

	    runner.run()

	    print(f"\n✅ Completed: {results_dir}")
	    return results_dir


	if __name__ == "__main__":
	    import argparse

	    parser = argparse.ArgumentParser()
	    parser.add_argument(
	        "--mode",
	        choices=["baseline", "with-service", "both"],
	        default="baseline",
	        help="Test mode",
	    )
	    args = parser.parse_args()

	    if args.mode in ["baseline", "both"]:
	        baseline_dir = run_circle_packing_test(with_eval_service=False)
	        print(f"\n📊 Baseline results: {baseline_dir}")

	    if args.mode in ["with-service", "both"]:
	        service_dir = run_circle_packing_test(with_eval_service=True)
	        print(f"\n📊 With-service results: {service_dir}")

	        # Check for agent output
	        agent_memory = Path(service_dir) / "eval_agent_memory"
	        if agent_memory.exists():
	            print("\n✅ Agent memory found:")
	            for f in agent_memory.iterdir():
	                print(f" - {f.name}")
	        else:
	            print("\n⚠️ No agent memory (agent not triggered yet?)")
	```

	### Run the full test

	```bash
	# Terminal 1: Service (active mode, interval=5)
	uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

	# Terminal 2: Baseline only
	uv run eval_agent/test_real_integration.py --mode baseline

	# Terminal 2: With service only
	uv run eval_agent/test_real_integration.py --mode with-service

	# Terminal 2: Both (for comparison)
	uv run eval_agent/test_real_integration.py --mode both
	```

	---

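	## ✅ Verification Checklist

	### Phase 2: Infrastructure

	- [ ] Service starts successfully (passive mode)
	- [ ] Notifications are sent successfully
	- [ ] Service receives the notifications
	- [ ] `agent_triggered_count = 0` (passive mode)
	- [ ] Initialization succeeds both with and without the service

	### Phase 3: Result consistency

	- [ ] Baseline experiment completes
	- [ ] With-service experiment completes
	- [ ] The `combined_score` trajectories of the two runs are identical or close
	- [ ] Run time difference < 1%
	- [ ] Service log shows that every notification was received

	### Phase 4: Full integration

	- [ ] Service starts (active mode)
	- [ ] Agent triggers at the expected generations (gen 5, 10, ...)
	- [ ] `EVAL_AGENTS.md` is generated
	- [ ] `aux_metrics.py` is generated and syntactically valid
	- [ ] The primary metric is not modified
	- [ ] Evolution completes normally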

	---

	## 🐛 Troubleshooting

	### The service receives no notifications

	Check:
	```bash
	# Is the service running?
	curl http://localhost:8765/api/v1/status

	# Check the runner.py log messages
	grep "Notified eval service" results_dir/evolution_run.log
	grep "Failed to notify eval service" results_dir/evolution_run.log
	```
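When the log is long, counting both message types in one pass gives a quicker picture than two separate `grep` calls. The sketch below assumes the two log phrases shown above appear verbatim in `evolution_run.log`; `summarize_notifications` is a hypothetical helper.

```python
def summarize_notifications(lines):
    """Count successful and failed notification messages in an iterable of log lines."""
    sent = failed = 0
    for line in lines:
        if "Failed to notify eval service" in line:
            failed += 1
        elif "Notified eval service" in line:
            sent += 1
    return {"sent": sent, "failed": failed}


sample = [
    "INFO Notified eval service: generation=1",
    "WARNING Failed to notify eval service: connection refused",
    "INFO Generation 2 finished",
    "INFO Notified eval service: generation=3",
]
summary = summarize_notifications(sample)
print(summary)
```

For a real run, pass an open file instead of the sample list, for example `summarize_notifications(open("results_dir/evolution_run.log", encoding="utf-8"))`. A `sent` count of zero with no failures suggests that `eval_service_url` was never set.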

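	### Notifications are sent but get no response

	Possible causes:
	- The service crashed (check Terminal 1)
	- The port is already in use (check with `netstat -tuln | grep 8765`)
	- Network problems (firewall?)

	### The agent does not trigger

	Check:
	1. Service mode: are you running `ev2_service_config.yaml` or `ev2_service_config_passive.yaml`?
	2. Interval setting: is it too large (999999)?
	3. Number of generations: is it smaller than the interval?

	### Results are inconsistent

	Expected:
	- Evolution with randomness: results differ slightly
	- LLM calls: output may differ from run to run

	Abnormal:
	- Score difference > 10%: check whether the agent modified the primary evaluator
	- Run time difference > 5%: check for network latency or timeouts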
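The two thresholds above can be encoded as a small check. The sketch below assumes that "difference" means relative difference against the baseline run; the function names and the choice of inputs (one score and one wall-clock time per run) are illustrative.

```python
def relative_diff(baseline, candidate):
    """Relative difference |candidate - baseline| / |baseline|."""
    if baseline == 0:
        return 0.0 if candidate == 0 else float("inf")
    return abs(candidate - baseline) / abs(baseline)


def classify_run(baseline_score, service_score, baseline_seconds, service_seconds):
    """Return warnings for differences beyond the troubleshooting thresholds."""
    warnings = []
    if relative_diff(baseline_score, service_score) > 0.10:
        warnings.append("score differs by more than 10%")
    if relative_diff(baseline_seconds, service_seconds) > 0.05:
        warnings.append("runtime differs by more than 5%")
    return warnings


print(classify_run(0.50, 0.52, 100.0, 103.0))
print(classify_run(0.50, 0.60, 100.0, 110.0))
```

The first call stays within both thresholds and returns no warnings; the second exceeds both. These limits are looser than the 1% runtime target in Phase 3 because they mark clearly abnormal runs rather than the ideal outcome.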

	---

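	## 📊 Current Progress

	```
	✅ Phase 1: Basic functionality tests (done)
	🔄 Phase 2: Infrastructure tests (in progress)
	⏳ Phase 3: Result consistency tests
	⏳ Phase 4: Full integration tests
	```

	Next step: run the Phase 2 tests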

	```bash
	# Terminal 1
	uv run eval_agent/ev2_service_standalone.py \
	--config eval_agent/ev2_service_config_passive.yaml

	# Terminal 2
	uv run eval_agent/test_integration_step_by_step.py
	```