# EV2 Service Integration - Testing Guide

## 🎯 Testing Strategy

We use a **progressive testing** strategy so that every step is verified before moving on to the next:

```
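Phase 1: Basic functionality ✓
  └─ Config loading, method existence

Phase 2: Infrastructure ← current phase
  ├─ Service health check
  ├─ Notification mechanism
  └─ No-side-effect verification

Phase 3: Result consistency
  ├─ Run without the service (baseline)
  ├─ Run with the service (passive mode)
  └─ Compare results (should be identical)

Phase 4: Full integration
  └─ Enable the agent, verify auxiliary metric generation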
```

---

## 📋 Phase 1: Basic Functionality Tests ✅

**Goal**: Verify that the code changes are correct and do not break existing functionality.

### Run the Tests

```bash
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_basic.py
```

### Expected Output

```
============================================================
EV2 Service Integration - Basic Tests
============================================================
Test 1: Backward compatibility (default config)...
  ✅ Default config: eval_service_url=None

Test 2: Enable eval service...
  ✅ Config with service: eval_service_url='http://localhost:8765'

Test 3: Set via kwargs...
  ✅ Kwargs config works correctly

Test 4: _notify_eval_service method exists...
  ✅ _notify_eval_service method exists
     - Parameters: ['self', 'generation', 'combined_score', 'results_dir']

============================================================
✅ All basic integration tests passed!
============================================================
```

**✅ Completed!**
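
Test 4 in the output above reports the parameter list of `_notify_eval_service`. A check of that kind can be sketched with `inspect.signature`. This is a minimal illustration, not the contents of `test_integration_basic.py`: `StubRunner` and `method_parameters` are hypothetical names, and the stub only mirrors the parameter names shown in the expected output.

```python
import inspect


class StubRunner:
    """Hypothetical stand-in for the real runner class (illustration only)."""

    def _notify_eval_service(self, generation, combined_score, results_dir):
        pass


def method_parameters(cls, method_name):
    """Return the parameter names of cls.method_name, or None if it is missing."""
    method = getattr(cls, method_name, None)
    if method is None or not callable(method):
        return None
    # Looking the method up on the class (not an instance) keeps 'self' in the list.
    return list(inspect.signature(method).parameters)


params = method_parameters(StubRunner, "_notify_eval_service")
print(params)
```

Replacing `StubRunner` with the real runner class would perform the same existence and signature check against the actual code.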

---

## 📋 Phase 2: Infrastructure Tests

**Goal**: Verify that the notification mechanism works without triggering the agent (no side effects).

### Step 1: Start the Service (Passive Mode)

```bash
# Terminal 1
cd /home/tengxiao/pj/ShinkaEvolve

# Use the passive config (does not trigger the agent)
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config_passive.yaml
```

**Passive mode characteristics:**
- ✅ Receives notifications
- ✅ Records state
- ❌ Does not trigger the agent (interval=999999)
- ✅ Zero side effects

### Step 2: Run the Infrastructure Tests

```bash
# Terminal 2
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_step_by_step.py
```

### Expected Output

```
======================================================================
🧪 EV2 SERVICE INTEGRATION - STEP BY STEP TESTING
======================================================================

============================================================
TEST 1: Service Health Check
============================================================
✅ Service is running
   Status: ready
   Generations processed: 0

============================================================
TEST 2: Notification Mechanism
============================================================
✅ Notification sent successfully
   Response: {
     "status": "received",
     "generation": 1,
     ...
   }

============================================================
TEST 3: Service State After Notifications
============================================================
✅ Service state retrieved
   Total generations: 1
   Agent triggered: 0 times  ← key point: the agent is not triggered
   Last generation: 1

============================================================
TEST 4: Mini Evolution WITHOUT Service (Baseline)
============================================================
📁 Results dir: /tmp/test_shinka_baseline
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
   - eval_service_url: None
   - results_dir: /tmp/test_shinka_baseline

============================================================
TEST 5: Mini Evolution WITH Service (Should be Identical)
============================================================
📁 Results dir: /tmp/test_shinka_with_service
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
   - eval_service_url: http://localhost:8765
   - results_dir: /tmp/test_shinka_with_service
✅ Service URL correctly configured

======================================================================
📊 TEST SUMMARY
======================================================================
  ✅ PASS  Service Health
  ✅ PASS  Notification Mechanism
  ✅ PASS  Service State Check
  ✅ PASS  Evolution WITHOUT Service
  ✅ PASS  Evolution WITH Service
======================================================================
🎉 All tests passed! Integration is working correctly.
======================================================================
```

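### What to Verify

- ✅ The service receives notifications
- ✅ `agent_triggered_count = 0` (the agent was never triggered)
- ✅ Initialization succeeds in both modes
- ✅ Configuration is passed through correctly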
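
To make the notification check more concrete, here is a minimal sketch of a fail-safe notification call. It rests on several assumptions: the payload field names simply mirror the `_notify_eval_service` parameters (`generation`, `combined_score`, `results_dir`), the helper names `build_notification` and `send_notification` are hypothetical, and the notification endpoint path is not stated in this guide, so the caller has to supply the full URL.

```python
import json
import urllib.error
import urllib.request


def build_notification(generation, combined_score, results_dir):
    """Build the JSON body; field names are assumed from the method parameters."""
    return {
        "generation": int(generation),
        "combined_score": float(combined_score),
        "results_dir": str(results_dir),
    }


def send_notification(url, payload, timeout=2.0):
    """POST the payload; return True on a 2xx response, False on any failure.

    Errors are swallowed on purpose (an assumption of this sketch) so that an
    unreachable service cannot change the outcome of the evolution run.
    """
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Returning a boolean instead of raising keeps the "no side effects" property: the caller can log a failure and continue with the next generation.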

---

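## 📋 Phase 3: Result Consistency Tests

**Goal**: Verify that evolution results are identical with and without the service.

### Step 1: Prepare the Test Experiment

Choose a **known, reproducible** experiment: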

```python
# test_consistency.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

def run_experiment(with_service=False, run_id="baseline"):
    """Run a small experiment."""
    
    results_dir = f"/tmp/consistency_test_{run_id}"
    
    evo_config = EvolutionConfig(
        num_generations=10,  # Small but meaningful
        max_parallel_jobs=2,
        results_dir=results_dir,
        # ... your actual config ...
        eval_service_url="http://localhost:8765" if with_service else None
    )
    
    # ... rest of your config ...
    
    runner = EvolutionRunner(evo_config, job_config, db_config)
    runner.run()
    
    return results_dir

# Run both
baseline_dir = run_experiment(with_service=False, run_id="baseline")
with_service_dir = run_experiment(with_service=True, run_id="with_service")

print(f"Baseline: {baseline_dir}")
print(f"With service: {with_service_dir}")
```

### Step 2: Run the Experiments

```bash
# Terminal 1: Service (passive mode)
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config_passive.yaml

# Terminal 2: Run experiments
uv run test_consistency.py
```

### Step 3: Compare the Results

```bash
# Compare database
sqlite3 /tmp/consistency_test_baseline/evolution.db \
  "SELECT generation, combined_score FROM programs ORDER BY generation"

sqlite3 /tmp/consistency_test_with_service/evolution.db \
  "SELECT generation, combined_score FROM programs ORDER BY generation"

# Should be IDENTICAL (or very close due to randomness)
```

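### Expected Results

- ✅ Both experiments produce the same `combined_score` trajectory (if the random seed is fixed)
- ✅ Both produce the same number of programs
- ✅ Runtimes are similar (difference < 1%)
- ✅ Service logs show that notifications were received but the agent was not triggered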
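
Comparing the two `sqlite3` outputs by eye gets tedious beyond a few rows. The comparison can be sketched in Python as follows. The table and column names (`programs`, `generation`, `combined_score`) come from the queries in Step 3. The function names and the `tol` parameter are hypothetical, and the secondary sort on `combined_score` is an assumption added so that rows within a generation line up deterministically.

```python
import sqlite3


def load_trajectory(conn):
    """Return [(generation, combined_score), ...] in a deterministic order."""
    return conn.execute(
        "SELECT generation, combined_score FROM programs "
        "ORDER BY generation, combined_score"
    ).fetchall()


def scores_match(a, b, tol):
    """Compare two scores, treating NULL (None) as equal only to NULL."""
    if a is None or b is None:
        return a is b
    return abs(a - b) <= tol


def compare_trajectories(baseline, with_service, tol=0.0):
    """Return a list of human-readable differences; empty means a match."""
    diffs = []
    if len(baseline) != len(with_service):
        diffs.append(
            f"program count differs: {len(baseline)} vs {len(with_service)}"
        )
    for (gen_a, score_a), (gen_b, score_b) in zip(baseline, with_service):
        if gen_a != gen_b:
            diffs.append(f"generation mismatch: {gen_a} vs {gen_b}")
        elif not scores_match(score_a, score_b, tol):
            diffs.append(f"gen {gen_a}: {score_a} vs {score_b}")
    return diffs
```

To use it, open each database with `sqlite3.connect("/tmp/consistency_test_baseline/evolution.db")` (and the `with_service` counterpart), pass both trajectories to `compare_trajectories`, and treat an empty list as a pass. A small non-zero `tol` may be appropriate when the random seed is not fixed.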

---

## 📋 Phase 4: Full Integration Tests

**Goal**: Enable the agent and verify that auxiliary metrics are generated.

### Step 1: Configure the Agent Trigger

```bash
# Edit eval_agent/ev2_service_config.yaml
# and set a reasonable trigger interval
```

```yaml
trigger_strategy:
  type: "periodic"
  interval: 5  # trigger once every 5 generations
```
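
As a rough illustration of what a periodic trigger with `interval: 5` implies, the following minimal sketch assumes the service fires on every generation number that is a multiple of the interval. The function name `should_trigger` is hypothetical, and the real service may count generations differently (for example, generations since the last trigger).

```python
def should_trigger(generation, interval):
    """Return True when a periodic trigger with this interval would fire."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return generation > 0 and generation % interval == 0


# Generations 1..10 with interval=5
fired = [g for g in range(1, 11) if should_trigger(g, 5)]
print(fired)  # [5, 10]

# The passive config uses interval=999999, so a short run never fires.
passive_fired = [g for g in range(1, 11) if should_trigger(g, 999999)]
print(passive_fired)  # []
```

Under this assumption, a 10-generation run triggers the agent at generations 5 and 10, while the passive interval is never reached.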

### Step 2: Prepare the Primary Evaluator

Make sure the path to your primary evaluator is correct:

```yaml
primary_evaluator:
  path: "/home/tengxiao/pj/ShinkaEvolve/examples/circle_packing/evaluate_ori.py"
```

### Step 3: Start the Service (Active Mode)

```bash
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config.yaml
```

### Step 4: Run the Experiment

```bash
# Terminal 2
uv run my/experiment_with_eval_service.py
```

### Expected Behavior

**Generation 1-4:**
```
Service: ✅ Generation 1 completed (score: 0.50)
Service: ⏳ Not triggering (interval=5, current=1)
Service: ✅ Generation 2 completed (score: 0.52)
Service: ⏳ Not triggering (interval=5, current=2)
...
```

**Generation 5:**
```
Service: ✅ Generation 5 completed (score: 0.58)
Service: 🎯 Trigger condition met (periodic: interval=5)
Service: 🤖 Launching agent...
Agent:   📊 Analyzing 5 generations...
Agent:   🔍 Reading primary evaluator...
Agent:   💡 Generating auxiliary metrics...
Agent:   ✅ Created aux_metrics.py
Service: ✅ Agent completed in 45.2s
Service: 📄 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
```

**Generation 6-9:**
```
Service: ⏳ Not triggering...
```

**Generation 10:**
```
Service: 🎯 Trigger condition met
Service: 🤖 Launching agent...
...
```

### Verify the Output

```bash
# Check the agent output
ls -la results_dir/eval_agent_memory/
# You should see:
# - EVAL_AGENTS.md
# - aux_metrics.py
# - workspace/

# View the analysis report
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# Validate the auxiliary metrics
python -m py_compile results_dir/eval_agent_memory/aux_metrics.py
```
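
The same checks can be scripted. The sketch below looks for the three expected entries and syntax-checks `aux_metrics.py`. It uses `ast.parse` rather than `py_compile`, which validates syntax without writing bytecode. The function name `check_agent_output` is hypothetical.

```python
import ast
from pathlib import Path

# Entries the guide expects inside eval_agent_memory/
EXPECTED_ENTRIES = ("EVAL_AGENTS.md", "aux_metrics.py", "workspace")


def check_agent_output(memory_dir):
    """Return a list of problems found in the agent memory directory."""
    memory_dir = Path(memory_dir)
    problems = [
        f"missing: {name}"
        for name in EXPECTED_ENTRIES
        if not (memory_dir / name).exists()
    ]
    aux = memory_dir / "aux_metrics.py"
    if aux.is_file():
        try:
            # Parse only; the generated code is never executed here.
            ast.parse(aux.read_text(encoding="utf-8"), filename=str(aux))
        except SyntaxError as exc:
            problems.append(f"syntax error in aux_metrics.py: line {exc.lineno}")
    return problems
```

An empty list from `check_agent_output("results_dir/eval_agent_memory")` means all three entries exist and the metrics file parses. It does not say whether the metrics are meaningful.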

---

## 🧪 Full Test Script (Real Experiment)

### Using the Existing Circle Packing Experiment

```python
# eval_agent/test_real_integration.py
"""
Real integration test using Circle Packing example.
"""

import shutil
from pathlib import Path

# Your existing imports
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

def run_circle_packing_test(with_eval_service=False):
    """
    Run circle packing with/without eval service.
    
    Args:
        with_eval_service: Enable eval service integration
    """
    
    # Results directory
    suffix = "with_service" if with_eval_service else "baseline"
    results_dir = Path(f"/tmp/circle_packing_integration_test_{suffix}")
    
    # Clean previous run
    if results_dir.exists():
        shutil.rmtree(results_dir)
    results_dir.mkdir(parents=True)
    
    print("=" * 60)
    print(f"Running Circle Packing {'WITH' if with_eval_service else 'WITHOUT'} Eval Service")
    print(f"Results: {results_dir}")
    print("=" * 60)
    
    # Configuration
    evolution_config = EvolutionConfig(
        num_generations=10,  # Small for testing
        max_parallel_jobs=2,
        results_dir=str(results_dir),
        init_program_path="examples/circle_packing/initial.py",
        
        # Eval service (conditional)
        eval_service_url="http://localhost:8765" if with_eval_service else None,
        
        # ... rest of your config ...
    )
    
    job_config = LocalJobConfig(
        eval_program_path="examples/circle_packing/evaluate_ori.py",
    )
    
    db_config = DatabaseConfig()
    
    # Run
    runner = EvolutionRunner(
        evo_config=evolution_config,
        job_config=job_config,
        db_config=db_config,
        verbose=True
    )
    
    runner.run()
    
    print(f"\n✅ Completed: {results_dir}")
    return results_dir


if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=["baseline", "with-service", "both"],
        default="baseline",
        help="Test mode"
    )
    args = parser.parse_args()
    
    if args.mode in ["baseline", "both"]:
        baseline_dir = run_circle_packing_test(with_eval_service=False)
        print(f"\n📊 Baseline results: {baseline_dir}")
    
    if args.mode in ["with-service", "both"]:
        service_dir = run_circle_packing_test(with_eval_service=True)
        print(f"\n📊 With-service results: {service_dir}")
        
        # Check for agent output
        agent_memory = service_dir / "eval_agent_memory"  # service_dir is already a Path
        if agent_memory.exists():
            print("\n✅ Agent memory found:")
            for f in agent_memory.iterdir():
                print(f"   - {f.name}")
        else:
            print("\n⚠️  No agent memory (agent not triggered yet?)")
```

### Run the Full Test

```bash
# Terminal 1: Service (active mode, interval=5)
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Terminal 2: Baseline only
uv run eval_agent/test_real_integration.py --mode baseline

# Terminal 2: With service only
uv run eval_agent/test_real_integration.py --mode with-service

# Terminal 2: Both (for comparison)
uv run eval_agent/test_real_integration.py --mode both
```

---

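## ✅ Verification Checklist

### Phase 2: Infrastructure

- [ ] Service starts successfully (passive mode)
- [ ] Notification is sent successfully
- [ ] Service receives the notification
- [ ] `agent_triggered_count = 0` (passive mode)
- [ ] Initialization succeeds both with and without the service

### Phase 3: Result Consistency

- [ ] Baseline experiment completes
- [ ] With-service experiment completes
- [ ] `combined_score` trajectories are identical or close
- [ ] Runtime difference < 1%
- [ ] Service logs show that all notifications were received

### Phase 4: Full Integration

- [ ] Service starts (active mode)
- [ ] Agent triggers at the expected generations (gen 5, 10, ...)
- [ ] `EVAL_AGENTS.md` is generated
- [ ] `aux_metrics.py` is generated and syntactically valid
- [ ] Primary metric is not modified
- [ ] Evolution completes normally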

---

## 🐛 Troubleshooting

### The Service Receives No Notifications

**Check:**
```bash
# Is the service running?
curl http://localhost:8765/api/v1/status

# Check the runner.py logs
grep "Notified eval service" results_dir/evolution_run.log
grep "Failed to notify eval service" results_dir/evolution_run.log
```
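
For longer runs it can help to count the two log messages instead of scanning the `grep` output. The sketch below assumes that the marker strings used in the `grep` commands above appear verbatim in the log. The function names are hypothetical.

```python
from pathlib import Path

SUCCESS_MARKER = "Notified eval service"
FAILURE_MARKER = "Failed to notify eval service"


def count_notifications(log_text):
    """Return (sent, failed) counts of notification log lines."""
    lines = log_text.splitlines()
    sent = sum(SUCCESS_MARKER in line for line in lines)
    failed = sum(FAILURE_MARKER in line for line in lines)
    return sent, failed


def count_notifications_in_file(log_path):
    """Same as count_notifications, but reads the log from disk."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return count_notifications(text)
```

Calling `count_notifications_in_file("results_dir/evolution_run.log")` and comparing `sent` with the number of completed generations gives a quick signal of whether notifications were dropped.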

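### Notifications Are Sent but Get No Response

**Possible causes:**
- The service crashed (check Terminal 1)
- The port is already in use (check with `netstat -tuln | grep 8765`)
- Network issues (firewall?)

### The Agent Does Not Trigger

**Check:**
1. Service mode: are you using `ev2_service_config.yaml` or `ev2_service_config_passive.yaml`?
2. Interval setting: is it too large (999999)?
3. Number of generations: is it smaller than the interval?

### Results Are Inconsistent

**Normal:**
- Evolution involves randomness, so results differ slightly
- LLM calls may return different output on each call

**Abnormal:**
- Score difference > 10%: check whether the agent modified the primary evaluator
- Runtime difference > 5%: check for network latency or timeouts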
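
The two thresholds above can be applied mechanically. This is a minimal sketch that treats both differences as relative to the baseline run, which is an assumption since the guide does not define the denominator. The function names are hypothetical.

```python
def relative_diff(baseline, other):
    """Relative difference of `other` versus `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(other - baseline) / abs(baseline)


def classify_run(base_score, svc_score, base_seconds, svc_seconds,
                 score_limit=0.10, time_limit=0.05):
    """Return the names of the checks that exceed the thresholds."""
    warnings = []
    if relative_diff(base_score, svc_score) > score_limit:
        warnings.append("score")    # possible change to the primary evaluator
    if relative_diff(base_seconds, svc_seconds) > time_limit:
        warnings.append("runtime")  # possible network latency or timeouts
    return warnings
```

For example, `classify_run(0.50, 0.52, 100, 102)` stays within both limits and returns an empty list, whereas a run that moves from 0.50 to 0.58 and from 100 s to 110 s exceeds both.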

---

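## 📊 Current Progress

```
✅ Phase 1: Basic functionality tests (completed)
🔄 Phase 2: Infrastructure tests (in progress)
⏳ Phase 3: Result consistency tests
⏳ Phase 4: Full integration tests
```

**Next step**: Run the Phase 2 tests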

```bash
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config_passive.yaml

# Terminal 2
uv run eval_agent/test_integration_step_by_step.py
```