# EV2 Service Integration - Testing Guide

## 🎯 Testing Strategy

We use a **progressive testing** strategy so that each step is verified before the next one begins:

```
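Phase 1: Basic functionality ✓
  └─ Config loading, method exists

Phase 2: Infrastructure ← current phase
  ├─ Service health check
  ├─ Notification mechanism
  └─ No-side-effect verification

Phase 3: Result consistency
  ├─ No service running (baseline)
  ├─ Service running (passive mode)
  └─ Compare results (should be identical)

Phase 4: Full integration
  └─ Enable the agent, verify auxiliary metric generation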
```

---

## 📋 Phase 1: Basic Functionality Tests ✅

**Goal**: Verify that the code changes are correct and do not break existing functionality.

### Run the Tests

```bash
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_basic.py
```

### Expected Output

```
============================================================
EV2 Service Integration - Basic Tests
============================================================
Test 1: Backward compatibility (default config)...
  ✅ Default config: eval_service_url=None

Test 2: Enable eval service...
  ✅ Config with service: eval_service_url='http://localhost:8765'

Test 3: Set via kwargs...
  ✅ Kwargs config works correctly

Test 4: _notify_eval_service method exists...
  ✅ _notify_eval_service method exists
     - Parameters: ['self', 'generation', 'combined_score', 'results_dir']

============================================================
✅ All basic integration tests passed!
============================================================
```

**✅ Done!**
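
Test 4 above reports the parameter list of `_notify_eval_service`. As a rough illustration, a check of that kind could be written with `inspect.signature`. `StubRunner` and `method_parameters` are hypothetical names used only for this sketch; only the method name and its parameter names come from the expected output.

```python
import inspect


class StubRunner:
    """Hypothetical stand-in for the real runner class (assumption)."""

    def _notify_eval_service(self, generation, combined_score, results_dir):
        """Placeholder body; the real method would contact the eval service."""
        return None


def method_parameters(cls, name):
    """Return the parameter names of cls.<name>, or None if it is missing."""
    method = getattr(cls, name, None)
    if method is None:
        return None
    return list(inspect.signature(method).parameters)


params = method_parameters(StubRunner, "_notify_eval_service")
print(params)
```

Looking the method up on the class rather than on an instance keeps `self` in the list, which matches the listing in the expected output.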

---

## 📋 Phase 2: Infrastructure Tests

**Goal**: Verify that the notification mechanism works without triggering the agent (no side effects).

### Step 1: Start the Service (Passive Mode)

```bash
# Terminal 1
cd /home/tengxiao/pj/ShinkaEvolve

# Use the passive config (does not trigger the agent)
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config_passive.yaml
```

**Passive mode characteristics:**
- ✅ Receives notifications
- ✅ Records state
- ❌ Does not trigger the agent (interval=999999)
- ✅ Zero side effects

### Step 2: Run the Infrastructure Tests

```bash
# Terminal 2
cd /home/tengxiao/pj/ShinkaEvolve
uv run eval_agent/test_integration_step_by_step.py
```

### Expected Output

```
======================================================================
🧪 EV2 SERVICE INTEGRATION - STEP BY STEP TESTING
======================================================================

============================================================
TEST 1: Service Health Check
============================================================
✅ Service is running
   Status: ready
   Generations processed: 0

============================================================
TEST 2: Notification Mechanism
============================================================
✅ Notification sent successfully
   Response: {
     "status": "received",
     "generation": 1,
     ...
   }

============================================================
TEST 3: Service State After Notifications
============================================================
✅ Service state retrieved
   Total generations: 1
   Agent triggered: 0 times  ← key point: the agent is not triggered
   Last generation: 1

============================================================
TEST 4: Mini Evolution WITHOUT Service (Baseline)
============================================================
📁 Results dir: /tmp/test_shinka_baseline
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
   - eval_service_url: None
   - results_dir: /tmp/test_shinka_baseline

============================================================
TEST 5: Mini Evolution WITH Service (Should be Identical)
============================================================
📁 Results dir: /tmp/test_shinka_with_service
🚀 Starting evolution (3 generations)...
✅ Evolution runner initialized successfully
   - eval_service_url: http://localhost:8765
   - results_dir: /tmp/test_shinka_with_service
✅ Service URL correctly configured

======================================================================
📊 TEST SUMMARY
======================================================================
  ✅ PASS  Service Health
  ✅ PASS  Notification Mechanism
  ✅ PASS  Service State Check
  ✅ PASS  Evolution WITHOUT Service
  ✅ PASS  Evolution WITH Service
======================================================================
🎉 All tests passed! Integration is working correctly.
======================================================================
```

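### Verification Points

- ✅ The service receives notifications
- ✅ `agent_triggered_count = 0` (agent never triggered)
- ✅ Initialization succeeds in both modes
- ✅ Configuration is passed through correctly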
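
The first two points can be checked mechanically once the service state is available as a dictionary. The sketch below assumes the state exposes `agent_triggered_count` (named above) and a `total_generations` key; the latter key name and the helper `check_passive_state` are assumptions, so adjust them to whatever the service actually returns.

```python
def check_passive_state(state, expected_generations):
    """Return a list of problems found in a passive-mode service state dict."""
    problems = []
    # In passive mode the agent must never have been triggered.
    if state.get("agent_triggered_count", 0) != 0:
        problems.append("agent was triggered in passive mode")
    # Every notification sent should have been recorded (hypothetical key).
    if state.get("total_generations") != expected_generations:
        problems.append("service did not record every notification")
    return problems


sample = {"total_generations": 1, "agent_triggered_count": 0}
print(check_passive_state(sample, expected_generations=1))
```

An empty list means the state is consistent with passive mode.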

---

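## 📋 Phase 3: Result Consistency Tests

**Goal**: Verify that evolution results are identical with and without the service.

### Step 1: Prepare a Test Experiment

Choose a **known, reproducible** experiment: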

```python
# test_consistency.py
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

def run_experiment(with_service=False, run_id="baseline"):
    """Run a small experiment."""
    
    results_dir = f"/tmp/consistency_test_{run_id}"
    
    evo_config = EvolutionConfig(
        num_generations=10,  # Small but meaningful
        max_parallel_jobs=2,
        results_dir=results_dir,
        # ... your actual config ...
        eval_service_url="http://localhost:8765" if with_service else None
    )
    
    # ... rest of your config: define job_config and db_config here ...
    
    runner = EvolutionRunner(evo_config, job_config, db_config)
    runner.run()
    
    return results_dir

# Run both
baseline_dir = run_experiment(with_service=False, run_id="baseline")
with_service_dir = run_experiment(with_service=True, run_id="with_service")

print(f"Baseline: {baseline_dir}")
print(f"With service: {with_service_dir}")
```

### Step 2: Run the Experiments

```bash
# Terminal 1: Service (passive mode)
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config_passive.yaml

# Terminal 2: Run experiments
uv run test_consistency.py
```

### Step 3: Compare the Results

```bash
# Compare database
sqlite3 /tmp/consistency_test_baseline/evolution.db \
  "SELECT generation, combined_score FROM programs ORDER BY generation"

sqlite3 /tmp/consistency_test_with_service/evolution.db \
  "SELECT generation, combined_score FROM programs ORDER BY generation"

# Should be IDENTICAL (or very close due to randomness)
```
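
Comparing two query outputs by eye gets tedious beyond a few generations. A minimal sketch of an automated comparison follows, assuming both databases contain the `programs` table with `generation` and `combined_score` columns used in the queries above. The helper names and the tolerance `tol` are illustrative choices, not part of the project.

```python
import sqlite3


def load_scores(db_path):
    """Return (generation, combined_score) rows in a deterministic order."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT generation, combined_score FROM programs "
            "ORDER BY generation, combined_score"
        ).fetchall()
    finally:
        conn.close()
    return rows


def scores_match(a, b, tol):
    """Compare two scores, treating NULL (None) as equal only to NULL."""
    if a is None or b is None:
        return a is b
    return abs(a - b) <= tol


def compare_runs(baseline_db, service_db, tol=1e-9):
    """Return a list of human-readable differences between the two runs."""
    base, other = load_scores(baseline_db), load_scores(service_db)
    diffs = []
    if len(base) != len(other):
        diffs.append(f"program count differs: {len(base)} vs {len(other)}")
    for (g1, s1), (g2, s2) in zip(base, other):
        if g1 != g2 or not scores_match(s1, s2, tol):
            diffs.append(f"gen {g1}: {s1} vs gen {g2}: {s2}")
    return diffs
```

Calling `compare_runs` with the two `evolution.db` paths from the commands above should return an empty list when the runs match. A looser `tol` may be appropriate if the runs are not seeded.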

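### Expected Results

- ✅ Both experiments follow the same `combined_score` trajectory (provided the random seed is fixed)
- ✅ Same number of programs
- ✅ Similar runtime (difference < 1%)
- ✅ Service logs show notifications received but no agent trigger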

---

## 📋 Phase 4: Full Integration Tests

**Goal**: Enable the agent and verify that auxiliary metrics are generated.

### Step 1: Configure the Agent Trigger

```bash
# Edit eval_agent/ev2_service_config.yaml
# and set a reasonable trigger interval
```

```yaml
trigger_strategy:
  type: "periodic"
  interval: 5  # trigger once every 5 generations
```

### Step 2: Prepare the Primary Evaluator

Make sure the path to your primary evaluator is correct:

```yaml
primary_evaluator:
  path: "/home/tengxiao/pj/ShinkaEvolve/examples/circle_packing/evaluate_ori.py"
```

### Step 3: Start the Service (Active Mode)

```bash
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config.yaml
```

### Step 4: Run the Experiment

```bash
# Terminal 2
uv run my/experiment_with_eval_service.py
```

### Expected Behavior

**Generation 1-4:**
```
Service: ✅ Generation 1 completed (score: 0.50)
Service: ⏳ Not triggering (interval=5, current=1)
Service: ✅ Generation 2 completed (score: 0.52)
Service: ⏳ Not triggering (interval=5, current=2)
...
```

**Generation 5:**
```
Service: ✅ Generation 5 completed (score: 0.58)
Service: 🎯 Trigger condition met (periodic: interval=5)
Service: 🤖 Launching agent...
Agent:   📊 Analyzing 5 generations...
Agent:   🔍 Reading primary evaluator...
Agent:   💡 Generating auxiliary metrics...
Agent:   ✅ Created aux_metrics.py
Service: ✅ Agent completed in 45.2s
Service: 📄 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
```

**Generation 6-9:**
```
Service: ⏳ Not triggering...
```

**Generation 10:**
```
Service: 🎯 Trigger condition met
Service: 🤖 Launching agent...
...
```
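
The schedule in the log excerpts above follows from a simple periodic rule. The sketch below is one plausible reading of `type: "periodic"`, assuming the agent fires whenever the generation number is a multiple of `interval`. The service's real rule may differ, and both function names are hypothetical.

```python
def should_trigger(generation, interval):
    """Periodic rule: fire when the generation number is a multiple of interval."""
    return generation > 0 and generation % interval == 0


def trigger_generations(num_generations, interval):
    """List the 1-based generations at which the agent would be launched."""
    return [g for g in range(1, num_generations + 1) if should_trigger(g, interval)]


print(trigger_generations(10, 5))       # active config, interval=5
print(trigger_generations(10, 999999))  # passive config
```

Under this rule a 10-generation run with `interval=5` launches the agent at generations 5 and 10, while the passive interval of 999999 yields no launches at all.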

### Verify the Output

```bash
# Check the agent output
ls -la results_dir/eval_agent_memory/
# You should see:
# - EVAL_AGENTS.md
# - aux_metrics.py
# - workspace/

# View the analysis report
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# Validate the auxiliary metrics
python -m py_compile results_dir/eval_agent_memory/aux_metrics.py
```
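
The same checks can be scripted so they run at the end of an experiment. This is a minimal sketch that assumes the three entries listed above are the complete expected output; `check_agent_memory` is a hypothetical helper name.

```python
import py_compile
from pathlib import Path

EXPECTED = ("EVAL_AGENTS.md", "aux_metrics.py", "workspace")


def check_agent_memory(results_dir):
    """Return a list of problems with <results_dir>/eval_agent_memory."""
    memory = Path(results_dir) / "eval_agent_memory"
    problems = [
        f"missing: {name}" for name in EXPECTED if not (memory / name).exists()
    ]
    metrics = memory / "aux_metrics.py"
    if metrics.exists():
        try:
            # Same check as `python -m py_compile`: syntax only, no execution.
            py_compile.compile(str(metrics), doraise=True)
        except py_compile.PyCompileError as exc:
            problems.append(f"aux_metrics.py does not compile: {exc}")
    return problems
```

An empty list means all expected files are present and `aux_metrics.py` is syntactically valid. It says nothing about whether the metrics themselves are meaningful.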

---

## 🧪 Full Test Script (Real Experiment)

### Using the Existing Circle Packing Experiment

```python
# eval_agent/test_real_integration.py
"""
Real integration test using Circle Packing example.
"""

import sys
import shutil
from pathlib import Path

# Your existing imports
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

def run_circle_packing_test(with_eval_service=False):
    """
    Run circle packing with/without eval service.
    
    Args:
        with_eval_service: Enable eval service integration
    """
    
    # Results directory
    suffix = "with_service" if with_eval_service else "baseline"
    results_dir = Path(f"/tmp/circle_packing_integration_test_{suffix}")
    
    # Clean previous run
    if results_dir.exists():
        shutil.rmtree(results_dir)
    results_dir.mkdir(parents=True)
    
    print("=" * 60)
    print(f"Running Circle Packing {'WITH' if with_eval_service else 'WITHOUT'} Eval Service")
    print(f"Results: {results_dir}")
    print("=" * 60)
    
    # Configuration
    evolution_config = EvolutionConfig(
        num_generations=10,  # Small for testing
        max_parallel_jobs=2,
        results_dir=str(results_dir),
        init_program_path="examples/circle_packing/initial.py",
        
        # Eval service (conditional)
        eval_service_url="http://localhost:8765" if with_eval_service else None,
        
        # ... rest of your config ...
    )
    
    job_config = LocalJobConfig(
        eval_program_path="examples/circle_packing/evaluate_ori.py",
    )
    
    db_config = DatabaseConfig()
    
    # Run
    runner = EvolutionRunner(
        evo_config=evolution_config,
        job_config=job_config,
        db_config=db_config,
        verbose=True
    )
    
    runner.run()
    
    print(f"\n✅ Completed: {results_dir}")
    return results_dir


if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=["baseline", "with-service", "both"],
        default="baseline",
        help="Test mode"
    )
    args = parser.parse_args()
    
    if args.mode in ["baseline", "both"]:
        baseline_dir = run_circle_packing_test(with_eval_service=False)
        print(f"\n📊 Baseline results: {baseline_dir}")
    
    if args.mode in ["with-service", "both"]:
        service_dir = run_circle_packing_test(with_eval_service=True)
        print(f"\n📊 With-service results: {service_dir}")
        
        # Check for agent output
        agent_memory = Path(service_dir) / "eval_agent_memory"
        if agent_memory.exists():
            print(f"\n✅ Agent memory found:")
            for f in agent_memory.iterdir():
                print(f"   - {f.name}")
        else:
            print(f"\n⚠️  No agent memory (agent not triggered yet?)")
```

### Run the Full Test

```bash
# Terminal 1: Service (active mode, interval=5)
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Terminal 2: Baseline only
uv run eval_agent/test_real_integration.py --mode baseline

# Terminal 2: With service only
uv run eval_agent/test_real_integration.py --mode with-service

# Terminal 2: Both (for comparison)
uv run eval_agent/test_real_integration.py --mode both
```

---

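## ✅ Verification Checklist

### Phase 2: Infrastructure

- [ ] Service starts successfully (passive mode)
- [ ] Notification sent successfully
- [ ] Service receives the notification
- [ ] `agent_triggered_count = 0` (passive mode)
- [ ] Initialization succeeds with and without the service

### Phase 3: Result Consistency

- [ ] Baseline experiment completes
- [ ] With-service experiment completes
- [ ] Both `combined_score` trajectories are identical or close
- [ ] Runtime difference < 1%
- [ ] Service logs show all notifications received

### Phase 4: Full Integration

- [ ] Service starts (active mode)
- [ ] Agent triggers at the expected generations (gen 5, 10, ...)
- [ ] `EVAL_AGENTS.md` is generated
- [ ] `aux_metrics.py` is generated and syntactically valid
- [ ] Primary metric is unmodified
- [ ] Evolution completes normally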

---

## 🐛 Troubleshooting

### Service Receives No Notifications

**Check:**
```bash
# Is the service running?
curl http://localhost:8765/api/v1/status

# Check the runner.py logs
grep "Notified eval service" results_dir/evolution_run.log
grep "Failed to notify eval service" results_dir/evolution_run.log
```
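
The two `grep` commands above can be folded into one pass that reports how many notifications succeeded and how many failed. The sketch relies only on the two log messages quoted above; `count_notifications` is a hypothetical helper.

```python
from pathlib import Path

OK_MARKER = "Notified eval service"
FAIL_MARKER = "Failed to notify eval service"


def count_notifications(log_path):
    """Count successful and failed notification lines in a run log."""
    sent = failed = 0
    with Path(log_path).open(encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if FAIL_MARKER in line:
                failed += 1
            elif OK_MARKER in line:
                sent += 1
    return sent, failed
```

If `sent` is zero, the runner probably has no `eval_service_url` configured. If `failed` is non-zero, the runner tried to reach the service and could not.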

### Notification Sent but No Response

**Possible causes:**
- The service crashed (check Terminal 1)
- The port is already in use (check with `netstat -tuln | grep 8765`)
- Network problems (firewall?)
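
Where `netstat` is not available, a plain TCP connection attempt gives a similar signal. This sketch assumes the service listens on `localhost:8765`, as in the URLs used throughout this guide; `port_is_open` is a hypothetical helper.

```python
import socket


def port_is_open(host="localhost", port=8765, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `True` result only shows that some process is listening. It does not tell you whether that process is the eval service or another program occupying the port, so follow up with the status endpoint.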

### Agent Does Not Trigger

**Check:**
1. Service mode: are you using `ev2_service_config.yaml` or `ev2_service_config_passive.yaml`?
2. Interval setting: is it too large (999999)?
3. Number of generations: is it smaller than the interval?

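### Results Are Inconsistent

**Expected:**
- Evolution with randomness: results differ slightly
- LLM calls: output may differ from run to run

**Abnormal:**
- Score difference > 10%: check whether the agent modified the primary evaluator
- Runtime difference > 5%: check for network latency or timeouts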
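
The two thresholds above can be applied as relative differences against the baseline run. The sketch below assumes both are measured relative to the baseline value; the function names and the sample numbers are illustrative only.

```python
def relative_diff(baseline, other):
    """Relative difference of `other` with respect to `baseline`."""
    if baseline == 0:
        return 0.0 if other == 0 else float("inf")
    return abs(other - baseline) / abs(baseline)


def classify_run(base_score, svc_score, base_secs, svc_secs):
    """Flag the abnormal cases listed above (score > 10%, runtime > 5%)."""
    warnings = []
    if relative_diff(base_score, svc_score) > 0.10:
        warnings.append("score differs by more than 10%")
    if relative_diff(base_secs, svc_secs) > 0.05:
        warnings.append("runtime differs by more than 5%")
    return warnings


print(classify_run(0.58, 0.59, 120.0, 123.0))
print(classify_run(0.58, 0.70, 120.0, 130.0))
```

The first call stays within both thresholds and returns an empty list. The second exceeds both and returns two warnings.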

---

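## 📊 Current Progress

```
✅ Phase 1: Basic functionality tests (done)
🔄 Phase 2: Infrastructure tests (in progress)
⏳ Phase 3: Result consistency tests
⏳ Phase 4: Full integration tests
```

**Next step**: Run the Phase 2 tests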

```bash
# Terminal 1
uv run eval_agent/ev2_service_standalone.py \
  --config eval_agent/ev2_service_config_passive.yaml

# Terminal 2
uv run eval_agent/test_integration_step_by_step.py
```