	# Step 1: Testing Guide for EV2 Service

	## 🎯 What We Built

	A minimal HTTP service wrapper around `ev2.py` that:
	- ✅ Receives generation completion notifications
	- ✅ Autonomously decides when to trigger the EV2 agent
	- ✅ Maintains persistent state across generations
	- ✅ Requires minimal changes to ShinkaEvolve

	## 📋 File Overview

	```
	eval_agent/
	├── ev2_service.py # The HTTP service (NEW)
	├── ev2_service_config.yaml # Configuration file (NEW)
	├── test_ev2_service.py # Test script (NEW)
	├── ev2.py # Original agent logic (UNCHANGED)
	└── ev2_prompt.j2 # Agent prompt (UNCHANGED)
	```

	## 🚀 Step-by-Step Testing

	### Step 1: Install Dependencies

	```bash
	cd /home/tengxiao/pj/ShinkaEvolve
	source venv/bin/activate

	# Install FastAPI and Uvicorn
	pip install fastapi uvicorn pyyaml
	```

	### Step 2: Configure the Service

	Edit `eval_agent/ev2_service_config.yaml` if needed:

	```yaml
	experiment:
	  results_dir: "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215"
	  primary_evaluator: "examples/circle_packing/evaluate_ori.py"

	strategy:
	  trigger_mode: "periodic" # Options: always, periodic, plateau, mixed
	  trigger_interval: 10 # Run agent every 10 generations
	```

	Trigger Modes:
	- `always`: Run agent every generation (for testing)
	- `periodic`: Run every N generations
	- `plateau`: Run when score plateaus
	- `mixed`: Run on periodic OR plateau (whichever comes first)
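The four modes can be read as one small decision function. The sketch below is illustrative and is not taken from `ev2_service.py`: the function name, the `plateaued` flag, and the rule that a periodic trigger fires on positive multiples of the interval are assumptions, chosen to match the expected output in this guide (skip at generation 0, trigger at 10 and 20).

```python
def should_trigger(mode: str, generation: int, interval: int,
                   plateaued: bool = False) -> bool:
    """Return True if the agent should run for this generation (hypothetical helper)."""
    # Assumption: periodic fires on positive multiples of the interval (10, 20, ...).
    periodic = generation > 0 and generation % interval == 0
    if mode == "always":
        return True
    if mode == "periodic":
        return periodic
    if mode == "plateau":
        return plateaued
    if mode == "mixed":
        return periodic or plateaued
    raise ValueError(f"unknown trigger_mode: {mode!r}")


if __name__ == "__main__":
    fired = [g for g in range(25) if should_trigger("periodic", g, interval=10)]
    print(fired)  # [10, 20]
```

Passing the plateau result in as a flag keeps the mode dispatch separate from whatever plateau heuristic the service actually uses.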

	### Step 3: Start the Service

	Terminal 1 (Service):

	```bash
	cd /home/tengxiao/pj/ShinkaEvolve
	source venv/bin/activate

	# Start the service
	python eval_agent/ev2_service.py --config eval_agent/ev2_service_config.yaml
	```

	Expected output:
	```
	INFO: Started server process [12345]
	INFO: Waiting for application startup.
	2026-02-02 15:30:00 - __main__ - INFO - 🚀 Starting EV2 Evaluation Service...
	2026-02-02 15:30:00 - __main__ - INFO - ✅ Service started
	2026-02-02 15:30:00 - __main__ - INFO - Experiment: circle_packing_NO_vision
	2026-02-02 15:30:00 - __main__ - INFO - Trigger mode: periodic
	2026-02-02 15:30:00 - __main__ - INFO - Trigger interval: 10
	INFO: Application startup complete.
	INFO: Uvicorn running on http://0.0.0.0:8765 (Press CTRL+C to quit)
	```

	### Step 4: Test the Service

	Terminal 2 (Test):

	```bash
	cd /home/tengxiao/pj/ShinkaEvolve
	source venv/bin/activate

	# Test 1: Check service status
	python eval_agent/test_ev2_service.py \
	    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
	    --test-mode status
	```

	Expected output:
	```
	🔍 Testing service status...
	✅ Service is running!
	Uptime: 12.3s
	Trigger mode: periodic
	Trigger interval: 10
	```

	```bash
	# Test 2: Simulate evolution (25 generations)
	python eval_agent/test_ev2_service.py \
	    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
	    --test-mode simulate \
	    --num-gens 25
	```

	Expected output:
	```
	🧬 Simulating evolution with 25 generations...
	======================================================================

	📤 Sending notification: gen=0, score=2.4000
	Status: skipped
	Agent triggered: False
	Reason: Not yet (last trigger at gen -1)
	Processing time: 5.2ms

	📤 Sending notification: gen=1, score=2.4050
	Status: skipped
	Agent triggered: False
	Reason: Not yet (last trigger at gen -1)
	Processing time: 3.1ms

	...

	📤 Sending notification: gen=10, score=2.4500
	Status: success
	Agent triggered: True
	Reason: Periodic trigger (interval=10)
	Processing time: 15234.5ms
	Insights: 3 found

	...

	📤 Sending notification: gen=20, score=2.4950
	Status: success
	Agent triggered: True
	Reason: Periodic trigger (interval=10)
	Processing time: 12456.7ms
	Insights: 3 found

	======================================================================
	✅ Simulation complete!
	```

	### Step 5: Check Results

	The service creates/updates:

	```
	examples/circle_packing/results/.../
	└── eval_agent_memory/
	    ├── EVAL_AGENTS.md # Updated by agent
	    ├── auxiliary_metrics.py # Created by agent
	    └── service_state.json # Service state (NEW)
	```

	Check service state:
	```bash
	cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/service_state.json
	```
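If you would rather inspect the state from Python, a minimal reader might look like the following. The helper name is hypothetical, and the sketch assumes nothing about the schema of `service_state.json` beyond it being JSON. It returns `None` when the file is missing or malformed, so the two failure cases are easy to tell apart from a valid (possibly empty) state.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_service_state(results_dir: str) -> Optional[Any]:
    """Return the parsed state, or None if the file is missing or not valid JSON."""
    state_path = Path(results_dir) / "eval_agent_memory" / "service_state.json"
    try:
        return json.loads(state_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

Calling `load_service_state(...)` with the same `results_dir` you passed to the test script before and after a service restart is a quick way to confirm that the state persisted.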

	### Step 6: Test Manual Trigger (Optional)

	```bash
	# Manually trigger agent for generation 5
	python eval_agent/test_ev2_service.py \
	    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
	    --test-mode manual \
	    --generation 5
	```

	## 🔌 API Documentation

	The service provides these endpoints:

	### 1. Generation Notification (Main)

	```bash
	curl -X POST http://localhost:8765/api/v1/notify/generation_complete \
	    -H "Content-Type: application/json" \
	    -d '{
	        "generation": 42,
	        "results_dir": "/path/to/results",
	        "primary_score": 2.5407
	    }'
	```

	Response:
	```json
	{
	  "status": "success",
	  "message": "Periodic trigger (interval=10)",
	  "generation": 42,
	  "agent_triggered": true,
	  "trigger_reason": "Periodic trigger (interval=10)",
	  "insights": ["..."],
	  "auxiliary_metrics": {...},
	  "processing_time_ms": 15234.5
	}
	```
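The same call can be made from Python with only the standard library. This is a sketch, not the client used by `test_ev2_service.py`: the function names and the 60-second default timeout are assumptions. The timeout is generous because the expected output above shows triggered runs taking 12 to 15 seconds.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765"  # host and port from the curl examples


def build_notification(generation: int, results_dir: str, primary_score: float,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /api/v1/notify/generation_complete."""
    body = json.dumps({
        "generation": generation,
        "results_dir": results_dir,
        "primary_score": primary_score,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/notify/generation_complete",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_notification(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send_notification(build_notification(42, "/path/to/results", 2.5407))` should return a dictionary with the fields shown in the response above, such as `agent_triggered` and `trigger_reason`.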

	### 2. Service Status

	```bash
	curl http://localhost:8765/api/v1/status
	```

	### 3. Manual Trigger

	```bash
	curl -X POST "http://localhost:8765/api/v1/trigger/manual?generation=10"
	```

	### 4. Interactive Docs

	Open in browser: http://localhost:8765/docs

	## 🔧 Integration with ShinkaEvolve

	To integrate with ShinkaEvolve, add this to `EvolutionRunner`:

	```python
	# shinka/core/runner.py

	class EvolutionRunner:
	    def __init__(self, config: EvolutionConfig):
	        self.config = config

	        # Optional: base URL of the eval service; None disables notifications
	        self.eval_service_url = getattr(config, "eval_service_url", None)

	    def _evaluate_generation(self, generation: int, code_path: str, results_dir: str):
	        # Run normal evaluation (unchanged)
	        results, score = self.scheduler.run(code_path, results_dir)

	        # Notify eval service (NEW, best-effort)
	        if self.eval_service_url:
	            try:
	                # Imported lazily so requests stays an optional dependency
	                import requests
	                requests.post(
	                    f"{self.eval_service_url}/api/v1/notify/generation_complete",
	                    json={
	                        "generation": generation,
	                        "results_dir": results_dir,
	                        "primary_score": score,
	                    },
	                    # Blocks for at most ~1 s; requests raises a timeout if the
	                    # service has not answered by then, which is logged below
	                    timeout=1,
	                )
	            except Exception as e:
	                self.logger.warning(f"Eval service notification failed: {e}")
	                # Continue regardless

	        return results, score
	```

	Changes required: ~10 lines of code!
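The `requests.post` call above still waits for up to its timeout before the evolution loop continues. If that delay matters, one option is to send the notification from a background thread. The sketch below is an alternative, not part of ShinkaEvolve or `ev2_service.py`: `notify_async`, `_post_json`, and the injectable `send` parameter are hypothetical names, and it uses `urllib` so that it has no third-party dependency.

```python
import json
import threading
import urllib.request
from typing import Callable


def _post_json(url: str, payload: dict, timeout: float) -> None:
    """Blocking POST; raises on connection errors, HTTP errors, or timeouts."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass  # the response body is ignored (fire-and-forget)


def notify_async(url: str, payload: dict, timeout: float = 60.0,
                 send: Callable[[str, dict, float], None] = _post_json) -> threading.Thread:
    """Send the notification from a daemon thread and return immediately."""
    def worker() -> None:
        try:
            send(url, payload, timeout)
        except Exception as exc:  # the evolution loop must never fail because of this
            print(f"Eval service notification failed: {exc}")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, a notification still in flight does not keep the process alive at exit. The `send` parameter exists only so the function can be exercised without a running service.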

	## 📊 Service Decision Logic

	The service decides autonomously when to trigger the agent:

	```
	Generation 0: score=2.40 → SKIP (not yet, interval=10)
	Generation 1: score=2.41 → SKIP
	...
	Generation 10: score=2.45 → TRIGGER (periodic, interval=10) ✅
	Generation 11: score=2.46 → SKIP
	...
	Generation 20: score=2.49 → TRIGGER (periodic, interval=10) ✅
	...
	```

	With `trigger_mode: "mixed"`:
	```
	Generation 0: score=2.40 → SKIP
	Generation 5: score=2.40 → TRIGGER (plateau detected!) ✅
	Generation 10: score=2.45 → TRIGGER (periodic) ✅
	...
	```
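The guide does not spell out how a plateau is detected, so the following is only one plausible reading. It assumes that a higher score is better and treats the run as plateaued when the best of the last `window` scores fails to beat the best earlier score by more than `tol`. The function name, `window=5`, and `tol` are all assumptions. The window is picked so that a flat score from generation 0 is flagged at generation 5, as in the trace above.

```python
from typing import Sequence


def is_plateau(scores: Sequence[float], window: int = 5, tol: float = 1e-6) -> bool:
    """True if the last `window` scores did not beat the earlier best by more than `tol`."""
    if len(scores) <= window:
        return False  # not enough history to compare against
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before <= tol


if __name__ == "__main__":
    flat = [2.40] * 6  # generations 0..5 with no improvement
    print(is_plateau(flat))      # True
    print(is_plateau(flat[:5]))  # False
```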

	## 🎯 What This Achieves

	### Before (without service):
	```python
	# In ShinkaEvolve
	for gen in range(num_generations):
	    score = evaluate(gen)
	    # No auxiliary metrics
	    # No intelligent analysis
	```

	### After (with service):
	```python
	# In ShinkaEvolve (minimal change)
	for gen in range(num_generations):
	    score = evaluate(gen)
	    notify_service(gen, score) # ← Just one line!

	# Service independently:
	# - Decides when to analyze
	# - Runs EV2 agent
	# - Creates auxiliary metrics
	# - Accumulates insights
	```

	## ✅ Success Criteria

	You've successfully tested Step 1 if:

	1. ✅ Service starts without errors
	2. ✅ Service responds to notifications
	3. ✅ Service correctly skips some generations (based on strategy)
	4. ✅ Service triggers agent at the right times
	5. ✅ Agent creates/updates EVAL_AGENTS.md and auxiliary_metrics.py
	6. ✅ Service state persists (check service_state.json)
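Criteria 5 and 6 can be checked mechanically. The helper below is a hypothetical convenience, not part of the repository. It only verifies that the three files listed under Step 5 exist in `eval_agent_memory/`, and does not look at their contents.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = ("EVAL_AGENTS.md", "auxiliary_metrics.py", "service_state.json")


def missing_outputs(results_dir: str) -> List[str]:
    """Return the expected files that are absent from eval_agent_memory/."""
    memory_dir = Path(results_dir) / "eval_agent_memory"
    return [name for name in EXPECTED_FILES if not (memory_dir / name).is_file()]
```

An empty list means all three files are present. Otherwise the returned names show which output to look for in the service logs.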

	## 🐛 Troubleshooting

	### Service won't start

	Error: `ModuleNotFoundError: No module named 'fastapi'`
	Fix: `pip install fastapi uvicorn pyyaml`

	### Service starts but test fails

	Error: `Cannot connect to service`
	Fix: Check that the service is running on port 8765. Try: `curl http://localhost:8765/api/v1/status`

	### Agent doesn't trigger

	Check:
	1. Is `agent_enabled: true` in config?
	2. Are you sending enough generations? (interval=10 means trigger at gen 10, 20, 30...)
	3. Check service logs in Terminal 1

	### Agent fails to run

	Error in service logs: `Primary evaluator not found`
	Fix: Check `primary_evaluator` path in config is correct
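Path and mode mistakes like this can be caught before the service starts. The sketch below validates an already-parsed config dictionary (for example, the result of `yaml.safe_load`). It is not the validation performed by `ev2_service.py`: the function name is hypothetical, and it assumes that `primary_evaluator` is resolved relative to the directory the service is started from.

```python
from pathlib import Path
from typing import List

VALID_MODES = {"always", "periodic", "plateau", "mixed"}


def validate_config(config: dict, project_root: str = ".") -> List[str]:
    """Return human-readable problems found in an already-parsed config dict."""
    problems = []
    experiment = config.get("experiment") or {}
    strategy = config.get("strategy") or {}

    # Assumption: the evaluator path is relative to project_root.
    evaluator = experiment.get("primary_evaluator")
    if not evaluator:
        problems.append("experiment.primary_evaluator is not set")
    elif not (Path(project_root) / evaluator).is_file():
        problems.append(f"primary evaluator not found: {evaluator}")

    mode = strategy.get("trigger_mode")
    if mode not in VALID_MODES:
        problems.append(f"unknown strategy.trigger_mode: {mode!r}")

    # The interval only matters for modes that include a periodic trigger.
    interval = strategy.get("trigger_interval")
    if mode in {"periodic", "mixed"} and not (isinstance(interval, int) and interval > 0):
        problems.append("strategy.trigger_interval must be a positive integer")
    return problems
```

An empty list means the checked fields look sane. Each returned string names the offending key.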

	## 🚀 Next Steps

	After Step 1 works:

	Step 2: Add intelligent decision-making
	- More sophisticated trigger strategies
	- Plateau detection improvements
	- Alert levels

	Step 3: Add persistent memory
	- SQLite database for history
	- Metric tracking
	- Correlation analysis

	Step 4: Add MetricUnit management
	- Object-oriented metrics
	- Lifecycle management
	- Validation system

	## 📝 Notes

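	- The service is decoupled from ShinkaEvolve: it never changes the evolution process, and with the integration snippet above a notification delays a generation by at most the 1-second client timeout
	- If the service crashes or is unreachable, ShinkaEvolve logs a warning and continues normally (fire-and-forget)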
	- Service state is saved to disk, so it survives restarts
	- All agent logic from `ev2.py` is preserved and unchanged

	---

	Ready to test? Start the service and run the tests! 🚀