	# EV2 Eval Service - Usage Guide

	## 📖 Overview

	The EV2 Eval Service is now integrated into ShinkaEvolve as an optional, non-blocking feature. It provides:

	- ✅ Dynamic metric evolution during code evolution
	- ✅ Persistent memory across evolution runs
	- ✅ Real-time supervision of evolution progress
	- ✅ Autonomous decision-making (when to trigger analysis)
	- ✅ Zero impact on evolution if disabled or unavailable

	---

	## 🚀 Quick Start

	### Step 1: Start the EV2 Service

	In a separate terminal:

	```bash
	cd /home/tengxiao/pj/ShinkaEvolve  # adjust to your local ShinkaEvolve checkout

	# Start the service
	uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml
	```

	The service will start on `http://localhost:8765` by default.

	### Step 2: Enable in Your Experiment

	Method A: Python Code

	```python
	from shinka.core import EvolutionRunner, EvolutionConfig
	from shinka.launch import LocalJobConfig
	from shinka.database import DatabaseConfig

	# Create evolution config with eval service enabled
	evolution_config = EvolutionConfig(
	    num_generations=100,
	    max_parallel_jobs=4,
	    # ... other configs ...

	    # Enable EV2 Eval Service
	    eval_service_url="http://localhost:8765",
	)

	# job_config (LocalJobConfig) and db_config (DatabaseConfig) are assumed
	# to be set up as in any other ShinkaEvolve experiment.

	# Run evolution as usual
	runner = EvolutionRunner(
	    evo_config=evolution_config,
	    job_config=job_config,
	    db_config=db_config,
	)

	runner.run()  # Service will be notified automatically
	```

	Method B: YAML Config

	```yaml
	# experiment_config.yaml
	evolution:
	  num_generations: 100
	  max_parallel_jobs: 4
	  # ... other configs ...

	  # Enable EV2 Eval Service
	  eval_service_url: "http://localhost:8765"
	```

	Then load it in your script:

	```python
	import yaml
	from shinka.core import EvolutionConfig

	with open("experiment_config.yaml") as f:
	    config_data = yaml.safe_load(f)

	# Keys under "evolution" are passed straight through as keyword arguments
	evo_config = EvolutionConfig(**config_data["evolution"])
	```

	### Step 3: Run Evolution

	```bash
	uv run my/experiment_script.py
	```

	What happens:
	1. Evolution runs normally
	2. After each generation, ShinkaEvolve notifies the service
	3. Service autonomously decides when to trigger agent analysis
	4. Agent generates auxiliary metrics (stored in `results_dir/eval_agent_memory/`)
	5. Evolution continues unaffected
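Steps 2 and 5 above can be pictured with a minimal sketch of a non-blocking notification. This is not the ShinkaEvolve implementation: the function names `build_payload`, `notify_generation_complete`, and `notify_in_background` are hypothetical, and only the endpoint path and the three payload fields are taken from the API section of this guide.

```python
import json
import threading
import urllib.request


def build_payload(generation: int, results_dir: str, primary_score: float) -> bytes:
    """Encode the three fields the service receives as a JSON body."""
    return json.dumps(
        {
            "generation": generation,
            "results_dir": results_dir,
            "primary_score": primary_score,
        }
    ).encode("utf-8")


def notify_generation_complete(base_url: str, payload: bytes, timeout: float = 1.0) -> bool:
    """POST one notification; return False instead of raising on a network error."""
    request = urllib.request.Request(
        f"{base_url}/api/v1/notify/generation_complete",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Connection refused, timeouts, and HTTP error responses are all OSError subclasses.
        return False


def notify_in_background(base_url: str, payload: bytes) -> threading.Thread:
    """Fire and forget: send from a daemon thread so the caller never waits."""
    thread = threading.Thread(
        target=notify_generation_complete, args=(base_url, payload), daemon=True
    )
    thread.start()
    return thread
```

Because the sender swallows network errors and runs in a daemon thread, a stopped or unreachable service costs the evolution loop nothing beyond starting the thread.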

	---

	## 📁 File Locations

	### Service Configuration

	- Config: `eval_agent/ev2_service_config.yaml`
	- Service: `eval_agent/ev2_service_standalone.py`
	- System Prompt: `eval_agent/ev2_prompt.j2`

	### Agent Output

	During evolution, the agent creates:

	```
	results_dir/
	└── eval_agent_memory/
	    ├── EVAL_AGENTS.md   # Analysis reports
	    ├── aux_metrics.py   # Generated auxiliary metrics
	    └── workspace/       # Agent workspace
	```

	---

	## ⚙️ Service Configuration

	Edit `eval_agent/ev2_service_config.yaml`:

	```yaml
	# Trigger Strategy
	trigger_strategy:
	  type: "periodic"        # or "plateau" or "mixed"
	  interval: 5             # Trigger every N generations
	  patience: 3             # For plateau detection
	  min_improvement: 0.01

	# LLM Configuration
	llm:
	  model: "vertex_ai/gemini-2.5-flash"
	  api_key_env: "LLM_API_KEY"
	  base_url_env: "LLM_BASE_URL"

	# Service Settings
	service:
	  host: "0.0.0.0"
	  port: 8765
	```

	### Trigger Strategies

	1. Periodic (Default)
	- Triggers every N generations
	- Simple and predictable
	- Best for: Regular monitoring

	```yaml
	trigger_strategy:
	  type: "periodic"
	  interval: 5  # Every 5 generations
	```

	2. Plateau Detection
	- Triggers when improvement stagnates
	- Adaptive and efficient
	- Best for: Long runs with varying progress

	```yaml
	trigger_strategy:
	  type: "plateau"
	  patience: 3            # Wait 3 gens without improvement
	  min_improvement: 0.01  # Threshold for "improvement"
	```

	3. Mixed (Recommended)
	- Combines periodic + plateau
	- Balanced approach
	- Best for: Production use

	```yaml
	trigger_strategy:
	  type: "mixed"
	  interval: 10  # Max 10 gens between triggers
	  patience: 3
	  min_improvement: 0.01
	```
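The three strategies can be sketched as a single decision function. This is an illustrative reading of the config keys, not the service's actual code: `TriggerConfig`, `is_plateau`, and `should_trigger` are hypothetical names, and the sketch assumes that a higher `primary_score` is better and that "plateau" means the best score of the last `patience` generations beats the earlier best by less than `min_improvement`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriggerConfig:
    type: str = "periodic"  # "periodic", "plateau", or "mixed"
    interval: int = 5
    patience: int = 3
    min_improvement: float = 0.01


def is_plateau(scores: List[float], patience: int, min_improvement: float) -> bool:
    """True if the last `patience` scores improve on the earlier best by less than min_improvement."""
    if len(scores) <= patience:
        return False  # not enough history to compare against
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before < min_improvement


def should_trigger(cfg: TriggerConfig, scores: List[float], gens_since_trigger: int) -> bool:
    """Decide whether to run the agent after the latest generation."""
    periodic = gens_since_trigger >= cfg.interval
    plateau = is_plateau(scores, cfg.patience, cfg.min_improvement)
    if cfg.type == "periodic":
        return periodic
    if cfg.type == "plateau":
        return plateau
    if cfg.type == "mixed":
        return periodic or plateau  # whichever condition fires first
    raise ValueError(f"unknown trigger type: {cfg.type!r}")
```

Under this reading, `mixed` treats `interval` as an upper bound on the gap between triggers, with plateau detection able to fire earlier. A real service would presumably also reset its plateau window after each trigger so that a long flat stretch does not fire on every generation.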

	---

	## 🔌 API Endpoints

	### Check Service Status

	```bash
	curl http://localhost:8765/api/v1/status
	```

	Response:
	```json
	{
	  "status": "ready",
	  "total_generations": 15,
	  "agent_triggered_count": 3,
	  "last_generation": 15,
	  "last_trigger_generation": 15,
	  "service_uptime_seconds": 1234.56
	}
	```
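The same check can be scripted. The sketch below assumes only the endpoint and the response fields shown above; `fetch_status` and `summarize_status` are hypothetical helper names, not part of the service or of ShinkaEvolve.

```python
import json
import urllib.request


def fetch_status(base_url: str = "http://localhost:8765", timeout: float = 2.0) -> dict:
    """GET /api/v1/status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status", timeout=timeout) as response:
        return json.load(response)


def summarize_status(status: dict) -> str:
    """Turn a status payload into a one-line summary."""
    since_trigger = status["last_generation"] - status["last_trigger_generation"]
    return (
        f"{status['status']}: {status['total_generations']} generations seen, "
        f"{status['agent_triggered_count']} agent runs, "
        f"{since_trigger} generations since the last trigger"
    )
```

With the service running, `print(summarize_status(fetch_status()))` gives a compact progress line that is easy to drop into a monitoring loop.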

	### Manual Trigger (Optional)

	Force agent analysis:

	```bash
	curl -X POST http://localhost:8765/api/v1/notify/generation_complete \
	  -H "Content-Type: application/json" \
	  -d '{
	    "generation": 10,
	    "results_dir": "/path/to/results",
	    "primary_score": 0.85
	  }'
	```

	---

	## 🧪 Testing

	### Test 1: Basic Integration (No Service)

	Test backward compatibility:

	```bash
	# Don't start the service
	uv run eval_agent/test_integration_basic.py
	```

	Expected:
	- ✅ All tests pass
	- ✅ Config works correctly
	- ✅ No errors

	### Test 2: Service Connectivity

	Test with service running:

	```bash
	# Terminal 1: Start service
	uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

	# Terminal 2: Check status
	curl http://localhost:8765/api/v1/status
	```

	Expected:
	```json
	{"status": "ready", "total_generations": 0, ...}
	```

	### Test 3: Simulated Evolution

	Test notification flow:

	```bash
	# Terminal 1: Service running (see above)

	# Terminal 2: Simulate generations
	uv run eval_agent/test_ev2_service.py
	```

	Expected:
	- ✅ Notifications sent successfully
	- ✅ Agent triggered based on strategy
	- ✅ `EVAL_AGENTS.md` created in results directory

	---

	## 🐛 Troubleshooting

	### Service Not Responding

	Symptom:
	```
	Failed to notify eval service: Connection refused
	```

	Solution:
	1. Check if service is running: `curl http://localhost:8765/api/v1/status`
	2. Verify port is not in use: `netstat -tuln | grep 8765`
	3. Check service logs for errors

	Note: Evolution continues normally even if service is down.
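If `curl` or `netstat` is not available (for example on a minimal compute node), the reachability check can be done from Python. This is a small sketch using only the standard library; `is_port_open` is a hypothetical helper, and the default host and port are the ones this guide uses.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8765, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, and timeouts all end up here.
        return False
```

A `False` result means nothing is accepting connections on that port, so the service is either not running or bound to a different host or port.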

	### Agent Not Triggering

	Symptom: Service receives notifications but agent doesn't run.

	Check:
	1. View service logs to see trigger decision
	2. Check trigger strategy config (interval might be too high)
	3. Verify `primary_evaluator_path` in config

	### Memory/Workspace Issues

	Symptom: Agent fails with file not found errors.

	Solution:
	```bash
	# Clean old agent memory
	rm -rf results_dir/eval_agent_memory

	# Service will create fresh workspace on next trigger
	```

	### Import Errors

	Symptom:
	```
	ModuleNotFoundError: No module named 'requests'
	```

	Solution:
	Install missing dependencies:
	```bash
	uv pip install requests fastapi uvicorn pyyaml
	```

	---

	## 📊 Monitoring Evolution

	### During Evolution

	Watch service logs:
	```bash
	# Service terminal shows:
	✅ Generation 5 completed (score: 0.75)
	🎯 Trigger condition met (periodic: interval=5)
	🔄 Agent working...
	✅ Agent completed in 45.2s
	📊 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
	```

	Check agent output:
	```bash
	# View latest analysis
	cat results_dir/eval_agent_memory/EVAL_AGENTS.md

	# View generated metrics
	cat results_dir/eval_agent_memory/aux_metrics.py
	```

	### After Evolution

	Service state:
	```bash
	curl http://localhost:8765/api/v1/status | jq
	```

	Agent insights:
	```bash
	# List all agent output files
	ls -la results_dir/eval_agent_memory/
	```

	---

	## 🔒 Security Considerations

	### Network Access

	- Service binds to `0.0.0.0` by default (accessible from network)
	- For localhost-only: Change to `host: "127.0.0.1"` in config
	- No authentication required (trusted environment assumed)

	### Data Privacy

	- Service only receives: generation number, score, results_dir path
	- No code or sensitive data transmitted
	- All agent memory stored locally

	### Resource Limits

	- Each agent run can take 30-120 seconds
	- Configure `interval` to control frequency
	- Monitor disk usage of `eval_agent_memory/workspace/`
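Disk usage of the workspace can be checked with a few lines of standard-library Python. This is a minimal sketch; `directory_size_bytes` is a hypothetical helper, and the path in the usage note follows the layout shown under "Agent Output".

```python
from pathlib import Path


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (0 if root does not exist)."""
    base = Path(root)
    if not base.is_dir():
        return 0
    return sum(p.stat().st_size for p in base.rglob("*") if p.is_file())
```

For example, `directory_size_bytes("results_dir/eval_agent_memory/workspace") / 1e6` reports the workspace size in megabytes.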

	---

	## 💡 Best Practices

	### 1. Start with Periodic Strategy

	```yaml
	trigger_strategy:
	  type: "periodic"
	  interval: 10  # Not too frequent
	```

	Why: Predictable, easy to debug, good baseline.

	### 2. Use Mixed for Long Runs

	```yaml
	trigger_strategy:
	  type: "mixed"
	  interval: 20
	  patience: 5
	  min_improvement: 0.02
	```

	Why: Adapts to evolution dynamics, saves tokens.

	### 3. Monitor First Few Triggers

	- Watch service logs for first 2-3 triggers
	- Verify agent completes successfully
	- Check `EVAL_AGENTS.md` quality
	- Adjust interval if needed

	### 4. Clean Memory Between Experiments

	```bash
	# Before new experiment
	rm -rf old_results_dir/eval_agent_memory
	```

	Why: Prevents cross-contamination of agent insights.

	### 5. Keep Service Running

	- Start service once, reuse for multiple experiments
	- Service maintains state across runs
	- Restart only when changing config

	---

	## 🎯 Example: Complete Workflow

	```bash
	# ========================================
	# Terminal 1: Start EV2 Service
	# ========================================
	cd /home/tengxiao/pj/ShinkaEvolve

	# Edit config if needed
	vim eval_agent/ev2_service_config.yaml

	# Start service
	uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

	# Service logs:
	# 🚀 EV2 Service starting...
	# ✅ Service ready on http://0.0.0.0:8765


	# ========================================
	# Terminal 2: Run Evolution
	# ========================================
	cd /home/tengxiao/pj/ShinkaEvolve

	# Create experiment script
	cat > my/experiment_with_eval_service.py << 'EOF'
	from shinka.core import EvolutionRunner, EvolutionConfig
	from shinka.launch import LocalJobConfig
	from shinka.database import DatabaseConfig

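	evo_config = EvolutionConfig(
	    num_generations=50,
	    max_parallel_jobs=4,
	    eval_service_url="http://localhost:8765",  # Enable service
	    # ... other configs ...
	)

	# job_config and db_config: build these as in your existing experiment scripts
	runner = EvolutionRunner(evo_config=evo_config, job_config=job_config, db_config=db_config)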
	runner.run()
	EOF

	# Run evolution
	uv run my/experiment_with_eval_service.py

	# Evolution logs:
	# Generation 1/50 completed...
	# Generation 5/50 completed...
	# (Service notified automatically)


	# ========================================
	# Terminal 3: Monitor (Optional)
	# ========================================

	# Check service status
	watch -n 5 'curl -s http://localhost:8765/api/v1/status | jq'

	# Watch agent output
	watch -n 10 'tail -20 results_dir/eval_agent_memory/EVAL_AGENTS.md'


	# ========================================
	# After Evolution: Review Results
	# ========================================

	# View all agent analyses
	cat results_dir/eval_agent_memory/EVAL_AGENTS.md

	# Check generated metrics
	cat results_dir/eval_agent_memory/aux_metrics.py

	# Service statistics
	curl http://localhost:8765/api/v1/status | jq
	```

	---

	## 📚 Additional Resources

	- Integration Plan: `eval_agent/INTEGRATION_PLAN.md`
	- Service Design: `eval_agent/design_draft/HYBRID_EVAL_SERVICE_DESIGN.md`
	- System Prompt: `eval_agent/ev2_prompt.j2`
	- API Documentation: Visit `http://localhost:8765/docs` when service is running

	---

	## ❓ FAQ

	### Q: Does this slow down evolution?

	A: No. Notifications are fire-and-forget with a 1-second timeout, so the added overhead is under 5 ms per generation.

	### Q: What if the service crashes?

	A: Evolution continues unaffected. Service notifications fail silently (debug logs only).

	### Q: Can I disable the service mid-evolution?

	A: Yes. Just stop the service. Evolution won't be affected.

	### Q: How much does the agent cost (tokens)?

	A: It depends on the trigger frequency and the LLM model. Example:
	- Gemini Flash: ~$0.01-0.05 per trigger
	- Triggering every 10 generations: 10 triggers, so roughly $0.10-0.50 for a 100-generation run
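The arithmetic behind this estimate is simple enough to script for other run lengths. The sketch assumes a purely periodic strategy and uses the per-trigger range quoted above as defaults; `estimate_cost` is a hypothetical helper, and real costs depend on the model and on how much context each agent run consumes.

```python
from typing import Tuple


def estimate_cost(
    num_generations: int,
    interval: int,
    cost_low: float = 0.01,
    cost_high: float = 0.05,
) -> Tuple[int, float, float]:
    """Return (triggers, low_estimate, high_estimate) in dollars for a periodic strategy."""
    triggers = num_generations // interval
    return triggers, triggers * cost_low, triggers * cost_high
```

For a 100-generation run with `interval=10` this gives 10 triggers, which puts the quoted $0.50 at the upper end of the range.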

	### Q: Can I use this with SLURM jobs?

	A: Yes. Make sure the service URL is accessible from compute nodes.

	### Q: Multiple experiments, one service?

	A: Yes! Service maintains separate state per `results_dir`.
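One way to picture per-experiment state is a registry keyed by `results_dir`. This is a hypothetical sketch of the idea, not the service's actual data model: `ExperimentState` and `StateRegistry` are invented names, and the tracked fields are chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentState:
    scores: List[float] = field(default_factory=list)
    agent_triggered_count: int = 0


class StateRegistry:
    """Keeps one independent ExperimentState per results_dir."""

    def __init__(self) -> None:
        self._states: Dict[str, ExperimentState] = {}

    def record(self, results_dir: str, primary_score: float) -> ExperimentState:
        """Append a score to the state for results_dir, creating it on first use."""
        state = self._states.setdefault(results_dir, ExperimentState())
        state.scores.append(primary_score)
        return state
```

Since each notification carries its own `results_dir`, two experiments that report to the same service never see each other's score history in a design like this.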

	---

	Ready to evolve smarter? 🚀

	Start the service and add one line to your config:
	```python
	eval_service_url="http://localhost:8765"
	```