# EV2 Eval Service - Usage Guide

## 📖 Overview

The EV2 Eval Service is now integrated into ShinkaEvolve as an **optional, non-blocking** feature. It provides:

- ✅ **Dynamic metric evolution** during code evolution
- ✅ **Persistent memory** across evolution runs
- ✅ **Real-time supervision** of evolution progress
- ✅ **Autonomous decision-making** (when to trigger analysis)
- ✅ **Zero impact** on evolution if disabled or unavailable

---

## 🚀 Quick Start

### Step 1: Start the EV2 Service

In a separate terminal:

```bash
# Adjust the path to your local checkout
cd /home/tengxiao/pj/ShinkaEvolve

# Start the service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml
```

The service will start on `http://localhost:8765` by default.

### Step 2: Enable in Your Experiment

**Method A: Python Code**

```python
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

# Create evolution config with eval service enabled
evolution_config = EvolutionConfig(
    num_generations=100,
    max_parallel_jobs=4,
    # ... other configs ...

    # Enable EV2 Eval Service
    eval_service_url="http://localhost:8765"
)

# job_config and db_config are your usual LocalJobConfig and
# DatabaseConfig instances, created as in any other experiment script.

# Run evolution as usual
runner = EvolutionRunner(
    evo_config=evolution_config,
    job_config=job_config,
    db_config=db_config
)
runner.run()  # Service will be notified automatically
```

**Method B: YAML Config**

```yaml
# experiment_config.yaml
evolution:
  num_generations: 100
  max_parallel_jobs: 4
  # ... other configs ...

  # Enable EV2 Eval Service
  eval_service_url: "http://localhost:8765"
```

Then load it in your script:

```python
import yaml
from shinka.core import EvolutionConfig

with open("experiment_config.yaml") as f:
    config_data = yaml.safe_load(f)

evo_config = EvolutionConfig(**config_data["evolution"])
```

### Step 3: Run Evolution

```bash
uv run my/experiment_script.py
```

**What happens:**

1. Evolution runs normally
2. After each generation, ShinkaEvolve notifies the service
3.
   Service autonomously decides when to trigger agent analysis
4. Agent generates auxiliary metrics (stored in `results_dir/eval_agent_memory/`)
5. Evolution continues unaffected

---

## 📁 File Locations

### Service Configuration

- **Config**: `eval_agent/ev2_service_config.yaml`
- **Service**: `eval_agent/ev2_service_standalone.py`
- **System Prompt**: `eval_agent/ev2_prompt.j2`

### Agent Output

During evolution, the agent creates:

```
results_dir/
└── eval_agent_memory/
    ├── EVAL_AGENTS.md      # Analysis reports
    ├── aux_metrics.py      # Generated auxiliary metrics
    └── workspace/          # Agent workspace
```

---

## ⚙️ Service Configuration

Edit `eval_agent/ev2_service_config.yaml`:

```yaml
# Trigger Strategy
trigger_strategy:
  type: "periodic"        # or "plateau" or "mixed"
  interval: 5             # Trigger every N generations
  patience: 3             # For plateau detection
  min_improvement: 0.01

# LLM Configuration
llm:
  model: "vertex_ai/gemini-2.5-flash"
  api_key_env: "LLM_API_KEY"
  base_url_env: "LLM_BASE_URL"

# Service Settings
service:
  host: "0.0.0.0"
  port: 8765
```

### Trigger Strategies

**1. Periodic (Default)**

- Triggers every N generations
- Simple and predictable
- Best for: Regular monitoring

```yaml
trigger_strategy:
  type: "periodic"
  interval: 5  # Every 5 generations
```

**2. Plateau Detection**

- Triggers when improvement stagnates
- Adaptive and efficient
- Best for: Long runs with varying progress

```yaml
trigger_strategy:
  type: "plateau"
  patience: 3             # Wait 3 gens without improvement
  min_improvement: 0.01   # Threshold for "improvement"
```

**3.
Mixed (Recommended)**

- Combines periodic + plateau
- Balanced approach
- Best for: Production use

```yaml
trigger_strategy:
  type: "mixed"
  interval: 10            # Max 10 gens between triggers
  patience: 3
  min_improvement: 0.01
```

---

## 🔌 API Endpoints

### Check Service Status

```bash
curl http://localhost:8765/api/v1/status
```

**Response:**

```json
{
  "status": "ready",
  "total_generations": 15,
  "agent_triggered_count": 3,
  "last_generation": 15,
  "last_trigger_generation": 15,
  "service_uptime_seconds": 1234.56
}
```

### Manual Trigger (Optional)

Force agent analysis:

```bash
curl -X POST http://localhost:8765/api/v1/notify/generation_complete \
  -H "Content-Type: application/json" \
  -d '{
    "generation": 10,
    "results_dir": "/path/to/results",
    "primary_score": 0.85
  }'
```

---

## 🧪 Testing

### Test 1: Basic Integration (No Service)

Test backward compatibility:

```bash
# Don't start the service
uv run eval_agent/test_integration_basic.py
```

**Expected:**

- ✅ All tests pass
- ✅ Config works correctly
- ✅ No errors

### Test 2: Service Connectivity

Test with service running:

```bash
# Terminal 1: Start service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Terminal 2: Check status
curl http://localhost:8765/api/v1/status
```

**Expected:**

```json
{"status": "ready", "total_generations": 0, ...}
```

### Test 3: Simulated Evolution

Test notification flow:

```bash
# Terminal 1: Service running (see above)

# Terminal 2: Simulate generations
uv run eval_agent/test_ev2_service.py
```

**Expected:**

- ✅ Notifications sent successfully
- ✅ Agent triggered based on strategy
- ✅ `EVAL_AGENTS.md` created in results directory

---

## 🐛 Troubleshooting

### Service Not Responding

**Symptom:**

```
Failed to notify eval service: Connection refused
```

**Solution:**

1. Check if service is running: `curl http://localhost:8765/api/v1/status`
2. Check what is listening on the port (it should be the service, not another process): `netstat -tuln | grep 8765`
3.
   Check service logs for errors

**Note:** Evolution continues normally even if service is down.

### Agent Not Triggering

**Symptom:** Service receives notifications but agent doesn't run.

**Check:**

1. View service logs to see trigger decision
2. Check trigger strategy config (interval might be too high)
3. Verify `primary_evaluator_path` in config

### Memory/Workspace Issues

**Symptom:** Agent fails with file not found errors.

**Solution:**

```bash
# Clean old agent memory
rm -rf results_dir/eval_agent_memory

# Service will create fresh workspace on next trigger
```

### Import Errors

**Symptom:**

```
ModuleNotFoundError: No module named 'requests'
```

**Solution:** Install missing dependencies:

```bash
uv pip install requests fastapi uvicorn pyyaml
```

---

## 📊 Monitoring Evolution

### During Evolution

**Watch service logs:**

```bash
# Service terminal shows:
✅ Generation 5 completed (score: 0.75)
🎯 Trigger condition met (periodic: interval=5)
🔄 Agent working...
✅ Agent completed in 45.2s
📊 Analysis saved to eval_agent_memory/EVAL_AGENTS.md
```

**Check agent output:**

```bash
# View latest analysis
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# View generated metrics
cat results_dir/eval_agent_memory/aux_metrics.py
```

### After Evolution

**Service state:**

```bash
curl http://localhost:8765/api/v1/status | jq
```

**Agent insights:**

```bash
# Read all analyses
ls -la results_dir/eval_agent_memory/
```

---

## 🔒 Security Considerations

### Network Access

- Service binds to `0.0.0.0` by default (accessible from network)
- For localhost-only: Change to `host: "127.0.0.1"` in config
- No authentication required (trusted environment assumed)

### Data Privacy

- Service only receives: generation number, score, results_dir path
- No code or sensitive data transmitted
- All agent memory stored locally

### Resource Limits

- Each agent run can take 30-120 seconds
- Configure `interval` to control frequency
- Monitor disk usage of
  `eval_agent_memory/workspace/`

---

## 💡 Best Practices

### 1. Start with Periodic Strategy

```yaml
trigger_strategy:
  type: "periodic"
  interval: 10  # Not too frequent
```

**Why:** Predictable, easy to debug, good baseline.

### 2. Use Mixed for Long Runs

```yaml
trigger_strategy:
  type: "mixed"
  interval: 20
  patience: 5
  min_improvement: 0.02
```

**Why:** Adapts to evolution dynamics, saves tokens.

### 3. Monitor First Few Triggers

- Watch service logs for first 2-3 triggers
- Verify agent completes successfully
- Check `EVAL_AGENTS.md` quality
- Adjust interval if needed

### 4. Clean Memory Between Experiments

```bash
# Before new experiment
rm -rf old_results_dir/eval_agent_memory
```

**Why:** Prevents cross-contamination of agent insights.

### 5. Keep Service Running

- Start service once, reuse for multiple experiments
- Service maintains state across runs
- Restart only when changing config

---

## 🎯 Example: Complete Workflow

```bash
# ========================================
# Terminal 1: Start EV2 Service
# ========================================
cd /home/tengxiao/pj/ShinkaEvolve

# Edit config if needed
vim eval_agent/ev2_service_config.yaml

# Start service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Service logs:
# 🚀 EV2 Service starting...
# ✅ Service ready on http://0.0.0.0:8765

# ========================================
# Terminal 2: Run Evolution
# ========================================
cd /home/tengxiao/pj/ShinkaEvolve

# Create experiment script
cat > my/experiment_with_eval_service.py << 'EOF'
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

evo_config = EvolutionConfig(
    num_generations=50,
    max_parallel_jobs=4,
    eval_service_url="http://localhost:8765",  # Enable service
    # ... other configs ...
)

# job_config and db_config: your usual LocalJobConfig / DatabaseConfig instances
runner = EvolutionRunner(
    evo_config=evo_config,
    job_config=job_config,
    db_config=db_config
)
runner.run()
EOF

# Run evolution
uv run my/experiment_with_eval_service.py

# Evolution logs:
# Generation 1/50 completed...
# Generation 5/50 completed...
# (Service notified automatically)

# ========================================
# Terminal 3: Monitor (Optional)
# ========================================

# Check service status
watch -n 5 'curl -s http://localhost:8765/api/v1/status | jq'

# Watch agent output
watch -n 10 'tail -20 results_dir/eval_agent_memory/EVAL_AGENTS.md'

# ========================================
# After Evolution: Review Results
# ========================================

# View all agent analyses
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# Check generated metrics
cat results_dir/eval_agent_memory/aux_metrics.py

# Service statistics
curl http://localhost:8765/api/v1/status | jq
```

---

## 📚 Additional Resources

- **Integration Plan**: `eval_agent/INTEGRATION_PLAN.md`
- **Service Design**: `eval_agent/design_draft/HYBRID_EVAL_SERVICE_DESIGN.md`
- **System Prompt**: `eval_agent/ev2_prompt.j2`
- **API Documentation**: Visit `http://localhost:8765/docs` when service is running

---

## ❓ FAQ

### Q: Does this slow down evolution?

**A:** No. Notifications are fire-and-forget with a 1-second timeout. Impact < 5ms per generation.

### Q: What if the service crashes?

**A:** Evolution continues unaffected. Service notifications fail silently (debug logs only).

### Q: Can I disable the service mid-evolution?

**A:** Yes. Just stop the service. Evolution won't be affected.

### Q: How much does the agent cost (tokens)?

**A:** Depends on trigger frequency and LLM model. Example:

- Gemini Flash: ~$0.01-0.05 per trigger
- Trigger every 10 gens: 10 triggers in a 100-gen run, so ~$0.10-0.50 in total

### Q: Can I use this with SLURM jobs?

**A:** Yes. Make sure the service URL is accessible from compute nodes.

### Q: Multiple experiments, one service?

**A:** Yes! Service maintains separate state per `results_dir`.
---

**Ready to evolve smarter?** 🚀

Start the service and add one line to your config:

```python
eval_service_url="http://localhost:8765"
```
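
### Appendix: Trigger Strategies as a Decision Function

The three trigger strategies described under Service Configuration can be read as one small decision function. The sketch below is an illustration under stated assumptions, not the service's real logic: `should_trigger` and its arguments are hypothetical names, a higher score is assumed to be better, and "plateau" is interpreted as "the best score over the last `patience` generations improved by less than `min_improvement`".

```python
def should_trigger(strategy, generation, last_trigger_generation, best_scores):
    """Decide whether to run the agent after `generation`.

    strategy: dict with "type" plus "interval", "patience", and
        "min_improvement" as required by that type (same keys as the YAML).
    last_trigger_generation: generation of the previous trigger (0 if none).
    best_scores: best primary score per generation so far, oldest first.
    """
    kind = strategy["type"]
    since_last = generation - last_trigger_generation
    periodic_due = False
    plateau_due = False

    if kind in ("periodic", "mixed"):
        # Fire once `interval` generations have passed since the last trigger.
        periodic_due = since_last >= strategy["interval"]

    if kind in ("plateau", "mixed"):
        patience = strategy["patience"]
        # Need one baseline score plus `patience` newer scores to compare.
        if since_last >= patience and len(best_scores) > patience:
            baseline = best_scores[-patience - 1]
            recent_best = max(best_scores[-patience:])
            plateau_due = (recent_best - baseline) < strategy["min_improvement"]

    return periodic_due or plateau_due
```

With `{"type": "mixed", "interval": 10, "patience": 3, "min_improvement": 0.01}`, this fires early when scores stall for three generations and otherwise no later than ten generations after the previous trigger, which matches the "Max 10 gens between triggers" comment in the mixed config.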