| # EV2 Eval Service - Usage Guide |
|
|
| ## Overview |
|
|
| The EV2 Eval Service is now integrated into ShinkaEvolve as an **optional, non-blocking** feature. It provides: |
|
|
| - ✅ **Dynamic metric evolution** during code evolution |
| - ✅ **Persistent memory** across evolution runs |
| - ✅ **Real-time supervision** of evolution progress |
| - ✅ **Autonomous decision-making** (when to trigger analysis) |
| - ✅ **Zero impact** on evolution if disabled or unavailable |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ### Step 1: Start the EV2 Service |
|
|
| In a separate terminal: |
|
|
| ```bash |
| cd /home/tengxiao/pj/ShinkaEvolve |
| |
| # Start the service |
| uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml |
| ``` |
|
|
| By default the service listens on port 8765 and is reachable at `http://localhost:8765`. |
|
|
| ### Step 2: Enable in Your Experiment |
|
|
| **Method A: Python Code** |
|
|
| ```python |
| from shinka.core import EvolutionRunner, EvolutionConfig |
| from shinka.launch import LocalJobConfig |
| from shinka.database import DatabaseConfig |
| |
| # Create evolution config with eval service enabled |
| evolution_config = EvolutionConfig( |
|     num_generations=100, |
|     max_parallel_jobs=4, |
|     # ... other configs ... |
| |
|     # Enable EV2 Eval Service |
|     eval_service_url="http://localhost:8765", |
| ) |
| |
| # Run evolution as usual |
| # (job_config and db_config are your existing LocalJobConfig and DatabaseConfig instances) |
| runner = EvolutionRunner( |
|     evo_config=evolution_config, |
|     job_config=job_config, |
|     db_config=db_config, |
| ) |
| |
| runner.run() # Service will be notified automatically |
| ``` |
|
|
| **Method B: YAML Config** |
|
|
| ```yaml |
| # experiment_config.yaml |
| evolution: |
|   num_generations: 100 |
|   max_parallel_jobs: 4 |
|   # ... other configs ... |
| |
|   # Enable EV2 Eval Service |
|   eval_service_url: "http://localhost:8765" |
| ``` |
|
|
| Then load it in your script: |
|
|
| ```python |
| import yaml |
| from shinka.core import EvolutionConfig |
| |
| with open("experiment_config.yaml") as f: |
|     config_data = yaml.safe_load(f) |
| |
| evo_config = EvolutionConfig(**config_data["evolution"]) |
| ``` |
|
|
| ### Step 3: Run Evolution |
|
|
| ```bash |
| uv run my/experiment_script.py |
| ``` |
|
|
| **What happens:** |
| 1. Evolution runs normally |
| 2. After each generation, ShinkaEvolve notifies the service |
| 3. Service autonomously decides when to trigger agent analysis |
| 4. Agent generates auxiliary metrics (stored in `results_dir/eval_agent_memory/`) |
| 5. Evolution continues unaffected |
|
|
| --- |
|
|
| ## File Locations |
|
|
| ### Service Configuration |
|
|
| - **Config**: `eval_agent/ev2_service_config.yaml` |
| - **Service**: `eval_agent/ev2_service_standalone.py` |
| - **System Prompt**: `eval_agent/ev2_prompt.j2` |
|
|
| ### Agent Output |
|
|
| During evolution, the agent creates: |
|
|
| ``` |
| results_dir/ |
| └── eval_agent_memory/ |
|     ├── EVAL_AGENTS.md      # Analysis reports |
|     ├── aux_metrics.py      # Generated auxiliary metrics |
|     └── workspace/          # Agent workspace |
| ``` |
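To confirm that the agent has written its output, you can check for the files listed above. This is a minimal sketch: the helper name `agent_outputs_present` is hypothetical, and only the file names come from the layout above.

```python
from pathlib import Path

# File and directory names taken from the layout above.
EXPECTED_OUTPUTS = ("EVAL_AGENTS.md", "aux_metrics.py", "workspace")


def agent_outputs_present(results_dir: str) -> dict:
    """Report which expected agent outputs exist under results_dir (hypothetical helper)."""
    memory_dir = Path(results_dir) / "eval_agent_memory"
    return {name: (memory_dir / name).exists() for name in EXPECTED_OUTPUTS}
```

Before the first trigger, every entry is expected to be `False`, since the agent only creates these files when it runs.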
|
|
| --- |
|
|
| ## Service Configuration |
|
|
| Edit `eval_agent/ev2_service_config.yaml`: |
|
|
| ```yaml |
| # Trigger Strategy |
| trigger_strategy: |
|   type: "periodic"      # or "plateau" or "mixed" |
|   interval: 5           # Trigger every N generations |
|   patience: 3           # For plateau detection |
|   min_improvement: 0.01 |
| |
| # LLM Configuration |
| llm: |
|   model: "vertex_ai/gemini-2.5-flash" |
|   api_key_env: "LLM_API_KEY" |
|   base_url_env: "LLM_BASE_URL" |
| |
| # Service Settings |
| service: |
|   host: "0.0.0.0" |
|   port: 8765 |
| ``` |
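The `api_key_env` and `base_url_env` keys appear to hold the names of environment variables rather than the secrets themselves. Under that assumption, the `llm` section could be resolved roughly as follows. The helper `resolve_llm_settings` is hypothetical and is not taken from the service code.

```python
import os
from typing import Mapping, Optional


def resolve_llm_settings(llm_cfg: Mapping[str, str],
                         env: Optional[Mapping[str, str]] = None) -> dict:
    """Turn the `llm` config section into concrete settings (hypothetical helper).

    Assumes `api_key_env` and `base_url_env` contain environment variable NAMES.
    """
    env = os.environ if env is None else env
    api_key = env.get(llm_cfg["api_key_env"])
    if not api_key:
        raise RuntimeError(
            f"Environment variable {llm_cfg['api_key_env']} is not set"
        )
    return {
        "model": llm_cfg["model"],
        "api_key": api_key,
        # Treating the base URL as optional is an assumption of this sketch.
        "base_url": env.get(llm_cfg.get("base_url_env", ""), None),
    }
```

In practice this means exporting `LLM_API_KEY` (and `LLM_BASE_URL`, if your provider needs it) in the terminal that starts the service.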
|
|
| ### Trigger Strategies |
|
|
| **1. Periodic (Default)** |
| - Triggers every N generations |
| - Simple and predictable |
| - Best for: Regular monitoring |
|
|
| ```yaml |
| trigger_strategy: |
|   type: "periodic" |
|   interval: 5  # Every 5 generations |
| ``` |
|
|
| **2. Plateau Detection** |
| - Triggers when improvement stagnates |
| - Adaptive and efficient |
| - Best for: Long runs with varying progress |
|
|
| ```yaml |
| trigger_strategy: |
|   type: "plateau" |
|   patience: 3            # Wait 3 gens without improvement |
|   min_improvement: 0.01  # Threshold for "improvement" |
| ``` |
|
|
| **3. Mixed (Recommended)** |
| - Combines periodic + plateau |
| - Balanced approach |
| - Best for: Production use |
|
|
| ```yaml |
| trigger_strategy: |
|   type: "mixed" |
|   interval: 10  # Max 10 gens between triggers |
|   patience: 3 |
|   min_improvement: 0.01 |
| ``` |
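The three strategies can be sketched as a single decision function. This illustrates the behaviour described above and is not the service's actual implementation: `TriggerState` and `should_trigger` are hypothetical names, the `strategy` argument stands in for the `type` key, and the sketch assumes that a higher score is better.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriggerState:
    """Bookkeeping for one evolution run (hypothetical structure)."""
    scores: List[float] = field(default_factory=list)
    last_trigger_generation: int = 0


def should_trigger(generation: int, score: float, state: TriggerState, *,
                   strategy: str = "periodic", interval: int = 5,
                   patience: int = 3, min_improvement: float = 0.01) -> bool:
    """Decide whether to run the agent after `generation` (higher score = better)."""
    state.scores.append(score)
    since_last = generation - state.last_trigger_generation

    # Periodic: at least `interval` generations since the last trigger.
    periodic = since_last >= interval

    # Plateau: the best of the last `patience` scores beats the earlier best
    # by less than `min_improvement`, and `patience` generations have passed
    # since the last trigger.
    plateau = False
    if since_last >= patience and len(state.scores) > patience:
        best_before = max(state.scores[:-patience])
        best_recent = max(state.scores[-patience:])
        plateau = best_recent - best_before < min_improvement

    if strategy == "periodic":
        fire = periodic
    elif strategy == "plateau":
        fire = plateau
    elif strategy == "mixed":
        fire = periodic or plateau
    else:
        raise ValueError(f"unknown trigger strategy: {strategy!r}")

    if fire:
        state.last_trigger_generation = generation
    return fire
```

With `strategy="periodic"` and `interval=5`, generations 5 and 10 of a ten-generation run trigger the agent. With `strategy="mixed"`, the periodic check acts as an upper bound on the gap between triggers while the plateau check can fire earlier.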
|
|
| --- |
|
|
| ## API Endpoints |
|
|
| ### Check Service Status |
|
|
| ```bash |
| curl http://localhost:8765/api/v1/status |
| ``` |
|
|
| **Response:** |
| ```json |
| { |
|   "status": "ready", |
|   "total_generations": 15, |
|   "agent_triggered_count": 3, |
|   "last_generation": 15, |
|   "last_trigger_generation": 15, |
|   "service_uptime_seconds": 1234.56 |
| } |
| ``` |
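The same check can be done from Python using only the standard library. The sketch below assumes the response has exactly the fields shown above; `fetch_status` and `summarize_status` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url: str = "http://localhost:8765", timeout: float = 2.0) -> dict:
    """GET /api/v1/status and decode the JSON body."""
    with urlopen(f"{base_url}/api/v1/status", timeout=timeout) as resp:
        return json.load(resp)


def summarize_status(status: dict) -> str:
    """Build a one-line summary from a status response."""
    idle = status["last_generation"] - status["last_trigger_generation"]
    return (
        f"{status['status']}: {status['agent_triggered_count']} agent runs over "
        f"{status['total_generations']} generations, "
        f"{idle} generations since the last trigger"
    )
```

Calling `summarize_status(fetch_status())` while the service is running gives a compact view of how often the agent has been triggered.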
|
|
| ### Manual Trigger (Optional) |
|
|
| Force agent analysis: |
|
|
| ```bash |
| curl -X POST http://localhost:8765/api/v1/notify/generation_complete \ |
|   -H "Content-Type: application/json" \ |
|   -d '{ |
|     "generation": 10, |
|     "results_dir": "/path/to/results", |
|     "primary_score": 0.85 |
|   }' |
| ``` |
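For scripted use, the request above can also be built in Python. This sketch mirrors the `curl` call field by field using the standard library; the helper name `build_notification` is hypothetical.

```python
import json
from urllib.request import Request


def build_notification(generation: int, results_dir: str, primary_score: float,
                       base_url: str = "http://localhost:8765") -> Request:
    """Build the POST request for /api/v1/notify/generation_complete."""
    payload = {
        "generation": generation,
        "results_dir": results_dir,
        "primary_score": primary_score,
    }
    return Request(
        f"{base_url}/api/v1/notify/generation_complete",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=5)` while the service is running.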
|
|
| --- |
|
|
| ## Testing |
|
|
| ### Test 1: Basic Integration (No Service) |
|
|
| Test backward compatibility: |
|
|
| ```bash |
| # Don't start the service |
| uv run eval_agent/test_integration_basic.py |
| ``` |
|
|
| **Expected:** |
| - ✅ All tests pass |
| - ✅ Config works correctly |
| - ✅ No errors |
|
|
| ### Test 2: Service Connectivity |
|
|
| Test with service running: |
|
|
| ```bash |
| # Terminal 1: Start service |
| uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml |
| |
| # Terminal 2: Check status |
| curl http://localhost:8765/api/v1/status |
| ``` |
|
|
| **Expected:** |
| ```json |
| {"status": "ready", "total_generations": 0, ...} |
| ``` |
|
|
| ### Test 3: Simulated Evolution |
|
|
| Test notification flow: |
|
|
| ```bash |
| # Terminal 1: Service running (see above) |
| |
| # Terminal 2: Simulate generations |
| uv run eval_agent/test_ev2_service.py |
| ``` |
|
|
| **Expected:** |
| - ✅ Notifications sent successfully |
| - ✅ Agent triggered based on strategy |
| - ✅ `EVAL_AGENTS.md` created in results directory |
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### Service Not Responding |
|
|
| **Symptom:** |
| ``` |
| Failed to notify eval service: Connection refused |
| ``` |
|
|
| **Solution:** |
| 1. Check whether the service is running: `curl http://localhost:8765/api/v1/status` |
| 2. Check what is listening on the port: `netstat -tuln | grep 8765` (no output means the service is not listening) |
| 3. Check the service logs for errors |
|
|
| **Note:** Evolution continues normally even if service is down. |
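If `curl` or `netstat` is not available, a plain TCP connection attempt answers the same question. This is a minimal sketch with a hypothetical helper name, `is_port_open`; it only tells you that something accepts connections on the port, not that it is the EV2 service.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8765, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

A `False` result usually means the service has not started or is bound to a different host or port than the one in `eval_service_url`.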
|
|
| ### Agent Not Triggering |
|
|
| **Symptom:** Service receives notifications but agent doesn't run. |
|
|
| **Check:** |
| 1. View service logs to see trigger decision |
| 2. Check trigger strategy config (interval might be too high) |
| 3. Verify `primary_evaluator_path` in config |
|
|
| ### Memory/Workspace Issues |
|
|
| **Symptom:** Agent fails with file not found errors. |
|
|
| **Solution:** |
| ```bash |
| # Clean old agent memory |
| rm -rf results_dir/eval_agent_memory |
| |
| # Service will create fresh workspace on next trigger |
| ``` |
|
|
| ### Import Errors |
|
|
| **Symptom:** |
| ``` |
| ModuleNotFoundError: No module named 'requests' |
| ``` |
|
|
| **Solution:** |
| Install missing dependencies: |
| ```bash |
| uv pip install requests fastapi uvicorn pyyaml |
| ``` |
|
|
| --- |
|
|
| ## Monitoring Evolution |
|
|
| ### During Evolution |
|
|
| **Watch service logs:** |
| ```bash |
| # Service terminal shows: |
| ✅ Generation 5 completed (score: 0.75) |
| 🎯 Trigger condition met (periodic: interval=5) |
| Agent working... |
| ✅ Agent completed in 45.2s |
| Analysis saved to eval_agent_memory/EVAL_AGENTS.md |
| ``` |
|
|
| **Check agent output:** |
| ```bash |
| # View latest analysis |
| cat results_dir/eval_agent_memory/EVAL_AGENTS.md |
| |
| # View generated metrics |
| cat results_dir/eval_agent_memory/aux_metrics.py |
| ``` |
|
|
| ### After Evolution |
|
|
| **Service state:** |
| ```bash |
| curl http://localhost:8765/api/v1/status | jq |
| ``` |
|
|
| **Agent insights:** |
| ```bash |
| # Read all analyses |
| ls -la results_dir/eval_agent_memory/ |
| ``` |
|
|
| --- |
|
|
| ## Security Considerations |
|
|
| ### Network Access |
|
|
| - Service binds to `0.0.0.0` by default (accessible from network) |
| - For localhost-only: Change to `host: "127.0.0.1"` in config |
| - No authentication required (trusted environment assumed) |
|
|
| ### Data Privacy |
|
|
| - Service only receives: generation number, score, results_dir path |
| - No code or sensitive data transmitted |
| - All agent memory stored locally |
| |
| ### Resource Limits |
| |
| - Each agent run can take 30-120 seconds |
| - Configure `interval` to control frequency |
| - Monitor disk usage of `eval_agent_memory/workspace/` |
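Disk usage of the workspace can be checked with a few lines of standard-library Python. This is a sketch with a hypothetical helper name, `directory_size_bytes`; `du -sh` gives the same information from the shell.

```python
from pathlib import Path


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path` (0 if it does not exist)."""
    root = Path(path)
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```

For example, `directory_size_bytes("results_dir/eval_agent_memory/workspace")` reports the current footprint of the agent workspace in bytes.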
| |
| --- |
| |
| ## Best Practices
| |
| ### 1. Start with Periodic Strategy |
| |
| ```yaml |
| trigger_strategy:
|   type: "periodic"
|   interval: 10  # Not too frequent
| ``` |
| |
| **Why:** Predictable, easy to debug, good baseline. |
| |
| ### 2. Use Mixed for Long Runs |
| |
| ```yaml |
| trigger_strategy:
|   type: "mixed"
|   interval: 20
|   patience: 5
|   min_improvement: 0.02
| ``` |
| |
| **Why:** Adapts to evolution dynamics, saves tokens. |
| |
| ### 3. Monitor First Few Triggers |
| |
| - Watch service logs for first 2-3 triggers |
| - Verify agent completes successfully |
| - Check `EVAL_AGENTS.md` quality |
| - Adjust interval if needed |
| |
| ### 4. Clean Memory Between Experiments |
| |
| ```bash |
| # Before new experiment |
| rm -rf old_results_dir/eval_agent_memory |
| ``` |
| |
| **Why:** Prevents cross-contamination of agent insights. |
| |
| ### 5. Keep Service Running |
| |
| - Start service once, reuse for multiple experiments |
| - Service maintains state across runs |
| - Restart only when changing config |
| |
| --- |
| |
| ## Example: Complete Workflow
| |
| ```bash |
| # ======================================== |
| # Terminal 1: Start EV2 Service |
| # ======================================== |
| cd /home/tengxiao/pj/ShinkaEvolve |
|
|
| # Edit config if needed |
| vim eval_agent/ev2_service_config.yaml |
| |
| # Start service |
| uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml |
| |
| # Service logs: |
| # EV2 Service starting... |
| # ✅ Service ready on http://0.0.0.0:8765 |
| |
| |
| # ======================================== |
| # Terminal 2: Run Evolution |
| # ======================================== |
| cd /home/tengxiao/pj/ShinkaEvolve |
| |
| # Create experiment script |
| cat > my/experiment_with_eval_service.py << 'EOF' |
| from shinka.core import EvolutionRunner, EvolutionConfig |
| from shinka.launch import LocalJobConfig |
| from shinka.database import DatabaseConfig |
|
|
| evo_config = EvolutionConfig( |
|     num_generations=50, |
|     max_parallel_jobs=4, |
|     eval_service_url="http://localhost:8765",  # Enable service |
|     # ... other configs ... |
| ) |
| |
| runner = EvolutionRunner(evo_config, job_config, db_config) |
| runner.run() |
| EOF |
| |
| # Run evolution |
| uv run my/experiment_with_eval_service.py |
|
|
| # Evolution logs: |
| # Generation 1/50 completed... |
| # Generation 5/50 completed... |
| # (Service notified automatically) |
|
|
|
|
| # ======================================== |
| # Terminal 3: Monitor (Optional) |
| # ======================================== |
|
|
| # Check service status |
| watch -n 5 'curl -s http://localhost:8765/api/v1/status | jq' |
|
|
| # Watch agent output |
| watch -n 10 'tail -20 results_dir/eval_agent_memory/EVAL_AGENTS.md' |
|
|
|
|
| # ======================================== |
| # After Evolution: Review Results |
| # ======================================== |
|
|
| # View all agent analyses |
| cat results_dir/eval_agent_memory/EVAL_AGENTS.md |
|
|
| # Check generated metrics |
| cat results_dir/eval_agent_memory/aux_metrics.py |
|
|
| # Service statistics |
| curl http://localhost:8765/api/v1/status | jq |
| ``` |
| |
| --- |
| |
| ## Additional Resources
| |
| - **Integration Plan**: `eval_agent/INTEGRATION_PLAN.md` |
| - **Service Design**: `eval_agent/design_draft/HYBRID_EVAL_SERVICE_DESIGN.md` |
| - **System Prompt**: `eval_agent/ev2_prompt.j2` |
| - **API Documentation**: Visit `http://localhost:8765/docs` when service is running |
| |
| --- |
| |
| ## FAQ
| |
| ### Q: Does this slow down evolution? |
| |
| **A:** No. Notifications are fire-and-forget with a 1-second timeout. The impact is under 5 ms per generation.
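A fire-and-forget notification can be sketched as follows. This illustrates the pattern and is not ShinkaEvolve's actual code: the helper name `notify_fire_and_forget` is hypothetical, and running the request in a daemon thread is one possible way to keep the caller from waiting.

```python
import json
import threading
from urllib.request import Request, urlopen


def notify_fire_and_forget(url: str, payload: dict, timeout: float = 1.0) -> threading.Thread:
    """POST `payload` as JSON in a background thread and ignore any failure."""

    def _send() -> None:
        req = Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urlopen(req, timeout=timeout):
                pass
        except Exception:
            # Service down, slow, or unreachable: swallow the error so the
            # evolution loop is never interrupted.
            pass

    thread = threading.Thread(target=_send, daemon=True)
    thread.start()
    return thread
```

The caller returns as soon as the thread has started, so an unreachable service costs the evolution loop only the time needed to spawn a thread.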
| |
| ### Q: What if the service crashes? |
| |
| **A:** Evolution continues unaffected. Service notifications fail silently (debug logs only). |
| |
| ### Q: Can I disable the service mid-evolution? |
| |
| **A:** Yes. Just stop the service. Evolution won't be affected. |
| |
| ### Q: How much does the agent cost (tokens)? |
| |
| **A:** Depends on trigger frequency and LLM model. Example: |
| - Gemini Flash: ~$0.01-0.05 per trigger
| - Trigger every 10 gens: 10 triggers, so ~$0.10-0.50 for a 100-gen run
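The estimate is simple arithmetic: the number of triggers is the number of generations divided by the interval, multiplied by the per-trigger cost. The sketch below assumes a purely periodic strategy and uses the per-trigger range quoted above; `estimate_agent_cost` is a hypothetical helper.

```python
from typing import Tuple


def estimate_agent_cost(num_generations: int, interval: int,
                        cost_per_trigger: Tuple[float, float] = (0.01, 0.05)
                        ) -> Tuple[float, float]:
    """Return the (low, high) cost in dollars for a periodic trigger strategy."""
    triggers = num_generations // interval
    low, high = cost_per_trigger
    return (triggers * low, triggers * high)
```

For a 100-generation run with `interval=10`, this gives 10 triggers and a range of roughly $0.10 to $0.50. Plateau or mixed strategies can trigger more or less often, so treat the result as a rough guide.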
| |
| ### Q: Can I use this with SLURM jobs? |
| |
| **A:** Yes. Make sure the service URL is accessible from compute nodes. |
| |
| ### Q: Multiple experiments, one service? |
| |
| **A:** Yes! Service maintains separate state per `results_dir`. |
| |
| --- |
| |
| **Ready to evolve smarter?**
| |
| Start the service and add one line to your config: |
| ```python |
| eval_service_url="http://localhost:8765" |
| ``` |
| |