EV2 Eval Service - Usage Guide
Overview
The EV2 Eval Service is now integrated into ShinkaEvolve as an optional, non-blocking feature. It provides:
- Dynamic metric evolution during code evolution
- Persistent memory across evolution runs
- Real-time supervision of evolution progress
- Autonomous decision-making (when to trigger analysis)
- Zero impact on evolution if disabled or unavailable
Quick Start
Step 1: Start the EV2 Service
In a separate terminal:
cd /home/tengxiao/pj/ShinkaEvolve
# Start the service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml
The service will start on http://localhost:8765 by default.
Step 2: Enable in Your Experiment
Method A: Python Code
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig
# Create evolution config with eval service enabled
evolution_config = EvolutionConfig(
    num_generations=100,
    max_parallel_jobs=4,
    # ... other configs ...
    # Enable EV2 Eval Service
    eval_service_url="http://localhost:8765",
)
# Run evolution as usual.
# job_config (a LocalJobConfig) and db_config (a DatabaseConfig) are assumed
# to be defined as in your existing experiment script.
runner = EvolutionRunner(
    evo_config=evolution_config,
    job_config=job_config,
    db_config=db_config,
)
runner.run()  # Service will be notified automatically
Method B: YAML Config
# experiment_config.yaml
evolution:
  num_generations: 100
  max_parallel_jobs: 4
  # ... other configs ...
  # Enable EV2 Eval Service
  eval_service_url: "http://localhost:8765"
Then load it in your script:
import yaml
from shinka.core import EvolutionConfig
with open("experiment_config.yaml") as f:
    config_data = yaml.safe_load(f)
evo_config = EvolutionConfig(**config_data["evolution"])
Step 3: Run Evolution
uv run my/experiment_script.py
What happens:
- Evolution runs normally
- After each generation, ShinkaEvolve notifies the service
- Service autonomously decides when to trigger agent analysis
- Agent generates auxiliary metrics (stored in results_dir/eval_agent_memory/)
- Evolution continues unaffected
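The notification step above can be pictured as a short, best-effort HTTP POST. The sketch below is only an illustration, not ShinkaEvolve's actual client: the helper name `notify_generation_complete` is hypothetical, the endpoint path and payload fields are taken from the manual-trigger example in this guide, and the 1-second timeout mirrors the figure quoted in the FAQ.

```python
import json
import urllib.error
import urllib.request


def notify_generation_complete(base_url, generation, results_dir,
                               primary_score, timeout=1.0):
    """Best-effort POST to the eval service (hypothetical helper).

    Returns True on an HTTP 2xx response and False on any failure.
    It never raises, so an unreachable service cannot stop evolution.
    """
    payload = json.dumps({
        "generation": generation,
        "results_dir": results_dir,
        "primary_score": primary_score,
    }).encode("utf-8")
    try:
        request = urllib.request.Request(
            base_url.rstrip("/") + "/api/v1/notify/generation_complete",
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False
```

Swallowing every network error is deliberate here: the guide describes the service as optional, so a failed notification is treated as a non-event rather than an exception.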
File Locations
Service Configuration
- Config: eval_agent/ev2_service_config.yaml
- Service: eval_agent/ev2_service_standalone.py
- System Prompt: eval_agent/ev2_prompt.j2
Agent Output
During evolution, the agent creates:
results_dir/
└── eval_agent_memory/
    ├── EVAL_AGENTS.md    # Analysis reports
    ├── aux_metrics.py    # Generated auxiliary metrics
    └── workspace/        # Agent workspace
Service Configuration
Edit eval_agent/ev2_service_config.yaml:
# Trigger Strategy
trigger_strategy:
  type: "periodic" # or "plateau" or "mixed"
  interval: 5 # Trigger every N generations
  patience: 3 # For plateau detection
  min_improvement: 0.01
# LLM Configuration
llm:
  model: "vertex_ai/gemini-2.5-flash"
  api_key_env: "LLM_API_KEY"
  base_url_env: "LLM_BASE_URL"
# Service Settings
service:
  host: "0.0.0.0"
  port: 8765
Trigger Strategies
1. Periodic (Default)
- Triggers every N generations
- Simple and predictable
- Best for: Regular monitoring
trigger_strategy:
  type: "periodic"
  interval: 5 # Every 5 generations
2. Plateau Detection
- Triggers when improvement stagnates
- Adaptive and efficient
- Best for: Long runs with varying progress
trigger_strategy:
  type: "plateau"
  patience: 3 # Wait 3 gens without improvement
  min_improvement: 0.01 # Threshold for "improvement"
3. Mixed (Recommended)
- Combines periodic + plateau
- Balanced approach
- Best for: Production use
trigger_strategy:
  type: "mixed"
  interval: 10 # Max 10 gens between triggers
  patience: 3
  min_improvement: 0.01
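One plausible reading of the three strategies is sketched below. It is an assumption about the semantics, not the service's actual implementation: `TriggerConfig`, `is_plateau`, and `should_trigger` are hypothetical names, and "plateau" is interpreted here as "the best score of the last `patience` generations beats the earlier best by less than `min_improvement`".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TriggerConfig:
    """Mirrors the trigger_strategy keys from the YAML config (hypothetical class)."""
    type: str = "periodic"
    interval: int = 5
    patience: int = 3
    min_improvement: float = 0.01


def is_plateau(scores: Sequence[float], patience: int, min_improvement: float) -> bool:
    """True if the last `patience` scores improved on the earlier best by < min_improvement."""
    if len(scores) <= patience:
        return False  # Not enough history to compare against.
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return (best_recent - best_before) < min_improvement


def should_trigger(cfg: TriggerConfig, generation: int,
                   last_trigger_generation: int, scores: Sequence[float]) -> bool:
    """Decide whether to run the agent after `generation` (assumed semantics)."""
    periodic_due = (generation - last_trigger_generation) >= cfg.interval
    plateau = is_plateau(scores, cfg.patience, cfg.min_improvement)
    if cfg.type == "periodic":
        return periodic_due
    if cfg.type == "plateau":
        return plateau
    if cfg.type == "mixed":
        # Fire on a plateau, or when `interval` generations have passed anyway.
        return periodic_due or plateau
    raise ValueError(f"unknown trigger strategy: {cfg.type!r}")
```

Under this reading, `interval` in the mixed strategy acts as an upper bound on the gap between triggers, which matches the "Max 10 gens between triggers" comment in the config.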
API Endpoints
Check Service Status
curl http://localhost:8765/api/v1/status
Response:
{
  "status": "ready",
  "total_generations": 15,
  "agent_triggered_count": 3,
  "last_generation": 15,
  "last_trigger_generation": 15,
  "service_uptime_seconds": 1234.56
}
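The same check can be made from Python, for example before launching a run. This is a minimal sketch using only the standard library; `fetch_status` and `generations_since_last_trigger` are hypothetical helper names, and the field names are the ones shown in the sample response above.

```python
import json
import urllib.error
import urllib.request


def fetch_status(base_url="http://localhost:8765", timeout=2.0):
    """Return the parsed status JSON, or None if the service cannot be reached."""
    try:
        url = base_url.rstrip("/") + "/api/v1/status"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Service down, timeout, or a response that is not valid JSON.
        return None


def generations_since_last_trigger(status):
    """Generations elapsed since the agent last ran, per the status payload."""
    return status["last_generation"] - status["last_trigger_generation"]


if __name__ == "__main__":
    status = fetch_status()
    if status is None:
        print("EV2 service not reachable; evolution would run without it.")
    else:
        print("status:", status["status"],
              "| generations since last trigger:",
              generations_since_last_trigger(status))
```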
Manual Trigger (Optional)
Force agent analysis:
curl -X POST http://localhost:8765/api/v1/notify/generation_complete \
-H "Content-Type: application/json" \
-d '{
  "generation": 10,
  "results_dir": "/path/to/results",
  "primary_score": 0.85
}'
Testing
Test 1: Basic Integration (No Service)
Test backward compatibility:
# Don't start the service
uv run eval_agent/test_integration_basic.py
Expected:
- All tests pass
- Config works correctly
- No errors
Test 2: Service Connectivity
Test with service running:
# Terminal 1: Start service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml
# Terminal 2: Check status
curl http://localhost:8765/api/v1/status
Expected:
{"status": "ready", "total_generations": 0, ...}
Test 3: Simulated Evolution
Test notification flow:
# Terminal 1: Service running (see above)
# Terminal 2: Simulate generations
uv run eval_agent/test_ev2_service.py
Expected:
- Notifications sent successfully
- Agent triggered based on strategy
- EVAL_AGENTS.md created in results directory
Troubleshooting
Service Not Responding
Symptom:
Failed to notify eval service: Connection refused
Solution:
- Check if service is running: curl http://localhost:8765/api/v1/status
- Verify port is not in use: netstat -tuln | grep 8765
- Check service logs for errors
Note: Evolution continues normally even if service is down.
Agent Not Triggering
Symptom: Service receives notifications but agent doesn't run.
Check:
- View service logs to see trigger decision
- Check trigger strategy config (interval might be too high)
- Verify primary_evaluator_path in config
Memory/Workspace Issues
Symptom: Agent fails with file not found errors.
Solution:
# Clean old agent memory
rm -rf results_dir/eval_agent_memory
# Service will create fresh workspace on next trigger
Import Errors
Symptom:
ModuleNotFoundError: No module named 'requests'
Solution: Install missing dependencies:
uv pip install requests fastapi uvicorn pyyaml
Monitoring Evolution
During Evolution
Watch service logs:
# Service terminal shows:
Generation 5 completed (score: 0.75)
Trigger condition met (periodic: interval=5)
Agent working...
Agent completed in 45.2s
Analysis saved to eval_agent_memory/EVAL_AGENTS.md
Check agent output:
# View latest analysis
cat results_dir/eval_agent_memory/EVAL_AGENTS.md
# View generated metrics
cat results_dir/eval_agent_memory/aux_metrics.py
After Evolution
Service state:
curl http://localhost:8765/api/v1/status | jq
Agent insights:
# Read all analyses
ls -la results_dir/eval_agent_memory/
Security Considerations
Network Access
- Service binds to 0.0.0.0 by default (accessible from network)
- For localhost-only: Change to host: "127.0.0.1" in config
- No authentication required (trusted environment assumed)
Data Privacy
- Service only receives: generation number, score, results_dir path
- No code or sensitive data transmitted
- All agent memory stored locally
Resource Limits
- Each agent run can take 30-120 seconds
- Configure interval to control frequency
- Monitor disk usage of eval_agent_memory/workspace/
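Workspace disk usage can be checked with a few lines of standard-library Python. This is a generic sketch rather than part of the service; `directory_size_bytes` is a hypothetical helper, and the path in the usage example is the placeholder layout used throughout this guide.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (0 if root does not exist)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    workspace = "results_dir/eval_agent_memory/workspace"
    size_mb = directory_size_bytes(workspace) / (1024 * 1024)
    print(f"{workspace}: {size_mb:.1f} MiB")
```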
Best Practices
1. Start with Periodic Strategy
trigger_strategy:
  type: "periodic"
  interval: 10 # Not too frequent
Why: Predictable, easy to debug, good baseline.
2. Use Mixed for Long Runs
trigger_strategy:
  type: "mixed"
  interval: 20
  patience: 5
  min_improvement: 0.02
Why: Adapts to evolution dynamics, saves tokens.
3. Monitor First Few Triggers
- Watch service logs for first 2-3 triggers
- Verify agent completes successfully
- Check EVAL_AGENTS.md quality
- Adjust interval if needed
4. Clean Memory Between Experiments
# Before new experiment
rm -rf old_results_dir/eval_agent_memory
Why: Prevents cross-contamination of agent insights.
5. Keep Service Running
- Start service once, reuse for multiple experiments
- Service maintains state across runs
- Restart only when changing config
Example: Complete Workflow
# ========================================
# Terminal 1: Start EV2 Service
# ========================================
cd /home/tengxiao/pj/ShinkaEvolve
# Edit config if needed
vim eval_agent/ev2_service_config.yaml
# Start service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml
# Service logs:
# EV2 Service starting...
# Service ready on http://0.0.0.0:8765
# ========================================
# Terminal 2: Run Evolution
# ========================================
cd /home/tengxiao/pj/ShinkaEvolve
# Create experiment script
cat > my/experiment_with_eval_service.py << 'EOF'
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig
evo_config = EvolutionConfig(
    num_generations=50,
    max_parallel_jobs=4,
    eval_service_url="http://localhost:8765",  # Enable service
    # ... other configs ...
)
# job_config and db_config are assumed to be defined as in your existing script
runner = EvolutionRunner(evo_config, job_config, db_config)
runner.run()
EOF
# Run evolution
uv run my/experiment_with_eval_service.py
# Evolution logs:
# Generation 1/50 completed...
# Generation 5/50 completed...
# (Service notified automatically)
# ========================================
# Terminal 3: Monitor (Optional)
# ========================================
# Check service status
watch -n 5 'curl -s http://localhost:8765/api/v1/status | jq'
# Watch agent output
watch -n 10 'tail -20 results_dir/eval_agent_memory/EVAL_AGENTS.md'
# ========================================
# After Evolution: Review Results
# ========================================
# View all agent analyses
cat results_dir/eval_agent_memory/EVAL_AGENTS.md
# Check generated metrics
cat results_dir/eval_agent_memory/aux_metrics.py
# Service statistics
curl http://localhost:8765/api/v1/status | jq
Additional Resources
- Integration Plan: eval_agent/INTEGRATION_PLAN.md
- Service Design: eval_agent/design_draft/HYBRID_EVAL_SERVICE_DESIGN.md
- System Prompt: eval_agent/ev2_prompt.j2
- API Documentation: Visit http://localhost:8765/docs when service is running
FAQ
Q: Does this slow down evolution?
A: No. Notifications are fire-and-forget with 1-second timeout. Impact < 5ms per generation.
Q: What if the service crashes?
A: Evolution continues unaffected. Service notifications fail silently (debug logs only).
Q: Can I disable the service mid-evolution?
A: Yes. Just stop the service. Evolution won't be affected.
Q: How much does the agent cost (tokens)?
A: Depends on trigger frequency and LLM model. Example:
- Gemini Flash: ~$0.01-0.05 per trigger
- Trigger every 10 gens: 10 triggers in a 100-gen run, so ~$0.10-0.50 in total
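The estimate above is simple arithmetic and can be adapted to other run lengths. The sketch below assumes a purely periodic strategy and uses the per-trigger range quoted above; `estimate_agent_cost` is a hypothetical helper, and real costs depend on the model and on how much the agent reads and writes per trigger.

```python
def estimate_agent_cost(num_generations, interval, cost_per_trigger):
    """Return (number_of_triggers, total_cost) for a purely periodic strategy."""
    triggers = num_generations // interval
    return triggers, triggers * cost_per_trigger


if __name__ == "__main__":
    # 100 generations, one trigger every 10 generations.
    triggers, low = estimate_agent_cost(100, 10, 0.01)
    _, high = estimate_agent_cost(100, 10, 0.05)
    print(f"{triggers} triggers, about ${low:.2f} to ${high:.2f}")
```

With plateau or mixed strategies the trigger count depends on how the scores evolve, so this figure is only a baseline for the periodic case.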
Q: Can I use this with SLURM jobs?
A: Yes. Make sure the service URL is accessible from compute nodes.
Q: Multiple experiments, one service?
A: Yes! Service maintains separate state per results_dir.
Ready to evolve smarter?
Start the service and add one line to your config:
eval_service_url="http://localhost:8765"