EV2 Eval Service - Usage Guide

📖 Overview

The EV2 Eval Service is now integrated into ShinkaEvolve as an optional, non-blocking feature. It provides:

  • ✅ Dynamic metric evolution during code evolution
  • ✅ Persistent memory across evolution runs
  • ✅ Real-time supervision of evolution progress
  • ✅ Autonomous decision-making (when to trigger analysis)
  • ✅ Zero impact on evolution if disabled or unavailable

🚀 Quick Start

Step 1: Start the EV2 Service

In a separate terminal:

cd /path/to/ShinkaEvolve  # your local ShinkaEvolve checkout

# Start the service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

The service will start on http://localhost:8765 by default.

Step 2: Enable in Your Experiment

Method A: Python Code

from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

# Create evolution config with eval service enabled
evolution_config = EvolutionConfig(
    num_generations=100,
    max_parallel_jobs=4,
    # ... other configs ...
    
    # Enable EV2 Eval Service
    eval_service_url="http://localhost:8765"
)

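# Run evolution as usual. job_config (a LocalJobConfig) and db_config
# (a DatabaseConfig) are assumed to be built as in any other ShinkaEvolve experiment.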
runner = EvolutionRunner(
    evo_config=evolution_config,
    job_config=job_config,
    db_config=db_config
)

runner.run()  # Service will be notified automatically

Method B: YAML Config

# experiment_config.yaml
evolution:
  num_generations: 100
  max_parallel_jobs: 4
  # ... other configs ...
  
  # Enable EV2 Eval Service
  eval_service_url: "http://localhost:8765"

Then load it in your script:

import yaml
from shinka.core import EvolutionConfig

with open("experiment_config.yaml") as f:
    config_data = yaml.safe_load(f)

evo_config = EvolutionConfig(**config_data["evolution"])

Step 3: Run Evolution

uv run my/experiment_script.py

What happens:

  1. Evolution runs normally
  2. After each generation, ShinkaEvolve notifies the service
  3. Service autonomously decides when to trigger agent analysis
  4. Agent generates auxiliary metrics (stored in results_dir/eval_agent_memory/)
  5. Evolution continues unaffected
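
The notification in step 2 can be pictured as a small HTTP POST that never raises into the evolution loop. The sketch below is only an illustration, not ShinkaEvolve's actual code: the helper names `build_payload` and `notify_generation_complete` are hypothetical, the endpoint path and payload fields are taken from the manual-trigger example in this guide, and the 1-second timeout is the value quoted in the FAQ.

```python
import json
import urllib.request


def build_payload(generation: int, results_dir: str, primary_score: float) -> dict:
    """Assemble the three fields the service receives (hypothetical helper)."""
    return {
        "generation": generation,
        "results_dir": results_dir,
        "primary_score": primary_score,
    }


def notify_generation_complete(base_url: str, payload: dict, timeout: float = 1.0) -> bool:
    """POST the payload; return False instead of raising if the service is unreachable."""
    request = urllib.request.Request(
        url=f"{base_url}/api/v1/notify/generation_complete",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Connection refused, timeouts, and HTTP errors all derive from OSError.
        return False
```

Swallowing the error and returning a boolean is what lets step 5 hold: a missing or slow service costs at most the timeout and never interrupts the run.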

📁 File Locations

Service Configuration

  • Config: eval_agent/ev2_service_config.yaml
  • Service: eval_agent/ev2_service_standalone.py
  • System Prompt: eval_agent/ev2_prompt.j2

Agent Output

During evolution, the agent creates:

results_dir/
  └── eval_agent_memory/
      ├── EVAL_AGENTS.md          # Analysis reports
      ├── aux_metrics.py          # Generated auxiliary metrics
      └── workspace/              # Agent workspace

⚙️ Service Configuration

Edit eval_agent/ev2_service_config.yaml:

# Trigger Strategy
trigger_strategy:
  type: "periodic"     # or "plateau" or "mixed"
  interval: 5          # Trigger every N generations
  patience: 3          # For plateau detection
  min_improvement: 0.01

# LLM Configuration
llm:
  model: "vertex_ai/gemini-2.5-flash"
  api_key_env: "LLM_API_KEY"
  base_url_env: "LLM_BASE_URL"

# Service Settings
service:
  host: "0.0.0.0"
  port: 8765

Trigger Strategies

1. Periodic (Default)

  • Triggers every N generations
  • Simple and predictable
  • Best for: Regular monitoring

trigger_strategy:
  type: "periodic"
  interval: 5  # Every 5 generations

2. Plateau Detection

  • Triggers when improvement stagnates
  • Adaptive and efficient
  • Best for: Long runs with varying progress

trigger_strategy:
  type: "plateau"
  patience: 3           # Wait 3 gens without improvement
  min_improvement: 0.01 # Threshold for "improvement"

3. Mixed (Recommended)

  • Combines periodic + plateau
  • Balanced approach
  • Best for: Production use

trigger_strategy:
  type: "mixed"
  interval: 10         # Max 10 gens between triggers
  patience: 3
  min_improvement: 0.01
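
To make the three strategies concrete, here is one way the trigger decision could be written. This is a hedged sketch, not the service's real implementation: the class name `TriggerPolicy` and its bookkeeping fields are hypothetical, and it assumes that a higher `primary_score` is better and that "every N generations" means N generations since the last trigger.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerPolicy:
    """Hypothetical trigger policy mirroring the trigger_strategy config keys."""
    type: str = "periodic"          # "periodic", "plateau", or "mixed"
    interval: int = 5               # generations between periodic triggers
    patience: int = 3               # stagnant generations tolerated
    min_improvement: float = 0.01   # gain that counts as progress
    best_score: Optional[float] = None
    stale_generations: int = 0
    last_trigger: int = 0

    def should_trigger(self, generation: int, score: float) -> bool:
        # Update stagnation tracking (assumes higher scores are better).
        if self.best_score is None or score - self.best_score >= self.min_improvement:
            self.best_score = score
            self.stale_generations = 0
        else:
            self.stale_generations += 1

        periodic = generation - self.last_trigger >= self.interval
        plateau = self.stale_generations >= self.patience

        if self.type == "periodic":
            fire = periodic
        elif self.type == "plateau":
            fire = plateau
        elif self.type == "mixed":
            fire = periodic or plateau
        else:
            raise ValueError(f"unknown trigger type: {self.type!r}")

        if fire:
            # Reset both clocks so the next trigger is measured from here.
            self.last_trigger = generation
            self.stale_generations = 0
        return fire
```

Under these assumptions, "mixed" fires as soon as either condition holds, so `interval` acts as the upper bound on the gap between triggers, matching the "Max 10 gens between triggers" comment.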

🔌 API Endpoints

Check Service Status

curl http://localhost:8765/api/v1/status

Response:

{
  "status": "ready",
  "total_generations": 15,
  "agent_triggered_count": 3,
  "last_generation": 15,
  "last_trigger_generation": 15,
  "service_uptime_seconds": 1234.56
}
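
The same check can be done from Python when a script needs to act on the status. The sketch below is an assumption-laden illustration: `fetch_status` and `summarize_status` are hypothetical helper names, and the field names are taken from the sample response above.

```python
import json
import urllib.request


def fetch_status(base_url: str, timeout: float = 2.0) -> dict:
    """GET /api/v1/status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status", timeout=timeout) as response:
        return json.load(response)


def summarize_status(status: dict) -> str:
    """Condense the status fields into one line for logs."""
    last_trigger = status.get("last_trigger_generation")
    if last_trigger is None:
        gap = "no trigger yet"
    else:
        gap = f"{status['last_generation'] - last_trigger} generations since last trigger"
    return (
        f"{status['status']}: {status['total_generations']} generations seen, "
        f"{status['agent_triggered_count']} agent runs, {gap}"
    )
```

For example, `summarize_status(fetch_status("http://localhost:8765"))` gives a one-line summary while the service is running.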

Manual Trigger (Optional)

Force agent analysis:

curl -X POST http://localhost:8765/api/v1/notify/generation_complete \
  -H "Content-Type: application/json" \
  -d '{
    "generation": 10,
    "results_dir": "/path/to/results",
    "primary_score": 0.85
  }'

🧪 Testing

Test 1: Basic Integration (No Service)

Test backward compatibility:

# Don't start the service
uv run eval_agent/test_integration_basic.py

Expected:

  • ✅ All tests pass
  • ✅ Config works correctly
  • ✅ No errors

Test 2: Service Connectivity

Test with service running:

# Terminal 1: Start service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Terminal 2: Check status
curl http://localhost:8765/api/v1/status

Expected:

{"status": "ready", "total_generations": 0, ...}

Test 3: Simulated Evolution

Test notification flow:

# Terminal 1: Service running (see above)

# Terminal 2: Simulate generations
uv run eval_agent/test_ev2_service.py

Expected:

  • ✅ Notifications sent successfully
  • ✅ Agent triggered based on strategy
  • ✅ EVAL_AGENTS.md created in results directory

🐛 Troubleshooting

Service Not Responding

Symptom:

Failed to notify eval service: Connection refused

Solution:

  1. Check if service is running: curl http://localhost:8765/api/v1/status
  2. If it is not running, check whether another process already holds the port: netstat -tuln | grep 8765
  3. Check service logs for errors

Note: Evolution continues normally even if service is down.

Agent Not Triggering

Symptom: Service receives notifications but agent doesn't run.

Check:

  1. View service logs to see trigger decision
  2. Check trigger strategy config (interval might be too high)
  3. Verify primary_evaluator_path in config

Memory/Workspace Issues

Symptom: Agent fails with file not found errors.

Solution:

# Clean old agent memory
rm -rf results_dir/eval_agent_memory

# Service will create fresh workspace on next trigger

Import Errors

Symptom:

ModuleNotFoundError: No module named 'requests'

Solution: Install missing dependencies:

uv pip install requests fastapi uvicorn pyyaml

📊 Monitoring Evolution

During Evolution

Watch service logs:

# Service terminal shows:
✅ Generation 5 completed (score: 0.75)
🎯 Trigger condition met (periodic: interval=5)
🔄 Agent working...
✅ Agent completed in 45.2s
📊 Analysis saved to eval_agent_memory/EVAL_AGENTS.md

Check agent output:

# View latest analysis
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# View generated metrics
cat results_dir/eval_agent_memory/aux_metrics.py

After Evolution

Service state:

curl http://localhost:8765/api/v1/status | jq

Agent insights:

# Read all analyses
ls -la results_dir/eval_agent_memory/

🔒 Security Considerations

Network Access

  • Service binds to 0.0.0.0 by default (accessible from network)
  • For localhost-only: Change to host: "127.0.0.1" in config
  • The service has no authentication; it assumes a trusted environment

Data Privacy

  • Service only receives: generation number, score, results_dir path
  • No code or sensitive data transmitted
  • All agent memory stored locally

Resource Limits

  • Each agent run can take 30-120 seconds
  • Configure interval to control frequency
  • Monitor disk usage of eval_agent_memory/workspace/

💡 Best Practices

1. Start with Periodic Strategy

trigger_strategy:
  type: "periodic"
  interval: 10  # Not too frequent

Why: Predictable, easy to debug, good baseline.

2. Use Mixed for Long Runs

trigger_strategy:
  type: "mixed"
  interval: 20
  patience: 5
  min_improvement: 0.02

Why: Adapts to evolution dynamics, saves tokens.

3. Monitor First Few Triggers

  • Watch service logs for first 2-3 triggers
  • Verify agent completes successfully
  • Check EVAL_AGENTS.md quality
  • Adjust interval if needed

4. Clean Memory Between Experiments

# Before new experiment
rm -rf old_results_dir/eval_agent_memory

Why: Prevents cross-contamination of agent insights.

5. Keep Service Running

  • Start service once, reuse for multiple experiments
  • Service maintains state across runs
  • Restart only when changing config

🎯 Example: Complete Workflow

# ========================================
# Terminal 1: Start EV2 Service
# ========================================
cd /path/to/ShinkaEvolve  # your local ShinkaEvolve checkout

# Edit config if needed
vim eval_agent/ev2_service_config.yaml

# Start service
uv run eval_agent/ev2_service_standalone.py --config eval_agent/ev2_service_config.yaml

# Service logs:
# 🚀 EV2 Service starting...
# ✅ Service ready on http://0.0.0.0:8765


# ========================================
# Terminal 2: Run Evolution
# ========================================
cd /path/to/ShinkaEvolve  # your local ShinkaEvolve checkout

# Create experiment script
cat > my/experiment_with_eval_service.py << 'EOF'
from shinka.core import EvolutionRunner, EvolutionConfig
from shinka.launch import LocalJobConfig
from shinka.database import DatabaseConfig

evo_config = EvolutionConfig(
    num_generations=50,
    max_parallel_jobs=4,
    eval_service_url="http://localhost:8765",  # Enable service
    # ... other configs ...
)

# job_config (a LocalJobConfig) and db_config (a DatabaseConfig) must be built before this line
runner = EvolutionRunner(evo_config, job_config, db_config)
runner.run()
EOF

# Run evolution
uv run my/experiment_with_eval_service.py

# Evolution logs:
# Generation 1/50 completed...
# Generation 5/50 completed...
# (Service notified automatically)


# ========================================
# Terminal 3: Monitor (Optional)
# ========================================

# Check service status
watch -n 5 'curl -s http://localhost:8765/api/v1/status | jq'

# Watch agent output
watch -n 10 'tail -20 results_dir/eval_agent_memory/EVAL_AGENTS.md'


# ========================================
# After Evolution: Review Results
# ========================================

# View all agent analyses
cat results_dir/eval_agent_memory/EVAL_AGENTS.md

# Check generated metrics
cat results_dir/eval_agent_memory/aux_metrics.py

# Service statistics
curl http://localhost:8765/api/v1/status | jq

📚 Additional Resources

  • Integration Plan: eval_agent/INTEGRATION_PLAN.md
  • Service Design: eval_agent/design_draft/HYBRID_EVAL_SERVICE_DESIGN.md
  • System Prompt: eval_agent/ev2_prompt.j2
  • API Documentation: Visit http://localhost:8765/docs when service is running

❓ FAQ

Q: Does this slow down evolution?

A: No. Notifications are fire-and-forget with a 1-second timeout, so the impact is under 5 ms per generation.

Q: What if the service crashes?

A: Evolution continues unaffected. Service notifications fail silently (debug logs only).

Q: Can I disable the service mid-evolution?

A: Yes. Just stop the service. Evolution won't be affected.

Q: How much does the agent cost (tokens)?

A: Depends on trigger frequency and LLM model. Example:

  • Gemini Flash: ~$0.01-0.05 per trigger
  • Trigger every 10 gens: 10 triggers, so roughly $0.10-0.50 for a 100-generation run
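
The estimate is simple arithmetic: number of triggers times cost per trigger. A small sketch, using this guide's illustrative per-trigger range (not measured prices) and a hypothetical helper name, and assuming a purely periodic strategy:

```python
def estimate_run_cost(num_generations: int, interval: int,
                      cost_low: float, cost_high: float) -> tuple:
    """Return (triggers, low estimate, high estimate) for a periodic strategy."""
    triggers = num_generations // interval
    return triggers, triggers * cost_low, triggers * cost_high


triggers, low, high = estimate_run_cost(100, 10, 0.01, 0.05)
print(f"{triggers} triggers: ${low:.2f} to ${high:.2f}")
```

Plateau or mixed strategies can trigger more or less often than this, so treat the result as a rough bound rather than a bill.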

Q: Can I use this with SLURM jobs?

A: Yes. Make sure the service URL is accessible from compute nodes.

Q: Can multiple experiments share one service?

A: Yes! Service maintains separate state per results_dir.
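
One way to picture per-experiment state is a registry keyed by the normalized `results_dir` path. The sketch below is purely illustrative: `RunState` and `RunRegistry` are hypothetical names, and the real service may track different fields.

```python
import os
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunState:
    """Counters kept for one experiment (hypothetical fields)."""
    total_generations: int = 0
    last_generation: int = 0


class RunRegistry:
    """Keeps one RunState per results_dir so experiments do not mix."""

    def __init__(self) -> None:
        self._runs: Dict[str, RunState] = {}

    def record_generation(self, results_dir: str, generation: int) -> RunState:
        # Normalize so "results/exp_a" and "results/exp_a/" share one entry.
        key = os.path.normpath(results_dir)
        state = self._runs.setdefault(key, RunState())
        state.total_generations += 1
        state.last_generation = generation
        return state
```

With this layout, two experiments that report to the same service only share state if they point at the same results directory.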


Ready to evolve smarter? 🚀

Start the service and add one line to your config:

eval_service_url="http://localhost:8765"