Step 1: Testing Guide for EV2 Service

🎯 What We Built

A minimal HTTP service wrapper around ev2.py that:

✅ Receives generation completion notifications
✅ Autonomously decides when to trigger the EV2 agent
✅ Maintains persistent state across generations
✅ Requires minimal changes to ShinkaEvolve

📋 File Overview

eval_agent/
├── ev2_service.py              # The HTTP service (NEW)
├── ev2_service_config.yaml     # Configuration file (NEW)
├── test_ev2_service.py         # Test script (NEW)
├── ev2.py                      # Original agent logic (UNCHANGED)
└── ev2_prompt.j2               # Agent prompt (UNCHANGED)

🚀 Step-by-Step Testing

Step 1: Install Dependencies

cd /home/tengxiao/pj/ShinkaEvolve
source venv/bin/activate

# Install FastAPI, Uvicorn, and PyYAML
pip install fastapi uvicorn pyyaml

Step 2: Configure the Service

Edit eval_agent/ev2_service_config.yaml if needed:

experiment:
  results_dir: "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215"
  primary_evaluator: "examples/circle_packing/evaluate_ori.py"

strategy:
  trigger_mode: "periodic"  # Options: always, periodic, plateau, mixed
  trigger_interval: 10       # Run agent every 10 generations

Trigger Modes:

always: Run agent every generation (for testing)
periodic: Run every N generations
plateau: Run when score plateaus
mixed: Run on periodic OR plateau (whichever comes first)
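A typo in trigger_mode is easy to miss until the agent never fires, so it can help to validate the strategy section right after loading it. The sketch below is an assumption about how that could look, not the code in ev2_service.py: `StrategyConfig`, `parse_strategy`, and the default values are hypothetical names introduced here, and the input dict stands for the `strategy` mapping you would get from `yaml.safe_load`.

```python
from dataclasses import dataclass

# The four modes listed in this guide.
VALID_TRIGGER_MODES = {"always", "periodic", "plateau", "mixed"}


@dataclass(frozen=True)
class StrategyConfig:
    """Hypothetical typed view of the `strategy` section."""
    trigger_mode: str = "periodic"   # assumed default
    trigger_interval: int = 10       # assumed default


def parse_strategy(raw: dict) -> StrategyConfig:
    """Validate the `strategy` mapping and return a typed config."""
    mode = raw.get("trigger_mode", "periodic")
    interval = raw.get("trigger_interval", 10)
    if mode not in VALID_TRIGGER_MODES:
        raise ValueError(
            f"Unknown trigger_mode {mode!r}; expected one of {sorted(VALID_TRIGGER_MODES)}"
        )
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int) or interval < 1:
        raise ValueError(f"trigger_interval must be a positive integer, got {interval!r}")
    return StrategyConfig(trigger_mode=mode, trigger_interval=interval)
```

Failing at startup with a clear message is usually cheaper than discovering a silent misconfiguration after a long run.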

Step 3: Start the Service

Terminal 1 (Service):

cd /home/tengxiao/pj/ShinkaEvolve
source venv/bin/activate

# Start the service
python eval_agent/ev2_service.py --config eval_agent/ev2_service_config.yaml

Expected output:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
2026-02-02 15:30:00 - __main__ - INFO - 🚀 Starting EV2 Evaluation Service...
2026-02-02 15:30:00 - __main__ - INFO - ✅ Service started
2026-02-02 15:30:00 - __main__ - INFO -    Experiment: circle_packing_NO_vision
2026-02-02 15:30:00 - __main__ - INFO -    Trigger mode: periodic
2026-02-02 15:30:00 - __main__ - INFO -    Trigger interval: 10
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8765 (Press CTRL+C to quit)
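When the tests are launched from a script rather than a second terminal, it may be useful to wait until the service accepts connections. The following is a minimal sketch using only the standard library. It assumes the status endpoint documented in this guide (`/api/v1/status`) answers with HTTP 200 once the service is up; `wait_for_service` and its parameters are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def wait_for_service(base_url: str = "http://localhost:8765",
                     timeout_s: float = 30.0,
                     poll_interval_s: float = 0.5) -> bool:
    """Poll GET /api/v1/status until it returns HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    url = f"{base_url.rstrip('/')}/api/v1/status"
    while True:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not reachable yet (connection refused, timeout, HTTP error)
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A caller would typically start the service as a subprocess, call `wait_for_service()`, and abort if it returns `False`.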

Step 4: Test the Service

Terminal 2 (Test):

cd /home/tengxiao/pj/ShinkaEvolve
source venv/bin/activate

# Test 1: Check service status
python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode status

Expected output:

🔍 Testing service status...
✅ Service is running!
   Uptime: 12.3s
   Trigger mode: periodic
   Trigger interval: 10

# Test 2: Simulate evolution (25 generations)
python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode simulate \
    --num-gens 25

Expected output:

🧬 Simulating evolution with 25 generations...

📤 Sending notification: gen=0, score=2.4000
   Status: skipped
   Agent triggered: False
   Reason: Not yet (last trigger at gen -1)
   Processing time: 5.2ms

📤 Sending notification: gen=1, score=2.4050
   Status: skipped
   Agent triggered: False
   Reason: Not yet (last trigger at gen -1)
   Processing time: 3.1ms

...

📤 Sending notification: gen=10, score=2.4500
   Status: success
   Agent triggered: True
   Reason: Periodic trigger (interval=10)
   Processing time: 15234.5ms
   Insights: 3 found

...

📤 Sending notification: gen=20, score=2.4950
   Status: success
   Agent triggered: True
   Reason: Periodic trigger (interval=10)
   Processing time: 12456.7ms
   Insights: 3 found

======================================================================
✅ Simulation complete!

Step 5: Check Results

The service creates/updates:

examples/circle_packing/results/.../
└── eval_agent_memory/
    ├── EVAL_AGENTS.md          # Updated by agent
    ├── auxiliary_metrics.py    # Created by agent
    └── service_state.json      # Service state (NEW)

Check service state:

cat examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215/eval_agent_memory/service_state.json
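The same check can be done from Python, which is handy inside a test. This sketch assumes only what the guide states, namely that `service_state.json` lives under `eval_agent_memory/` in the results directory. It also assumes the top-level JSON value is an object; the keys inside it are not documented here, so the code prints whatever it finds. `load_service_state` and `summarize_state` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_service_state(results_dir: str) -> dict:
    """Read eval_agent_memory/service_state.json from a results directory."""
    state_path = Path(results_dir) / "eval_agent_memory" / "service_state.json"
    with state_path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize_state(state: dict) -> str:
    """Render one 'key: value' line per top-level entry, sorted by key."""
    return "\n".join(f"{key}: {value!r}" for key, value in sorted(state.items()))
```

For example, `print(summarize_state(load_service_state(results_dir)))` gives a quick view of what the service persisted after the simulation.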

Step 6: Test Manual Trigger (Optional)

# Manually trigger agent for generation 5
python eval_agent/test_ev2_service.py \
    --results-dir "examples/circle_packing/results/results_circle_packing_NO_vision_WITH_refined_aux_20260118_205215" \
    --test-mode manual \
    --generation 5

🔌 API Documentation

The service provides these endpoints:

1. Generation Notification (Main)

curl -X POST http://localhost:8765/api/v1/notify/generation_complete \
  -H "Content-Type: application/json" \
  -d '{
    "generation": 42,
    "results_dir": "/path/to/results",
    "primary_score": 2.5407
  }'

Response:

{
  "status": "success",
  "message": "Periodic trigger (interval=10)",
  "generation": 42,
  "agent_triggered": true,
  "trigger_reason": "Periodic trigger (interval=10)",
  "insights": ["..."],
  "auxiliary_metrics": {...},
  "processing_time_ms": 15234.5
}
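On the client side, a small typed wrapper makes the response easier to work with. The sketch below mirrors the fields shown in the sample response above. Treating `status`, `generation`, and `agent_triggered` as always present and the rest as optional is an assumption, and `NotifyResponse` is a hypothetical name, not a class from the service.

```python
import json
from dataclasses import dataclass, field


@dataclass
class NotifyResponse:
    """Hypothetical client-side view of the notification response."""
    status: str
    generation: int
    agent_triggered: bool
    message: str = ""
    trigger_reason: str = ""
    insights: list = field(default_factory=list)
    auxiliary_metrics: dict = field(default_factory=dict)
    processing_time_ms: float = 0.0

    @classmethod
    def from_json(cls, text: str) -> "NotifyResponse":
        """Parse a JSON body, ignoring keys this sketch does not know about."""
        data = json.loads(text)
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in data.items() if k in known})
```

Ignoring unknown keys keeps the client working if the service later adds fields to the response.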

2. Service Status

curl http://localhost:8765/api/v1/status

3. Manual Trigger

curl -X POST "http://localhost:8765/api/v1/trigger/manual?generation=10"
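The manual trigger passes the generation as a query parameter rather than a JSON body. A Python equivalent of the curl call can be sketched with the standard library as follows. `build_manual_trigger_request` is a hypothetical helper, and the default base URL is the one used throughout this guide.

```python
import urllib.parse
import urllib.request


def build_manual_trigger_request(generation: int,
                                 base_url: str = "http://localhost:8765") -> urllib.request.Request:
    """Build a POST request for /api/v1/trigger/manual?generation=<n>."""
    query = urllib.parse.urlencode({"generation": generation})
    url = f"{base_url.rstrip('/')}/api/v1/trigger/manual?{query}"
    return urllib.request.Request(url, method="POST")
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=...)`. Agent runs in the sample output take more than ten seconds, so a generous timeout is advisable.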

4. Interactive Docs

Open in browser: http://localhost:8765/docs

🔧 Integration with ShinkaEvolve

To integrate with ShinkaEvolve, add this to EvolutionRunner:

# shinka/core/runner.py

class EvolutionRunner:
    def __init__(self, config: EvolutionConfig):
        self.config = config

        # Optional: base URL of the eval service; None disables notifications
        self.eval_service_url = getattr(config, "eval_service_url", None)

    def _evaluate_generation(self, generation: int, code_path: str, results_dir: str):
        # Run normal evaluation (unchanged)
        results, score = self.scheduler.run(code_path, results_dir)

        # Notify eval service (NEW, best-effort)
        if self.eval_service_url:
            try:
                # Imported lazily so that requests stays an optional dependency
                import requests
                requests.post(
                    f"{self.eval_service_url}/api/v1/notify/generation_complete",
                    json={
                        "generation": generation,
                        "results_dir": results_dir,
                        "primary_score": score
                    },
                    timeout=1  # Blocks for about 1 s at most; a timeout is handled below
                )
            except Exception as e:
                # Never let a notification failure (or timeout) stop evolution
                self.logger.warning(f"Eval service notification failed: {e}")

        return results, score

Changes required: ~10 lines of code!
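The requests.post call in the snippet above waits for the response or the timeout before returning. If notifications should never delay the evolution loop at all, one option is to send them from a background thread. The following is a sketch of that idea using only the standard library, not code from ShinkaEvolve: `notify_generation_async` and `_post_notification` are hypothetical names, and the 60-second default timeout is an assumption chosen so that a long agent run can still answer.

```python
import json
import logging
import threading
import urllib.request

logger = logging.getLogger(__name__)


def _post_notification(url: str, payload: dict, timeout_s: float) -> None:
    """POST the payload as JSON; log and swallow every failure."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()
    except Exception as exc:  # notification errors must not reach the caller
        logger.warning("Eval service notification failed: %s", exc)


def notify_generation_async(base_url: str, generation: int, results_dir: str,
                            primary_score: float, timeout_s: float = 60.0) -> threading.Thread:
    """Send the generation_complete notification from a daemon thread."""
    url = f"{base_url.rstrip('/')}/api/v1/notify/generation_complete"
    payload = {
        "generation": generation,
        "results_dir": results_dir,
        "primary_score": primary_score,
    }
    thread = threading.Thread(
        target=_post_notification, args=(url, payload, timeout_s), daemon=True
    )
    thread.start()
    return thread
```

The caller returns immediately and can ignore the returned thread. A daemon thread does not keep the process alive at exit, so a notification sent just before shutdown may be lost.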

📊 Service Decision Logic

The service decides autonomously when to trigger the agent:

Generation 0:  score=2.40  → SKIP (not yet, interval=10)
Generation 1:  score=2.41  → SKIP
...
Generation 10: score=2.45  → TRIGGER (periodic, interval=10) ✅
Generation 11: score=2.46  → SKIP
...
Generation 20: score=2.49  → TRIGGER (periodic, interval=10) ✅
...

With trigger_mode: "mixed":

Generation 0:  score=2.40  → SKIP
Generation 5:  score=2.40  → TRIGGER (plateau detected!) ✅
Generation 10: score=2.45  → TRIGGER (periodic) ✅
...
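The traces above can be reproduced with a small decision function. This is a sketch under stated assumptions rather than the logic in ev2_service.py: the periodic rule (`generation > 0` and divisible by the interval) is chosen because it matches the documented triggers at generations 10 and 20, and the plateau rule (the best score of the last `window` generations improves on the earlier best by at most `min_delta`) is one plausible definition. `is_plateau`, `should_trigger`, `window`, and `min_delta` are hypothetical names.

```python
from typing import List, Tuple


def is_plateau(scores: List[float], window: int = 5, min_delta: float = 1e-3) -> bool:
    """True if the last `window` scores beat the earlier best by at most `min_delta`."""
    if len(scores) <= window:
        return False  # Not enough history to compare against
    return max(scores[-window:]) - max(scores[:-window]) <= min_delta


def should_trigger(mode: str, generation: int, scores: List[float],
                   interval: int = 10) -> Tuple[bool, str]:
    """Return (trigger?, reason). `scores` holds one score per generation so far."""
    if mode == "always":
        return True, "Always trigger"
    periodic = generation > 0 and generation % interval == 0
    if mode == "periodic":
        if periodic:
            return True, f"Periodic trigger (interval={interval})"
        return False, "Not yet"
    plateau = is_plateau(scores)
    if mode == "plateau":
        return (True, "Plateau detected") if plateau else (False, "Not yet")
    if mode == "mixed":
        if periodic:
            return True, f"Periodic trigger (interval={interval})"
        if plateau:
            return True, "Plateau detected"
        return False, "Not yet"
    raise ValueError(f"Unknown trigger mode: {mode!r}")
```

Note that this plateau rule keeps firing on every generation while the score stays flat. A real service would probably add a cooldown after each trigger.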

🎯 What This Achieves

Before (without service):

# In ShinkaEvolve
for gen in range(num_generations):
    score = evaluate(gen)
    # No auxiliary metrics
    # No intelligent analysis

After (with service):

# In ShinkaEvolve (minimal change)
for gen in range(num_generations):
    score = evaluate(gen)
    notify_service(gen, score)  # ← Just one line!
    
# Service independently:
# - Decides when to analyze
# - Runs EV2 agent
# - Creates auxiliary metrics
# - Accumulates insights

✅ Success Criteria

You've successfully tested Step 1 if:

✅ Service starts without errors
✅ Service responds to notifications
✅ Service correctly skips some generations (based on strategy)
✅ Service triggers agent at the right times
✅ Agent creates/updates EVAL_AGENTS.md and auxiliary_metrics.py
✅ Service state persists (check service_state.json)
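The last two criteria can be checked mechanically. The sketch below looks for the three files this guide says the service and agent create under `eval_agent_memory/`. `missing_artifacts` is a hypothetical helper name, and the check only confirms that the files exist, not that their contents are sensible.

```python
from pathlib import Path
from typing import List

# Files the guide lists under eval_agent_memory/ after a successful run.
EXPECTED_ARTIFACTS = ("EVAL_AGENTS.md", "auxiliary_metrics.py", "service_state.json")


def missing_artifacts(results_dir: str) -> List[str]:
    """Return the expected files that are absent from eval_agent_memory/."""
    memory_dir = Path(results_dir) / "eval_agent_memory"
    return [name for name in EXPECTED_ARTIFACTS if not (memory_dir / name).is_file()]
```

An empty list means all three files are present. Anything else names what to look for in the service logs.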

🐛 Troubleshooting

Service won't start

Error: ModuleNotFoundError: No module named 'fastapi'
Fix: pip install fastapi uvicorn pyyaml

Service starts but test fails

Error: Cannot connect to service
Fix: Check that the service is running on port 8765. Try: curl http://localhost:8765/api/v1/status

Agent doesn't trigger

Check:

Is agent_enabled: true in config?
Are you sending enough generations? (interval=10 means trigger at gen 10, 20, 30...)
Check service logs in Terminal 1

Agent fails to run

Error in service logs: Primary evaluator not found
Fix: Check that the primary_evaluator path in the config is correct

🚀 Next Steps

After Step 1 works:

Step 2: Add intelligent decision-making

More sophisticated trigger strategies
Plateau detection improvements
Alert levels

Step 3: Add persistent memory

SQLite database for history
Metric tracking
Correlation analysis

Step 4: Add MetricUnit management

Object-oriented metrics
Lifecycle management
Validation system

📝 Notes

The service is decoupled from ShinkaEvolve: it does not alter the evolution process, and each notification waits at most about 1 second (the request timeout)
If the service crashes or is unreachable, ShinkaEvolve logs a warning and continues normally (fire-and-forget)
Service state is saved to disk, so it survives restarts
All agent logic from ev2.py is preserved and unchanged
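For state to survive restarts reliably, the file on disk should never be left half-written if the process dies mid-save. One common way to get that is to write a temporary file and rename it into place. The sketch below illustrates the pattern; it is not taken from ev2_service.py, and `save_state_atomic` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomic(state: dict, state_path: str) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    path = Path(state_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # Make sure the bytes reach the disk first
        os.replace(tmp_name, path)  # Atomic when source and target share a filesystem
    except Exception:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

Readers then see either the previous complete file or the new complete file, never a partial one.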

Ready to test? Start the service and run the tests! 🚀