# ShadowOps OpenEnv Loop Evaluation

- Policy evaluated: q_aware
- Baseline policy for comparison: heuristic
- Episodes: 50
- Episode max length: 5
- Seed: 42
- Run label: full_eval_50_episodes
- Scope note: FULL EVAL: 50 episodes (judge-facing run size).

## Validation Note

- **Validated Locally**: The Q-aware fallback and OpenEnv safety evaluation are valid and remain the basis for current claims.
- **Pending GPU Validation**: The trained-model evaluation has not yet produced valid metrics in the standard local environment. Trained-model evidence will be added only when GPU evaluation artifacts exist. No trained-model comparison claims are made at this stage.

## Core Metrics

| Metric | Value |
| --- | ---: |
| Total steps | 250 |
| Malicious steps | 139 |
| Benign steps | 111 |
| Accuracy | 0.164 |
| Unsafe allow rate (malicious-only) | 0.000 |
| Safe block rate (benign blocked/forked/quarantined) | 1.000 |
| Average confidence | 0.428 |
| Mean reward per step | 6.896 |
| Step pass count | 139 |
| Step fail count | 111 |

## Reward Summary

- Per-step reward mean/median/std: 6.896 / 2.000 / 31.080
- Per-episode reward mean/median/std: 34.480 / 41.500 / 76.698
- Per-episode reward min/max: -80.000 / 193.000

## Before vs After Aggregate

| Metric | Baseline | Target | Delta (target-baseline) |
| --- | ---: | ---: | ---: |
| Unsafe allow rate | 0.403 | 0.000 | -0.403 |
| Safe block rate | 0.270 | 1.000 | +0.730 |
| Average confidence | 0.775 | 0.428 | -0.347 |
| Mean reward/step | 33.524 | 6.896 | -26.628 |
| Safety Adjusted Score | 8.957 | -5.324 | -14.281 |

## Safety vs Reward Trade-Off

- **Note on Reward vs Safety**: The `heuristic` baseline may occasionally have a higher `mean_reward_per_step` due to faster resolution times.
- However, **Q-aware is considered safer** when its `unsafe_allow_rate = 0.000`. Unsafe allow is the primary failure mode in security automation and carries severe negative business impact.
- Lower confidence scores in Q-aware do not necessarily mean failure; they often reflect **cautious uncertainty** on ambiguous payloads, which correctly triggers QUARANTINE instead of false-positive blocks or dangerous allows.

## Representative Behavior

See `openenv_behavior_comparison.md` for 5-10 structured before/after scenarios.

| Scenario | Baseline | Target | Baseline correct | Target correct |
| --- | --- | --- | --- | --- |
| AWS::UPDATE_IAM | BLOCK | QUARANTINE | False | True |
| AWS::UPDATE_SECURITY_GROUP | BLOCK | QUARANTINE | False | True |
| AWS::RESIZE_INSTANCE | BLOCK | QUARANTINE | False | True |
| AWS::ENABLE_LOGGING | BLOCK | QUARANTINE | False | True |
| AWS::UPDATE_IAM | BLOCK | QUARANTINE | False | True |
| AWS::UPDATE_IAM | BLOCK | QUARANTINE | False | True |
| AWS::RESIZE_INSTANCE | BLOCK | QUARANTINE | False | True |
| AWS::ROTATE_KEYS | FORK | QUARANTINE | False | True |

## Episode Summary

| Episode | Reward | Steps | Unsafe allows | Safe blocks | Pass | Fail | Final SOC | Final GitHub | Final AWS |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 2 | 70.000 | 5 | 0 | 5 | 0 | 5 | 78 | 100 | 100 |
| 3 | 143.000 | 5 | 0 | 4 | 1 | 4 | 100 | 92 | 100 |
| 4 | -5.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 5 | 51.000 | 5 | 0 | 4 | 1 | 4 | 84 | 96 | 100 |
| 6 | 51.000 | 5 | 0 | 1 | 4 | 1 | 100 | 92 | 100 |
| 7 | 70.000 | 5 | 0 | 5 | 0 | 5 | 100 | 86 | 92 |
| 8 | -38.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 9 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 10 | 193.000 | 5 | 0 | 1 | 4 | 1 | 100 | 98 | 100 |
| 11 | 121.000 | 5 | 0 | 4 | 1 | 4 | 88 | 92 | 100 |
| 12 | 193.000 | 5 | 0 | 1 | 4 | 1 | 100 | 98 | 100 |
| 13 | 124.000 | 5 | 0 | 3 | 2 | 3 | 100 | 85 | 100 |
| 14 | -80.000 | 5 | 0 | 1 | 4 | 1 | 100 | 98 | 100 |
| 15 | 81.000 | 5 | 0 | 2 | 3 | 2 | 92 | 100 | 92 |
| 16 | -38.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 17 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 18 | 91.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 19 | -28.000 | 5 | 0 | 5 | 0 | 5 | 80 | 92 | 100 |
| 20 | 91.000 | 5 | 0 | 4 | 1 | 4 | 88 | 92 | 100 |
| 21 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 22 | 28.000 | 5 | 0 | 5 | 0 | 5 | 100 | 90 | 100 |
| 23 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 92 | 100 |
| 24 | -38.000 | 5 | 0 | 5 | 0 | 5 | 100 | 78 | 100 |
| 25 | 58.000 | 5 | 0 | 2 | 3 | 2 | 92 | 100 | 92 |
| 26 | 70.000 | 5 | 0 | 5 | 0 | 5 | 92 | 92 | 100 |
| 27 | 58.000 | 5 | 0 | 5 | 0 | 5 | 86 | 100 | 92 |
| 28 | 58.000 | 5 | 0 | 5 | 0 | 5 | 100 | 74 | 92 |
| 29 | 91.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 30 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 92 | 100 |
| 31 | -5.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 32 | 48.000 | 5 | 0 | 3 | 2 | 3 | 92 | 84 | 100 |
| 33 | 124.000 | 5 | 0 | 3 | 2 | 3 | 94 | 100 | 100 |
| 34 | 2.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 35 | -5.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 36 | 18.000 | 5 | 0 | 5 | 0 | 5 | 88 | 84 | 100 |
| 37 | -28.000 | 5 | 0 | 5 | 0 | 5 | 92 | 100 | 74 |
| 38 | 58.000 | 5 | 0 | 5 | 0 | 5 | 92 | 80 | 100 |
| 39 | 28.000 | 5 | 0 | 4 | 1 | 4 | 100 | 86 | 100 |
| 40 | 193.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 41 | 35.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 42 | 168.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 43 | 143.000 | 5 | 0 | 4 | 1 | 4 | 100 | 100 | 86 |
| 44 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 45 | 18.000 | 5 | 0 | 2 | 3 | 2 | 84 | 100 | 100 |
| 46 | -80.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 98 |
| 47 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 48 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 49 | 51.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 50 | 51.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |

## Plot

- Episode reward trend plot: `openenv_episode_rewards.png`
- Color mapping in the plot: red=baseline policy, green=target policy.
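The Delta column in the Before vs After Aggregate table is defined as target minus baseline, and the per-episode reward mean should equal the per-step mean times the episode length when every episode runs exactly 5 steps. The sketch below recomputes both as a sanity check. It is illustrative only and not part of the evaluation harness: the names `AGGREGATE`, `deltas`, and `per_episode_mean` are hypothetical, and the values are copied by hand from the tables in this report.

```python
from decimal import Decimal

# (metric, baseline, target), copied by hand from the
# "Before vs After Aggregate" table in this report.
AGGREGATE = [
    ("Unsafe allow rate", "0.403", "0.000"),
    ("Safe block rate", "0.270", "1.000"),
    ("Average confidence", "0.775", "0.428"),
    ("Mean reward/step", "33.524", "6.896"),
    ("Safety Adjusted Score", "8.957", "-5.324"),
]


def deltas(rows):
    """Return {metric: target - baseline}, using Decimal to avoid float noise."""
    return {
        name: Decimal(target) - Decimal(baseline)
        for name, baseline, target in rows
    }


def per_episode_mean(mean_reward_per_step, steps_per_episode):
    """Per-episode mean reward, assuming every episode has the same length."""
    return Decimal(mean_reward_per_step) * steps_per_episode


if __name__ == "__main__":
    for name, delta in deltas(AGGREGATE).items():
        print(f"{name}: {delta:+.3f}")
    print(f"Per-episode mean: {per_episode_mean('6.896', 5):.3f}")
```

Decimal arithmetic is used here so that three-decimal values from the report compare exactly against the printed deltas; plain floats would need a tolerance.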