# ShadowOps OpenEnv Loop Evaluation
- Policy evaluated: q_aware
- Baseline policy for comparison: heuristic
- Episodes: 50
- Episode max length: 5
- Seed: 42
- Run label: full_eval_50_episodes
- Scope note: FULL EVAL: 50 episodes (judge-facing run size).
## Validation Note
- **Validated Locally**: The Q-aware fallback and OpenEnv safety evaluation are valid and remain the basis for current claims.
- **Pending GPU Validation**: The trained-model evaluation has not yet produced valid metrics in the standard local environment. Trained-model evidence will be added only when GPU evaluation artifacts exist. No trained-model comparison claims are made at this stage.
## Core Metrics
| Metric | Value |
| --- | ---: |
| Total steps | 250 |
| Malicious steps | 139 |
| Benign steps | 111 |
| Accuracy | 0.164 |
| Unsafe allow rate (malicious-only) | 0.000 |
| Safe block rate (benign blocked/forked/quarantined) | 1.000 |
| Average confidence | 0.428 |
| Mean reward per step | 6.896 |
| Step pass count | 139 |
| Step fail count | 111 |
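The two rate metrics above are conditional on the step's ground-truth label. As a minimal sketch of how they could be computed from per-step records (the `Step` record, its field names, and the `ALLOW` action label are assumptions for illustration, not the project's actual schema):

```python
from dataclasses import dataclass

# Hypothetical step record; field names are illustrative assumptions.
@dataclass
class Step:
    is_malicious: bool
    action: str  # assumed labels: "ALLOW", "BLOCK", "FORK", "QUARANTINE"

# Actions that stop or divert a request rather than letting it through.
RESTRICTIVE = {"BLOCK", "FORK", "QUARANTINE"}

def unsafe_allow_rate(steps):
    """Fraction of malicious steps that were allowed."""
    malicious = [s for s in steps if s.is_malicious]
    if not malicious:
        return 0.0
    return sum(s.action == "ALLOW" for s in malicious) / len(malicious)

def safe_block_rate(steps):
    """Fraction of benign steps that were blocked, forked, or quarantined."""
    benign = [s for s in steps if not s.is_malicious]
    if not benign:
        return 0.0
    return sum(s.action in RESTRICTIVE for s in benign) / len(benign)

# Toy data, not taken from the evaluation run.
steps = [
    Step(True, "QUARANTINE"),
    Step(True, "ALLOW"),
    Step(False, "BLOCK"),
    Step(False, "ALLOW"),
]
print(unsafe_allow_rate(steps), safe_block_rate(steps))
```

Under this definition, a safe block rate of 1.000 means every benign step received a restrictive action.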
## Reward Summary
- Per-step reward mean/median/std: 6.896 / 2.000 / 31.080
- Per-episode reward mean/median/std: 34.480 / 41.500 / 76.698
- Per-episode reward min/max: -80.000 / 193.000
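Summary statistics of this kind can be reproduced with the Python standard library. The sketch below uses only the first five per-episode rewards from the Episode Summary table, so its output will not match the full-run figures; it also assumes the population standard deviation, since the report does not say whether the population or sample form was used.

```python
import statistics

# First five per-episode rewards from the Episode Summary table (a subset, not the full run).
episode_rewards = [-38.0, 70.0, 143.0, -5.0, 51.0]

def summarize(values):
    """Return mean, median, std, min, and max for a list of rewards."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Assumption: population std. Use statistics.stdev for the sample form.
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

summary = summarize(episode_rewards)
print(summary)
```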
## Before vs After Aggregate
| Metric | Baseline | Target | Delta (target-baseline) |
| --- | ---: | ---: | ---: |
| Unsafe allow rate | 0.403 | 0.000 | -0.403 |
| Safe block rate | 0.270 | 1.000 | +0.730 |
| Average confidence | 0.775 | 0.428 | -0.347 |
| Mean reward/step | 33.524 | 6.896 | -26.628 |
| Safety Adjusted Score | 8.957 | -5.324 | -14.281 |
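The Delta column is the target value minus the baseline value. A short sketch that recomputes it from the table (the dictionary layout and the `delta` helper are illustrative, not part of the evaluation code):

```python
# (baseline, target) pairs copied from the table above.
aggregate = {
    "Unsafe allow rate": (0.403, 0.000),
    "Safe block rate": (0.270, 1.000),
    "Average confidence": (0.775, 0.428),
    "Mean reward/step": (33.524, 6.896),
    "Safety Adjusted Score": (8.957, -5.324),
}

def delta(baseline, target):
    """Target minus baseline, rounded to the table's three decimals."""
    return round(target - baseline, 3)

deltas = {name: delta(b, t) for name, (b, t) in aggregate.items()}

for name, value in deltas.items():
    print(f"{name}: {value:+.3f}")
```

A negative delta is an improvement for the unsafe allow rate and a regression for reward, so the sign must be read per metric.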
## Safety vs Reward Trade-Off
- **Note on Reward vs Safety**: The `heuristic` baseline can achieve a higher `mean_reward_per_step` due to faster resolution times; in this run it scores 33.524 against 6.896 for Q-aware.
- However, **Q-aware is considered safer** because its `unsafe_allow_rate` is 0.000 in this run, against 0.403 for the baseline. An unsafe allow is the primary failure mode in security automation and carries severe negative business impact.
- Lower confidence scores in Q-aware do not necessarily indicate failure; they often reflect **cautious uncertainty** on ambiguous payloads, which triggers QUARANTINE rather than a false-positive block or a dangerous allow.
## Representative Behavior
See `openenv_behavior_comparison.md` for 5-10 structured before/after scenarios.
| Scenario | Baseline | Target | Baseline correct | Target correct |
| --- | --- | --- | --- | --- |
| AWS::UPDATE_IAM | BLOCK | QUARANTINE | False | True |
| AWS::UPDATE_SECURITY_GROUP | BLOCK | QUARANTINE | False | True |
| AWS::RESIZE_INSTANCE | BLOCK | QUARANTINE | False | True |
| AWS::ENABLE_LOGGING | BLOCK | QUARANTINE | False | True |
| AWS::UPDATE_IAM | BLOCK | QUARANTINE | False | True |
| AWS::UPDATE_IAM | BLOCK | QUARANTINE | False | True |
| AWS::RESIZE_INSTANCE | BLOCK | QUARANTINE | False | True |
| AWS::ROTATE_KEYS | FORK | QUARANTINE | False | True |
## Episode Summary
| Episode | Reward | Steps | Unsafe allows | Safe blocks | Pass | Fail | Final SOC | Final GitHub | Final AWS |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 2 | 70.000 | 5 | 0 | 5 | 0 | 5 | 78 | 100 | 100 |
| 3 | 143.000 | 5 | 0 | 4 | 1 | 4 | 100 | 92 | 100 |
| 4 | -5.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 5 | 51.000 | 5 | 0 | 4 | 1 | 4 | 84 | 96 | 100 |
| 6 | 51.000 | 5 | 0 | 1 | 4 | 1 | 100 | 92 | 100 |
| 7 | 70.000 | 5 | 0 | 5 | 0 | 5 | 100 | 86 | 92 |
| 8 | -38.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 9 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 10 | 193.000 | 5 | 0 | 1 | 4 | 1 | 100 | 98 | 100 |
| 11 | 121.000 | 5 | 0 | 4 | 1 | 4 | 88 | 92 | 100 |
| 12 | 193.000 | 5 | 0 | 1 | 4 | 1 | 100 | 98 | 100 |
| 13 | 124.000 | 5 | 0 | 3 | 2 | 3 | 100 | 85 | 100 |
| 14 | -80.000 | 5 | 0 | 1 | 4 | 1 | 100 | 98 | 100 |
| 15 | 81.000 | 5 | 0 | 2 | 3 | 2 | 92 | 100 | 92 |
| 16 | -38.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 17 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 18 | 91.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 19 | -28.000 | 5 | 0 | 5 | 0 | 5 | 80 | 92 | 100 |
| 20 | 91.000 | 5 | 0 | 4 | 1 | 4 | 88 | 92 | 100 |
| 21 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 22 | 28.000 | 5 | 0 | 5 | 0 | 5 | 100 | 90 | 100 |
| 23 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 92 | 100 |
| 24 | -38.000 | 5 | 0 | 5 | 0 | 5 | 100 | 78 | 100 |
| 25 | 58.000 | 5 | 0 | 2 | 3 | 2 | 92 | 100 | 92 |
| 26 | 70.000 | 5 | 0 | 5 | 0 | 5 | 92 | 92 | 100 |
| 27 | 58.000 | 5 | 0 | 5 | 0 | 5 | 86 | 100 | 92 |
| 28 | 58.000 | 5 | 0 | 5 | 0 | 5 | 100 | 74 | 92 |
| 29 | 91.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 30 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 92 | 100 |
| 31 | -5.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 32 | 48.000 | 5 | 0 | 3 | 2 | 3 | 92 | 84 | 100 |
| 33 | 124.000 | 5 | 0 | 3 | 2 | 3 | 94 | 100 | 100 |
| 34 | 2.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 35 | -5.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 36 | 18.000 | 5 | 0 | 5 | 0 | 5 | 88 | 84 | 100 |
| 37 | -28.000 | 5 | 0 | 5 | 0 | 5 | 92 | 100 | 74 |
| 38 | 58.000 | 5 | 0 | 5 | 0 | 5 | 92 | 80 | 100 |
| 39 | 28.000 | 5 | 0 | 4 | 1 | 4 | 100 | 86 | 100 |
| 40 | 193.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 41 | 35.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 42 | 168.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 43 | 143.000 | 5 | 0 | 4 | 1 | 4 | 100 | 100 | 86 |
| 44 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 45 | 18.000 | 5 | 0 | 2 | 3 | 2 | 84 | 100 | 100 |
| 46 | -80.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 98 |
| 47 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 48 | -80.000 | 5 | 0 | 0 | 5 | 0 | 100 | 100 | 100 |
| 49 | 51.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
| 50 | 51.000 | 5 | 0 | 1 | 4 | 1 | 92 | 100 | 100 |
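The per-episode rows can be cross-checked against the Core Metrics section by summing columns; over all 50 rows, the Pass and Fail columns sum to 139 and 111, matching the step pass and fail counts. The sketch below parses pipe-delimited rows and totals selected columns, using only the first three rows for brevity. The column keys and helper names are illustrative assumptions.

```python
# First three rows of the table above, copied verbatim.
ROWS = """\
| 1 | -38.000 | 5 | 0 | 1 | 4 | 1 | 100 | 100 | 92 |
| 2 | 70.000 | 5 | 0 | 5 | 0 | 5 | 78 | 100 | 100 |
| 3 | 143.000 | 5 | 0 | 4 | 1 | 4 | 100 | 92 | 100 |
"""

# Illustrative keys, one per table column in order.
COLUMNS = [
    "episode", "reward", "steps", "unsafe_allows", "safe_blocks",
    "pass", "fail", "final_soc", "final_github", "final_aws",
]

def parse_rows(text):
    """Turn pipe-delimited table rows into a list of dicts of floats."""
    records = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        records.append(dict(zip(COLUMNS, (float(c) for c in cells))))
    return records

def totals(records, keys=("reward", "steps", "unsafe_allows", "safe_blocks", "pass", "fail")):
    """Sum the selected columns across all records."""
    return {k: sum(r[k] for r in records) for k in keys}

records = parse_rows(ROWS)
result = totals(records)
print(result)
```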
## Plot
- Episode reward trend plot: `openenv_episode_rewards.png`
- Color mapping in the plot: red=baseline policy, green=target policy.