Spaces:

Jashwanth2511
/

shadowops-hackathon

Sleeping

ShadowOps Deploy

Final deploy: Monolithic ShadowOps app + Training Scripts

d064478 29 days ago

5.87 kB

ShadowOps OpenEnv Loop Evaluation

Validated Locally: The Q-aware fallback and OpenEnv safety evaluation are valid and remain the basis for current claims.
Pending GPU Validation: The trained-model evaluation has not yet produced valid metrics in the standard local environment. Trained-model evidence will be added only when GPU evaluation artifacts exist. No trained-model comparison claims are made at this stage.

Metric	Value
Total steps	250
Malicious steps	139
Benign steps	111
Accuracy	0.164
Unsafe allow rate (malicious-only)	0.000
Safe block rate (benign blocked/forked/quarantined)	1.000
Average confidence	0.428
Mean reward per step	6.896
Step pass count	139
Step fail count	111

Metric	Baseline	Target	Delta (target-baseline)
Unsafe allow rate	0.403	0.000	-0.403
Safe block rate	0.270	1.000	+0.730
Average confidence	0.775	0.428	-0.347
Mean reward/step	33.524	6.896	-26.628
Safety Adjusted Score	8.957	-5.324	-14.281

Note on Reward vs Safety: The heuristic baseline may occasionally have a higher mean_reward_per_step due to faster resolution times.
However, Q-aware is considered safer when its unsafe_allow_rate = 0.000. Unsafe allow is the primary failure mode in security automation and carries severe negative business impact.
Lower confidence scores in Q-aware do not necessarily mean failure; they often reflect cautious uncertainty on ambiguous payloads, which correctly triggers QUARANTINE instead of false-positive blocks or dangerous allows.

See openenv_behavior_comparison.md for 5-10 structured before/after scenarios.

Scenario	Baseline	Target	Baseline correct	Target correct
AWS::UPDATE_IAM	BLOCK	QUARANTINE	False	True
AWS::UPDATE_SECURITY_GROUP	BLOCK	QUARANTINE	False	True
AWS::RESIZE_INSTANCE	BLOCK	QUARANTINE	False	True
AWS::ENABLE_LOGGING	BLOCK	QUARANTINE	False	True
AWS::UPDATE_IAM	BLOCK	QUARANTINE	False	True
AWS::UPDATE_IAM	BLOCK	QUARANTINE	False	True
AWS::RESIZE_INSTANCE	BLOCK	QUARANTINE	False	True
AWS::ROTATE_KEYS	FORK	QUARANTINE	False	True

Episode	Reward	Steps	Safe blocks	Pass	Fail	Final SOC	Final GitHub	Final AWS
1	-38.000	5	1	4	1	100	100	92
2	70.000	5	5	0	5	78	100	100
3	143.000	5	4	1	4	100	92	100
4	-5.000	5	0	5	0	100	100	100
5	51.000	5	4	1	4	84	96	100
6	51.000	5	1	4	1	100	92	100
7	70.000	5	5	0	5	100	86	92
8	-38.000	5	1	4	1	92	100	100
9	-80.000	5	0	5	0	100	100	100
10	193.000	5	1	4	1	100	98	100
11	121.000	5	4	1	4	88	92	100
12	193.000	5	1	4	1	100	98	100
13	124.000	5	3	2	3	100	85	100
14	-80.000	5	1	4	1	100	98	100
15	81.000	5	2	3	2	92	100	92
16	-38.000	5	1	4	1	92	100	100
17	-38.000	5	1	4	1	100	100	92
18	91.000	5	1	4	1	92	100	100
19	-28.000	5	5	0	5	80	92	100
20	91.000	5	4	1	4	88	92	100
21	-38.000	5	1	4	1	100	100	92
22	28.000	5	5	0	5	100	90	100
23	-38.000	5	1	4	1	100	92	100
24	-38.000	5	5	0	5	100	78	100
25	58.000	5	2	3	2	92	100	92
26	70.000	5	5	0	5	92	92	100
27	58.000	5	5	0	5	86	100	92
28	58.000	5	5	0	5	100	74	92
29	91.000	5	1	4	1	92	100	100
30	-38.000	5	1	4	1	100	92	100
31	-5.000	5	0	5	0	100	100	100
32	48.000	5	3	2	3	92	84	100
33	124.000	5	3	2	3	94	100	100
34	2.000	5	1	4	1	100	100	92
35	-5.000	5	0	5	0	100	100	100
36	18.000	5	5	0	5	88	84	100
37	-28.000	5	5	0	5	92	100	74
38	58.000	5	5	0	5	92	80	100
39	28.000	5	4	1	4	100	86	100
40	193.000	5	0	5	0	100	100	100
41	35.000	5	0	5	0	100	100	100
42	168.000	5	0	5	0	100	100	100
43	143.000	5	4	1	4	100	100	86
44	-80.000	5	0	5	0	100	100	100
45	18.000	5	2	3	2	84	100	100
46	-80.000	5	1	4	1	100	100	98
47	-80.000	5	0	5	0	100	100	100
48	-80.000	5	0	5	0	100	100	100
49	51.000	5	1	4	1	92	100	100
50	51.000	5	1	4	1	92	100	100