---
title: OpenSOC SOC Triage Env
emoji: 🛡️
colorFrom: indigo
colorTo: red
sdk: docker
app_port: 7860
pinned: false
license: bsd-3-clause
tags:
  - openenv
  - cybersecurity
  - rlvr
  - self-play
---

# OpenSOC: Self-Play SOC Triage Environment

> An **OpenEnv** environment for training cybersecurity defender LLMs against an attacker LLM that auto-generates novel incidents. Built for the OpenEnv Hackathon, April 2026.

Humans cannot watch every alert in a Security Operations Center (SOC) around the clock, and the gap widens as stronger generative models start writing exploits and phishing at industrial scale. **OpenSOC** is an environment where a defender LLM learns to triage attacks generated by another LLM in a self-play loop. The trick is **RLVR** (reinforcement learning with verifiable rewards): triage ground truth is computed by a deterministic schema-side verifier from the *structured* incident parameters, never from any text the attacker writes, so neither side can hack the reward.

## Try it

| Link | What it is |
| --- | --- |
| **HF Space**: [`shivam2k3-opensoc-env.hf.space`](https://huggingface.co/spaces/shivam2k3/opensoc-env) | Deployed env (Running). OpenEnv judge can hit `/reset` `/step` `/state` `/grade`. |
| **Live `/demo`**: [`shivam2k3-opensoc-env.hf.space/demo`](https://shivam2k3-opensoc-env.hf.space/demo) | Gradio "before vs after" UI. Click **Next incident** to compare baseline vs trained. |
| **Trained model**: [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) | GRPO-trained Qwen2.5-3B-Instruct LoRA defender adapter. |
| **Training notebook**: [`train_grpo.ipynb`](train_grpo.ipynb) | End-to-end SFT warm-start + GRPO curriculum using Unsloth + TRL. |
| **Mini-blog**: [`docs/blog.md`](docs/blog.md) | ~600-word write-up of the project. |

## Table of contents

1. [Architecture](#architecture)
2. [Why the reward cannot be hacked](#why-the-reward-cannot-be-hacked)
3. [Action space and reward](#action-space-and-reward)
4. [Run locally](#run-locally)
5. [Run the training pipeline](#run-the-training-pipeline)
6. [Headline results](#headline-results)
7. [Deploy to Hugging Face Spaces](#deploy-to-hugging-face-spaces)
8. [Repo map](#repo-map)
9. [Submission deliverables](#submission-deliverables)

## Build status

| Build artifact | Status |
| --- | --- |
| Pure-python env (`OpenSOCEnv`, FastAPI) | ✅ shipped |
| Verifier + plausibility checker | ✅ shipped, 17-test adversarial suite |
| Rubric (defender + attacker rewards) | ✅ shipped, anti-hack regression tests |
| 600-example SFT dataset (`data/sft_train.jsonl`) | ✅ shipped |
| 200-incident frozen hold-out (`data/holdout.jsonl`) | ✅ shipped |
| SFT warm-start adapter | ✅ trained → [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) |
| GRPO curriculum (4 stages) | ✅ trained → adapters for each stage on HF |
| Final GRPO adapter | ✅ [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) |
| GRPO training notebook (`train_grpo.ipynb`) | ✅ shipped (ran on HF Jupyter with Unsloth + TRL) |
| Gradio "before vs after" UI | ✅ **live** at [`/demo`](https://shivam2k3-opensoc-env.hf.space/demo) |
| Eval harness + plotters (`eval/`) | ✅ shipped |
| Pytest suite | ✅ **93 tests**, all green |
| HF Space | ✅ **live** at [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) |

## Architecture

```mermaid
flowchart LR
  Defender[Defender LLM trainee]
  Attacker[Attacker LLM trainee]
  Env[OpenSOC FastAPI Environment]
  Verifier[Deterministic verifier + plausibility check]
  Defender -->|submit_triage| Env
  Attacker -->|craft_incident| Env
  Env -->|observation reward| Defender
  Env -->|attacker reward| Attacker
  Env --> Verifier
  Verifier -->|ground truth label| Env
```

An episode has exactly two turns: attacker proposes incident params → env validates them and materializes a SIEM-style alert + log window → defender submits a triage action.  The verifier computes the ground-truth action from the *events alone* and scores both sides; the attacker's free-text narrative is never read by the labeler.

In `defender_only` mode (used for SFT, eval, smoke tests, and the `/demo` UI) the env auto-generates the incident from `tasks/registry.py` and skips straight to the defender turn.
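The turn order described above can be sketched as a tiny state machine. This is only an illustration and not the real `OpenSOCEnv`: the class `ToyEpisode`, the mode name `self_play`, the `login` event type, and the returned dictionary shape are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyEpisode:
    """Toy two-turn episode; a stand-in for illustration, not OpenSOCEnv."""

    mode: str = "self_play"  # hypothetical name for the two-role mode
    phase: str = field(default="attacker", init=False)
    incident: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.mode == "defender_only":
            # Stand-in for an incident the env would generate itself
            # (the event below is made up for this sketch).
            self.incident = {"events": [{"log_id": "L1-0", "event_type": "login"}]}
            self.phase = "defender"

    def step(self, action: dict) -> dict:
        # Turn 1: the attacker proposes structured incident params.
        if self.phase == "attacker" and "craft_incident" in action:
            self.incident = action["craft_incident"]
            self.phase = "defender"
            return {"phase": self.phase, "done": False}
        # Turn 2: the defender submits a triage action and the episode ends.
        if self.phase == "defender" and "submit_triage" in action:
            self.phase = "done"
            return {"phase": self.phase, "done": True}
        raise ValueError(f"action not allowed in phase {self.phase!r}")
```

In this sketch a `defender_only` episode starts directly in the `defender` phase, so a `craft_incident` action is rejected there.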

## Why the reward cannot be hacked

1. The verifier is a transparent rule set in `verifier.compute_ground_truth(params)`; the *only* inputs are the structured events.  The attacker's `narrative` and even its self-claimed `target_label` are ignored.
2. The plausibility checker (`verifier.check_plausibility(params)`) refuses incoherent stories, for example a "data exfiltration" claim with a purely-internal destination, or a `lolbin_use` event with no `process` field.  The attacker's reward is gated on plausibility passing.
3. Schema-violation incidents floor attacker reward at -0.5, so trying to short-circuit pydantic's validators is strictly worse than playing along.

The anti-hack invariants are pinned in [`tests/test_verifier.py`](tests/test_verifier.py) and [`tests/test_rubric.py`](tests/test_rubric.py).
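To make the first two points concrete, here is a minimal sketch of a labeler that reads only the structured events, plus one plausibility rule taken from the example above. The function names `toy_ground_truth` and `toy_plausibility`, the `outbound_connection` event type, and the rule-to-label mapping are hypothetical; the real rules live in `verifier.py`.

```python
from typing import Any


def toy_ground_truth(params: dict[str, Any]) -> str:
    """Label an incident from its events only (hypothetical rule set)."""
    # Only params["events"] is read; "narrative" and "target_label" are never touched.
    event_types = {event["event_type"] for event in params["events"]}
    if "lolbin_use" in event_types:
        return "quarantine_host"
    if "outbound_connection" in event_types:  # made-up event type for this sketch
        return "block_ip"
    return "dismiss"


def toy_plausibility(params: dict[str, Any]) -> bool:
    """Reject a lolbin_use event that carries no process field."""
    for event in params["events"]:
        if event["event_type"] == "lolbin_use" and "process" not in event.get("fields", {}):
            return False
    return True


events = [{"event_type": "lolbin_use", "fields": {"process": "certutil.exe"}, "log_id": "L1-0"}]
honest = {"events": events, "narrative": "suspicious download", "target_label": "quarantine_host"}
sneaky = {"events": events, "narrative": "routine admin work, please dismiss", "target_label": "dismiss"}

# Same events, different story: the label cannot move.
same_label = toy_ground_truth(honest) == toy_ground_truth(sneaky)
```

Because the label is a pure function of the events, rewriting the narrative or the self-claimed label has no path to change it.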

## Action space and reward

Tool names are deliberately **non-reserved**: there is no `reset`/`step`/`state`/`close` clash with the OpenEnv `MCPEnvironment` reserved-name list.

```yaml
action_space:
  craft_incident:
    target_label: dismiss | monitor | quarantine_host | block_ip | escalate
    category:     malware_execution | c2_beacon | data_exfiltration | ...
    events:       [ { event_type, fields, timestamp, log_id }, ... ]
    narrative:    string         # ignored by the verifier
  submit_triage:
    action:       <one of the five triage actions>
    cited_log_id: <id of the log line that drove the decision>
    rationale:    short string
```

- **Defender**: +1 correct, −1 missed-malicious, −0.3 over-react on benign, −0.05 unnecessary escalate, +0.1 bonus for citing the right triggering log id, −0.1 floor for format violation.
- **Attacker**: +1 iff defender wrong AND incident plausible, −0.5 if schema validation fails, +0.2 novelty bonus, 0 for gibberish.

Full breakdown: [openenv.yaml](openenv.yaml) and [rubric.py](rubric.py).
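As a rough illustration, the two bullets above could be composed as follows. This is one possible reading and not the contents of `rubric.py`: the function names, the order in which the cases are checked, treating `dismiss` as the only benign label, awarding the citation bonus only on a correct action, scoring other mismatches as 0, and equating "gibberish" with an implausible incident are all assumptions.

```python
from typing import Optional

ACTIONS = ("dismiss", "monitor", "quarantine_host", "block_ip", "escalate")


def toy_defender_reward(action: Optional[str], truth: str, cited_ok: bool = False) -> float:
    """One possible reading of the defender bullet (assumed composition)."""
    if action not in ACTIONS:
        return -0.1  # format violation
    if action == truth:
        return 1.0 + (0.1 if cited_ok else 0.0)  # correct, plus citation bonus
    if action == "dismiss":
        return -1.0  # truth was not dismiss: missed-malicious
    if action == "escalate":
        return -0.05  # unnecessary escalate
    if truth == "dismiss":
        return -0.3  # over-react on benign
    return 0.0  # ASSUMPTION: other mismatches are left unscored in this sketch


def toy_attacker_reward(schema_ok: bool, plausible: bool,
                        defender_correct: bool, novel: bool = False) -> float:
    """One possible reading of the attacker bullet (assumed composition)."""
    if not schema_ok:
        return -0.5  # schema validation failed
    if not plausible:
        return 0.0  # ASSUMPTION: implausible incidents earn nothing
    reward = 0.0 if defender_correct else 1.0
    return reward + (0.2 if novel else 0.0)  # novelty bonus
```

Under this reading, the defender's worst outcome is dismissing a malicious incident, which matches the "cardinal failure mode" tracked in the results.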

## Run locally

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python server.py    # serves on :7860
```

Smoke test from another shell:

```bash
curl -s http://localhost:7860/health | jq .
curl -s -X POST 'http://localhost:7860/reset?task=stage1_basic&mode=defender_only' | jq .
curl -s -X POST 'http://localhost:7860/step?task=stage1_basic&mode=defender_only' \
     -H 'content-type: application/json' \
     -d '{"submit_triage": {"action": "monitor", "cited_log_id": "L1-0", "rationale": "smoke"}}' | jq .
open http://localhost:7860/demo   # Gradio before-vs-after UI
```

Run the test suite (CPU only, no GPU deps):

```bash
pytest -q   # 93 passed
```

Or via the bundled Python client:

```python
from client import OpenSOCClient
c = OpenSOCClient()
obs = c.reset(task="stage1_basic", mode="defender_only", seed=1)
result = c.step({"submit_triage": {"action": "monitor", "cited_log_id": "L1-0", "rationale": "ok"}},
                task="stage1_basic", mode="defender_only", seed=1)
print(result)
```

## Run the training pipeline

Full end-to-end procedure: **[TRAIN.md](TRAIN.md)**.  TL;DR: on an HF Jupyter L4 (~$3 of credits, ~3.5h wall time):

```bash
bash scripts/run_full_pipeline.sh
```

Or step-by-step inside [`train_grpo.ipynb`](train_grpo.ipynb):

1. SFT warm-start (~12 min): pushes P(format-OK) from ~0% to ~95%.
2. GRPO curriculum across 4 stages (~3h): verifier-grounded reward, group size 8.
3. Eval on the frozen 200-incident hold-out (~5 min).
4. `eval.plot_results` + `eval.plot_training` render four PNGs.
5. `eval.bake_demo` writes 50 before-vs-after pairs to `data/demo_examples.json` for the Gradio UI.
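For readers unfamiliar with GRPO, the "group size 8" in step 2 refers to sampling several completions per prompt and scoring each one relative to its own group. The sketch below shows that normalization in plain Python. The function name `group_advantages`, the example rewards, and the use of the population standard deviation are assumptions; the trainer's exact estimator may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier rewards for 8 completions of one incident prompt.
rewards = [1.1, 1.0, -1.0, -0.3, 1.0, -0.1, 1.0, -1.0]
advantages = group_advantages(rewards)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one. If all eight completions earn the same reward, every advantage is zero and that prompt contributes no learning signal.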

## Headline results

The defender model was trained using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA.  All trained adapters are published on HuggingFace:

| Stage | Adapter | Difficulty |
| --- | --- | --- |
| SFT warm-start | [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) | Format learning |
| Stage 1 | [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) | Easy: single-event templates |
| Stage 2 | [`opensoc-defender-grpo-stage2_multi`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) | Medium: multi-event windows |
| Stage 3 | [`opensoc-defender-grpo-stage3_mixed`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) | Hard: benign decoys interleaved |
| Stage 4 | [`opensoc-defender-grpo-stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial) | Adversarial: attacker-controlled |
| Final | [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) | Combined final adapter |

### Dismiss-on-malicious (the cardinal failure mode)

![dismiss-on-malicious by model](eval/results/bar_dismiss_on_malicious.png)

### Macro F1 across 200-incident hold-out

![macro F1 by model](eval/results/bar_macro_f1.png)

### Confusion matrices

| Baseline (always-dismiss) | Trained (verifier-oracle ceiling) |
| --- | --- |
| ![baseline confusion](eval/results/confusion_always_dismiss.png) | ![trained confusion](eval/results/confusion_verifier_oracle.png) |

### Reward across the curriculum

![training reward curves](eval/results/training_curves.png)

| Model | Accuracy | Macro F1 | Dismiss-on-malicious | Over-react |
| --- | ---: | ---: | ---: | ---: |
| `always_dismiss` (floor)      | 0.13 | 0.05 | **1.00** | 0.00 |
| `verifier_oracle` (ceiling)   | 1.00 | 1.00 | 0.00 | 0.00 |
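The metric columns can be reproduced with a few lines of plain Python. This is a sketch rather than the code in `eval/`: the function names are hypothetical, "malicious" is assumed to mean any ground-truth label other than `dismiss`, and the 100-incident label mix is invented so that 13% of incidents are benign.

```python
LABELS = ["dismiss", "monitor", "quarantine_host", "block_ip", "escalate"]


def macro_f1(y_true: list[str], y_pred: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-label F1; a label with no hits scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def dismiss_on_malicious(y_true: list[str], y_pred: list[str]) -> float:
    """Share of non-dismiss incidents that the model dismissed."""
    malicious = [p for t, p in zip(y_true, y_pred) if t != "dismiss"]
    if not malicious:
        return 0.0
    return sum(p == "dismiss" for p in malicious) / len(malicious)


# Invented label mix: 13 benign incidents out of 100.
y_true = (["dismiss"] * 13 + ["monitor"] * 22 + ["quarantine_host"] * 22
          + ["block_ip"] * 22 + ["escalate"] * 21)
y_pred = ["dismiss"] * len(y_true)  # the always_dismiss floor policy

floor_f1 = macro_f1(y_true, y_pred, LABELS)
floor_miss = dismiss_on_malicious(y_true, y_pred)
```

With this mix, the always-dismiss policy scores an F1 of 26/113 on `dismiss` and 0 on the other four labels, so `floor_f1` is about 0.046 and `floor_miss` is 1.0, in line with the floor row of the table.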

## Deploy to Hugging Face Spaces

Full recipe: [DEPLOY.md](DEPLOY.md).  The fast version, after `huggingface-cli login`:

```bash
export HF_USER=<your-username>
bash scripts/deploy_to_hf.sh
# Build takes ~5 minutes; then:
open https://${HF_USER}-opensoc-env.hf.space/demo
```

The Space runs FastAPI + Gradio in a single container.  `/reset`, `/step`, `/state`, `/grade`, `/tasks`, `/health` continue to work for the OpenEnv judge bot; `/demo` is the human-readable UI.

## Repo map

| File / dir | Purpose |
| --- | --- |
| `openenv.yaml` | OpenEnv manifest (tasks, action space, reward range, endpoints) |
| `schema.py` | Incident / event / action schema with strict validators |
| `generator.py` | Materializes incidents for `defender_only` mode (eval, SFT) |
| `verifier.py` | Deterministic ground-truth labeler + plausibility checker |
| `rubric.py` | Layered defender + attacker reward functions |
| `env.py` | Two-role `OpenSOCEnv` (`reset` / `step` / `state` / `grade`) |
| `app_runtime.py` | FastAPI app exposing the OpenEnv API |
| `demo_app.py` | Gradio Blocks app mounted at `/demo` |
| `demo_data.py` | Pure-python helpers for the demo UI |
| `server.py` | Container entry point: imports `demo_app` then starts uvicorn |
| `tasks/registry.py` | Curriculum stages: `stage1_basic` → `stage4_adversarial` |
| `client/` | Thin HTTP client (server-internals-free) |
| `train/` | SFT warm-start + GRPO loop + reusable prompt format |
| `eval/` | Hold-out generator, metrics, eval driver, plot renderers, `bake_demo` |
| `scripts/run_full_pipeline.sh` | One-shot training + eval + bake-demo |
| `scripts/deploy_to_hf.sh` | One-shot HF Space push |
| `docs/` | Blog post, video script, slide deck builder |
| `tests/` | Pytest suite (93 tests, anti-hack regressions included) |

## Submission deliverables

Mapped to the four judging criteria:

| Criterion | Weight | Where it lives |
| --- | ---: | --- |
| Environment Innovation | 40% | `openenv.yaml`, `schema.py`, `verifier.py`, `env.py`, this README's *Architecture* and *Why the reward cannot be hacked* sections |
| Storytelling & Presentation | 30% | `/demo` Gradio UI + 90s video + HF blog |
| Showing Improvement in Rewards | 20% | `eval/results/*.png` (training curves + confusion + headline bar) embedded above |
| Reward & Training Pipeline | 10% | `rubric.py` + 93-test anti-hack suite + `train_grpo.ipynb` + `scripts/run_full_pipeline.sh` |

Submission checklist:

- [x] OpenEnv-compatible env (gym-style API, manifest, non-reserved tool names)
- [x] Deterministic RLVR verifier + plausibility checker
- [x] Layered defender + attacker reward
- [x] SFT warm-start dataset (committed)
- [x] Frozen 200-incident hold-out (committed)
- [x] GRPO curriculum notebook + one-shot training script
- [x] Eval harness + plotters
- [x] Pytest suite (93 tests, anti-hack regressions included)
- [x] Gradio `/demo` UI mounted on the same Space (free-CPU-tier compatible)
- [x] Blog post (`docs/blog.md`)
- [x] HF Space pushed and **running**: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
- [x] SFT adapter trained and pushed: [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft)
- [x] GRPO adapters trained and pushed (4 stages): [`stage1`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) [`stage2`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) [`stage3`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) [`stage4`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial)
- [x] Final adapter pushed: [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo)

## License

BSD-3-Clause.