	---
	title: OpenSOC SOC Triage Env
	emoji: 🛡️
	colorFrom: indigo
	colorTo: red
	sdk: docker
	app_port: 7860
	pinned: false
	license: bsd-3-clause
	tags:
	- openenv
	- cybersecurity
	- rlvr
	- self-play
	---

	# OpenSOC: Self-Play SOC Triage Environment

	> An OpenEnv environment for training cybersecurity defender LLMs against an attacker LLM that auto-generates novel incidents. Built for the OpenEnv Hackathon, April 2026.

	Humans cannot watch every alert in a Security Operations Center (SOC) around the clock, and as stronger generative models start writing exploits and phishing at industrial scale, that gap only widens. OpenSOC is an environment where a defender LLM learns to triage attacks generated by another LLM in a self-play loop. The trick is reinforcement learning with verifiable rewards (RLVR): triage ground truth is computed by a deterministic schema-side verifier from the structured incident parameters, never from any text the attacker writes, so neither side can hack the reward.

	## Try it

	| Link | What it is |
	| --- | --- |
	| HF Space — [`shivam2k3-opensoc-env.hf.space`](https://huggingface.co/spaces/shivam2k3/opensoc-env) | Deployed env (Running). OpenEnv judge can hit `/reset` `/step` `/state` `/grade`. |
	| Live `/demo` — [`shivam2k3-opensoc-env.hf.space/demo`](https://shivam2k3-opensoc-env.hf.space/demo) | Gradio "before vs after" UI. Click Next incident to compare baseline vs trained. |
	| Trained model — [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) | GRPO-trained Qwen2.5-3B-Instruct LoRA defender adapter. |
	| Training notebook — [`train_grpo.ipynb`](train_grpo.ipynb) | End-to-end SFT warm-start + GRPO curriculum using Unsloth + TRL. |
	| Mini-blog — [`docs/blog.md`](docs/blog.md) | ~600-word write-up of the project. |

	## Table of contents

	1. [Architecture](#architecture)
	2. [Why the reward cannot be hacked](#why-the-reward-cannot-be-hacked)
	3. [Action space and reward](#action-space-and-reward)
	4. [Run locally](#run-locally)
	5. [Run the training pipeline](#run-the-training-pipeline)
	6. [Headline results](#headline-results)
	7. [Deploy to Hugging Face Spaces](#deploy-to-hugging-face-spaces)
	8. [Repo map](#repo-map)
	9. [Submission deliverables](#submission-deliverables)

	## Build status

	| Build artifact | Status |
	| --- | --- |
	| Pure-python env (`OpenSOCEnv`, FastAPI) | ✅ shipped |
	| Verifier + plausibility checker | ✅ shipped, 17-test adversarial suite |
	| Rubric (defender + attacker rewards) | ✅ shipped, anti-hack regression tests |
	| 600-example SFT dataset (`data/sft_train.jsonl`) | ✅ shipped |
	| 200-incident frozen hold-out (`data/holdout.jsonl`) | ✅ shipped |
	| SFT warm-start adapter | ✅ trained → [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) |
	| GRPO curriculum (4 stages) | ✅ trained → adapters for each stage on HF |
	| Final GRPO adapter | ✅ [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) |
	| GRPO training notebook (`train_grpo.ipynb`) | ✅ shipped (ran on HF Jupyter with Unsloth + TRL) |
	| Gradio "before vs after" UI | ✅ live at [`/demo`](https://shivam2k3-opensoc-env.hf.space/demo) |
	| Eval harness + plotters (`eval/`) | ✅ shipped |
	| Pytest suite | ✅ 93 tests, all green |
	| HF Space | ✅ live at [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) |

	## Architecture

	```mermaid
	flowchart LR
	    Defender[Defender LLM trainee]
	    Attacker[Attacker LLM trainee]
	    Env[OpenSOC FastAPI Environment]
	    Verifier[Deterministic verifier + plausibility check]
	    Defender -->|submit_triage| Env
	    Attacker -->|craft_incident| Env
	    Env -->|observation reward| Defender
	    Env -->|attacker reward| Attacker
	    Env --> Verifier
	    Verifier -->|ground truth label| Env
	```

	An episode has exactly two turns: attacker proposes incident params → env validates them and materializes a SIEM-style alert + log window → defender submits a triage action. The verifier computes the ground-truth action from the events alone and scores both sides — the attacker's free-text narrative is never read by the labeler.

	In `defender_only` mode (used for SFT, eval, smoke tests, and the `/demo` UI) the env auto-generates the incident from `tasks/registry.py` and skips straight to the defender turn.
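
The two-turn flow described above can be sketched as a tiny state machine. This is an illustration only, not the real `OpenSOCEnv`: the class name `EpisodeSketch`, the `"self_play"` mode string, and the returned turn names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeSketch:
    """Toy two-turn episode; a stand-in for the real environment."""

    mode: str = "self_play"  # hypothetical name for the two-role mode
    incident: Optional[dict] = None
    done: bool = False

    def reset(self, auto_incident: Optional[dict] = None) -> str:
        """Start an episode and return whose turn comes first."""
        self.incident, self.done = None, False
        if self.mode == "defender_only":
            # The env supplies the incident itself, so the attacker turn is skipped.
            self.incident = auto_incident
            return "defender"
        return "attacker"

    def step(self, role: str, action: dict) -> str:
        """Advance one turn and return who acts next ("done" ends the episode)."""
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        if role == "attacker" and self.incident is None:
            self.incident = action["craft_incident"]
            return "defender"
        if role == "defender" and self.incident is not None:
            self.done = True
            return "done"
        raise ValueError(f"unexpected {role} turn")
```

In self-play the attacker acts first and the defender second. In `defender_only` mode, `reset` hands the turn straight to the defender.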

	## Why the reward cannot be hacked

	1. The verifier is a transparent rule set in `verifier.compute_ground_truth(params)`; the only inputs are the structured events. The attacker's `narrative` and even its self-claimed `target_label` are ignored.
	2. The plausibility checker (`verifier.check_plausibility(params)`) refuses incoherent stories — for example, a "data exfiltration" claim with a purely-internal destination, or a `lolbin_use` event with no `process` field. The attacker's reward is gated on plausibility passing.
	3. Schema-violation incidents floor attacker reward at -0.5, so trying to short-circuit pydantic's validators is strictly worse than playing along.

	The anti-hack invariants are pinned in [`tests/test_verifier.py`](tests/test_verifier.py) and [`tests/test_rubric.py`](tests/test_rubric.py).
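
To make the first two points concrete, here is a minimal sketch of a labeler that reads only the structured events. The rule table, the severity ordering, and the function names `sketch_ground_truth` / `sketch_plausibility` are all hypothetical. The real rules live in `verifier.py` and are not reproduced here. Only the `lolbin_use` / `process` plausibility check is taken from the text above.

```python
# Hypothetical rule table: event_type -> triage action (illustration only).
RULES = {
    "lolbin_use": "quarantine_host",
    "beacon": "block_ip",          # assumed event type
    "failed_login": "monitor",     # assumed event type
}
# Assumed ordering used to pick the strongest action among several events.
SEVERITY = ["dismiss", "monitor", "block_ip", "quarantine_host", "escalate"]


def sketch_ground_truth(params: dict) -> str:
    """Label an incident from its events; narrative and target_label are never read."""
    label = "dismiss"
    for event in params.get("events", []):
        candidate = RULES.get(event.get("event_type"), "dismiss")
        if SEVERITY.index(candidate) > SEVERITY.index(label):
            label = candidate
    return label


def sketch_plausibility(params: dict) -> bool:
    """Reject incoherent incidents, e.g. a lolbin_use event with no process field."""
    for event in params.get("events", []):
        if event.get("event_type") == "lolbin_use" and "process" not in event.get("fields", {}):
            return False
    return True
```

Because neither function touches `narrative` or `target_label`, two incidents with identical events always receive the same label, whatever the attacker writes in the free-text fields.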

	## Action space and reward

	Tool names are deliberately non-reserved — there is no `reset`/`step`/`state`/`close` clash with the OpenEnv `MCPEnvironment` reserved-name list.

	```yaml
	action_space:
	  craft_incident:
	    target_label: dismiss | monitor | quarantine_host | block_ip | escalate
	    category: malware_execution | c2_beacon | data_exfiltration | ...
	    events: [ { event_type, fields, timestamp, log_id }, ... ]
	    narrative: string  # ignored by the verifier
	  submit_triage:
	    action: <one of the five triage actions>
	    cited_log_id: <id of the log line that drove the decision>
	    rationale: short string
	```
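
As a rough illustration of the `submit_triage` shape above, the sketch below validates a payload with the standard library only. The project's own schema in `schema.py` uses stricter validators. `SubmitTriage` and `parse_submit_triage` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# The five triage actions listed under target_label in the action space.
TRIAGE_ACTIONS = ("dismiss", "monitor", "quarantine_host", "block_ip", "escalate")


@dataclass(frozen=True)
class SubmitTriage:
    action: str
    cited_log_id: str
    rationale: str


def parse_submit_triage(payload: dict) -> SubmitTriage:
    """Validate a {"submit_triage": {...}} payload and return a typed record."""
    body = payload.get("submit_triage")
    if not isinstance(body, dict):
        raise ValueError("payload must contain a 'submit_triage' object")
    missing = {"action", "cited_log_id", "rationale"} - set(body)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if body["action"] not in TRIAGE_ACTIONS:
        raise ValueError(f"unknown action: {body['action']!r}")
    return SubmitTriage(body["action"], str(body["cited_log_id"]), str(body["rationale"]))
```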

	- Defender: +1 correct, −1 missed-malicious, −0.3 over-react on benign, −0.05 unnecessary escalate, +0.1 bonus for citing the right triggering log id, −0.1 floor for format violation.
	- Attacker: +1 iff defender wrong AND incident plausible, −0.5 if schema validation fails, +0.2 novelty bonus, 0 for gibberish.

	Full breakdown: [openenv.yaml](openenv.yaml) and [rubric.py](rubric.py).
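
The reward values listed above can be read as the following sketch. The numbers come from the bullets, but the way they combine is an assumption: here a format violation returns −0.1 outright, "missed-malicious" means predicting `dismiss` on a non-`dismiss` ground truth, other wrong answers score 0, and the citation and novelty bonuses are simply added on top. `rubric.py` is the authoritative version and may differ.

```python
def defender_reward(pred: str, truth: str, cited_ok: bool, format_ok: bool = True) -> float:
    """Sketch of the defender reward; the combination rules are assumptions."""
    if not format_ok:
        return -0.1                      # format violation
    if pred == truth:
        base = 1.0                       # correct triage
    elif pred == "dismiss":
        base = -1.0                      # missed a malicious incident
    elif truth == "dismiss":
        base = -0.3                      # over-reacted on a benign incident
    elif pred == "escalate":
        base = -0.05                     # unnecessary escalation
    else:
        base = 0.0                       # other mistakes: assumed neutral
    return base + (0.1 if cited_ok else 0.0)


def attacker_reward(schema_ok: bool, plausible: bool, defender_correct: bool,
                    novel: bool = False) -> float:
    """Sketch of the attacker reward, gated on schema validity and plausibility."""
    if not schema_ok:
        return -0.5                      # schema validation failed
    if not plausible:
        return 0.0                       # implausible incident, treated as gibberish
    base = 0.0 if defender_correct else 1.0
    return base + (0.2 if novel else 0.0)
```

Under these assumptions the attacker can only profit from incidents that pass both the schema and the plausibility check, which is the gating described in the previous section.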

	## Run locally

	```bash
	python -m venv .venv && source .venv/bin/activate
	pip install -r requirements.txt
	python server.py # serves on :7860
	```

	Smoke test from another shell:

	```bash
	curl -s http://localhost:7860/health | jq .
	curl -s -X POST 'http://localhost:7860/reset?task=stage1_basic&mode=defender_only' | jq .
	curl -s -X POST 'http://localhost:7860/step?task=stage1_basic&mode=defender_only' \
	  -H 'content-type: application/json' \
	  -d '{"submit_triage": {"action": "monitor", "cited_log_id": "L1-0", "rationale": "smoke"}}' | jq .
	open http://localhost:7860/demo  # Gradio before-vs-after UI (macOS; use xdg-open on Linux)
	```

	Run the test suite (CPU only, no GPU deps):

	```bash
	pytest -q # 93 passed
	```

	Or via the bundled Python client:

	```python
	from client import OpenSOCClient

	c = OpenSOCClient()
	obs = c.reset(task="stage1_basic", mode="defender_only", seed=1)
	result = c.step(
	    {"submit_triage": {"action": "monitor", "cited_log_id": "L1-0", "rationale": "ok"}},
	    task="stage1_basic", mode="defender_only", seed=1,
	)
	print(result)
	```

	## Run the training pipeline

	Full end-to-end procedure: [TRAIN.md](TRAIN.md). TL;DR — on an HF Jupyter L4 (~$3 of credits, ~3.5h wall time):

	```bash
	bash scripts/run_full_pipeline.sh
	```

	Or step-by-step inside [`train_grpo.ipynb`](train_grpo.ipynb):

	1. SFT warm-start (~12 min) — pushes P(format-OK) from ~0% to ~95%.
	2. GRPO curriculum across 4 stages (~3h) — verifier-grounded reward, group size 8.
	3. Eval on the frozen 200-incident hold-out (~5 min).
	4. `eval.plot_results` + `eval.plot_training` render four PNGs.
	5. `eval.bake_demo` writes 50 before-vs-after pairs to `data/demo_examples.json` for the Gradio UI.
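
For readers new to GRPO, step 2 relies on comparing completions within a group. The sketch below shows the group-relative advantage in its common textbook form: each reward is centered on the group mean and scaled by the group standard deviation. It is not the training code from `train/`. The function name, the `eps` value, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's completions against each other."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 completions for the same incident, scored by the verifier-grounded reward.
group = [1.1, 1.0, -1.0, -0.3, 1.0, -0.05, 1.0, -1.0]
advantages = group_relative_advantages(group)
```

Completions that beat their group's average get a positive advantage and the rest a negative one. If all eight completions earn the same reward, every advantage is zero and that group contributes no learning signal.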

	## Headline results

	The defender is a Qwen2.5-3B-Instruct LoRA adapter trained with GRPO over a 4-stage curriculum, after an SFT warm-start. All trained adapters are published on Hugging Face:

	| Stage | Adapter | Difficulty |
	| --- | --- | --- |
	| SFT warm-start | [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) | Format learning |
	| Stage 1 | [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) | Easy — single-event templates |
	| Stage 2 | [`opensoc-defender-grpo-stage2_multi`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) | Medium — multi-event windows |
	| Stage 3 | [`opensoc-defender-grpo-stage3_mixed`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) | Hard — benign decoys interleaved |
	| Stage 4 | [`opensoc-defender-grpo-stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial) | Adversarial — attacker-controlled |
	| Final | [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) | Combined final adapter |

	### Dismiss-on-malicious (the cardinal failure mode)

	![dismiss-on-malicious by model](eval/results/bar_dismiss_on_malicious.png)

	### Macro F1 across 200-incident hold-out

	![macro F1 by model](eval/results/bar_macro_f1.png)

	### Confusion matrices

	| Baseline (always-dismiss) | Trained (verifier-oracle ceiling) |
	| --- | --- |
	| ![baseline confusion](eval/results/confusion_always_dismiss.png) | ![trained confusion](eval/results/confusion_verifier_oracle.png) |

	### Reward across the curriculum

	![training reward curves](eval/results/training_curves.png)

	| Model | Accuracy | Macro F1 | Dismiss-on-malicious | Over-react |
	| --- | ---: | ---: | ---: | ---: |
	| `always_dismiss` (floor) | 0.13 | 0.05 | 1.00 | 0.00 |
	| `verifier_oracle` (ceiling) | 1.00 | 1.00 | 0.00 | 0.00 |
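
The floor row can be reproduced with a short sketch of the metrics. The definitions here are assumptions rather than the code in `eval/`: "malicious" is taken to mean any ground truth other than `dismiss`, and macro F1 is averaged over all five triage actions with an undefined per-class F1 counted as 0. The label mix is hypothetical, chosen so that 26 of 200 incidents are benign, which is what an always-dismiss accuracy of 0.13 implies.

```python
LABELS = ("dismiss", "monitor", "quarantine_host", "block_ip", "escalate")


def accuracy(truth: list[str], pred: list[str]) -> float:
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def macro_f1(truth: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the five triage actions."""
    scores = []
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(truth, pred))
        fp = sum(t != label and p == label for t, p in zip(truth, pred))
        fn = sum(t == label and p != label for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(LABELS)


def dismiss_on_malicious(truth: list[str], pred: list[str]) -> float:
    """Share of non-benign incidents that the model dismissed."""
    malicious = [p for t, p in zip(truth, pred) if t != "dismiss"]
    return sum(p == "dismiss" for p in malicious) / len(malicious) if malicious else 0.0


# Hypothetical 200-incident label mix with 26 benign incidents (illustration only).
truth = (["dismiss"] * 26 + ["monitor"] * 44 + ["quarantine_host"] * 44
         + ["block_ip"] * 43 + ["escalate"] * 43)
pred = ["dismiss"] * len(truth)  # the always_dismiss baseline
print(round(accuracy(truth, pred), 2), round(macro_f1(truth, pred), 2),
      dismiss_on_malicious(truth, pred))  # prints: 0.13 0.05 1.0
```

Only the `dismiss` class has a non-zero F1 (52/226 ≈ 0.23), so averaging over five classes gives about 0.05, matching the floor row.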

	## Deploy to Hugging Face Spaces

	Full recipe: [DEPLOY.md](DEPLOY.md). The fast version, after `huggingface-cli login`:

	```bash
	export HF_USER=<your-username>
	bash scripts/deploy_to_hf.sh
	# Build takes ~5 minutes; then:
	open https://${HF_USER}-opensoc-env.hf.space/demo
	```

	The Space runs FastAPI + Gradio in a single container. `/reset`, `/step`, `/state`, `/grade`, `/tasks`, `/health` continue to work for the OpenEnv judge bot; `/demo` is the human-readable UI.

	## Repo map

	| File / dir | Purpose |
	| --- | --- |
	| `openenv.yaml` | OpenEnv manifest (tasks, action space, reward range, endpoints) |
	| `schema.py` | Incident / event / action schema with strict validators |
	| `generator.py` | Materializes incidents for `defender_only` mode (eval, SFT) |
	| `verifier.py` | Deterministic ground-truth labeler + plausibility checker |
	| `rubric.py` | Layered defender + attacker reward functions |
	| `env.py` | Two-role `OpenSOCEnv` (`reset` / `step` / `state` / `grade`) |
	| `app_runtime.py` | FastAPI app exposing the OpenEnv API |
	| `demo_app.py` | Gradio Blocks app mounted at `/demo` |
	| `demo_data.py` | Pure-python helpers for the demo UI |
	| `server.py` | Container entry point — imports `demo_app` then starts uvicorn |
	| `tasks/registry.py` | Curriculum stages: `stage1_basic` → `stage4_adversarial` |
	| `client/` | Thin HTTP client (server-internals-free) |
	| `train/` | SFT warm-start + GRPO loop + reusable prompt format |
	| `eval/` | Hold-out generator, metrics, eval driver, plot renderers, `bake_demo` |
	| `scripts/run_full_pipeline.sh` | One-shot training + eval + bake-demo |
	| `scripts/deploy_to_hf.sh` | One-shot HF Space push |
	| `docs/` | Blog post, video script, slide deck builder |
	| `tests/` | Pytest suite (93 tests, anti-hack regressions included) |

	## Submission deliverables

	Mapped to the four judging criteria:

	| Criterion | Weight | Where it lives |
	| --- | ---: | --- |
	| Environment Innovation | 40% | `openenv.yaml`, `schema.py`, `verifier.py`, `env.py`, this README's Architecture and Why the reward cannot be hacked sections |
	| Storytelling & Presentation | 30% | `/demo` Gradio UI + 90s video + HF blog |
	| Showing Improvement in Rewards | 20% | `eval/results/*.png` (training curves + confusion + headline bar) embedded above |
	| Reward & Training Pipeline | 10% | `rubric.py` + 93-test anti-hack suite + `train_grpo.ipynb` + `scripts/run_full_pipeline.sh` |

	Submission checklist:

	- [x] OpenEnv-compatible env (gym-style API, manifest, non-reserved tool names)
	- [x] Deterministic RLVR verifier + plausibility checker
	- [x] Layered defender + attacker reward
	- [x] SFT warm-start dataset (committed)
	- [x] Frozen 200-incident hold-out (committed)
	- [x] GRPO curriculum notebook + one-shot training script
	- [x] Eval harness + plotters
	- [x] Pytest suite (93 tests, anti-hack regressions included)
	- [x] Gradio `/demo` UI mounted on the same Space (free-CPU-tier compatible)
	- [x] Blog post (`docs/blog.md`)
	- [x] HF Space pushed and running: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
	- [x] SFT adapter trained and pushed: [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft)
	- [x] GRPO adapters trained and pushed (4 stages): [`stage1`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) [`stage2`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) [`stage3`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) [`stage4`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial)
	- [x] Final adapter pushed: [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo)

	## License

	BSD-3-Clause.