	# OpenSOC: Teaching an LLM to Triage Cyberattacks via RLVR Self-Play

	A submission for the OpenEnv Hackathon, April 2026.

	---

	Anthropic has a new model called Mythos, which is claimed to have found a large number of bugs in open source projects. As LLMs improve, so does the risk that someone will use one for malicious attacks. Much of the recent progress in LLMs has come from reinforcement learning, so our goal was a multi-agent setup in which an attacker and a defender learn from each other. Training this way lets us keep improving defenses and stay prepared for an attack.

	## Why we built this

	By the time a security operations center (SOC) tier-1 analyst sees an
	alert, the attacker may have been inside for hours. Tier-1 triage is
	mostly judgement: look at one alert plus the small log window around
	it, and decide whether to dismiss, monitor, quarantine, block, or
	escalate. It's also where SOCs are chronically understaffed — a
	pipeline that quietly skips real attacks because the human is asleep is
	not a hypothetical risk.

	LLMs should be able to help. But training one to do this without
	poisoning ourselves on the way is hard:

	1. SOC datasets are private; published ones get stale within months.
	2. Subjective rewards from another LLM let the trainee reward-hack the
	judge, not the world.
	3. Self-play between two LLMs (one playing attacker, one defender)
	collapses to a degenerate equilibrium unless the reward is grounded.

	OpenSOC is our attempt at all three: a synthetic, self-play SOC triage
	environment with a deterministic, schema-driven verifier as the only
	source of reward signal — true Reinforcement Learning with Verifiable
	Rewards (RLVR).

	## What's in the env

	The environment exposes the OpenEnv API (`/reset`, `/step`, `/state`,
	`/grade`) plus a Gradio "before vs after" UI at `/demo`:

	- Schema: a single `schema.py` defines `IncidentParams`, the only
	shape the attacker can produce. Enums for event types and triage
	actions, structured fields, no free-text "vibes".
	- Attacker: must emit a structured `craft_incident` action. Free
	text is allowed but ignored by the verifier.
	- Verifier: `verifier.compute_ground_truth(params)` is a pure
	function of the structured fields. Authentication failures from a
	single source IP at >5 attempts/window get labeled `block_ip`,
	`malware_exec` on a workstation gets `quarantine_host`, etc. This is
	the RLVR substrate.
	- Plausibility check: a separate `verifier.check_plausibility(params)`
	rejects gibberish (e.g. malware exec with no malicious indicators),
	and the attacker only earns reward when the incident passes
	plausibility. This is what neutralises the classic self-play hack
	where the attacker just emits noise to confuse the defender.
	- Defender: chooses one triage action and cites the `log_id` that
	drove the decision. Reward layers are: +1.0 for matching the verifier,
	−1.0 for dismiss-on-malicious, −0.3 for over-react on benign, +0.1
	bonus for the correct citation. Dismiss-on-malicious is the cardinal
	failure mode in real SOCs and we grade it accordingly.
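
To make the verifier and the defender's reward layers concrete, here is a minimal sketch. It is not the code in `schema.py` or `verifier.py`: the field names (`attempts_per_window`, `distinct_source_ips`, `host_role`, `key_log_id`), the `defender_reward` helper, and the catch-all `dismiss` label are assumptions made for illustration. Only the two labelling rules and the reward values come from the description above. The sketch also assumes the citation bonus is awarded independently of the action, and that a wrong non-dismiss action on a malicious incident scores 0.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    AUTH_FAILURE = "auth_failure"
    MALWARE_EXEC = "malware_exec"
    BENIGN_LOGIN = "benign_login"  # hypothetical benign category


class TriageAction(Enum):
    DISMISS = "dismiss"
    MONITOR = "monitor"
    QUARANTINE_HOST = "quarantine_host"
    BLOCK_IP = "block_ip"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class IncidentParams:
    """Hypothetical subset of the structured fields an attacker may emit."""
    event_type: EventType
    attempts_per_window: int = 0
    distinct_source_ips: int = 1
    host_role: str = "workstation"
    key_log_id: str = "log-0"  # assumed: the log line carrying the signal


def compute_ground_truth(p: IncidentParams) -> TriageAction:
    """Pure function of the structured fields; free text never reaches it."""
    if (p.event_type is EventType.AUTH_FAILURE
            and p.distinct_source_ips == 1
            and p.attempts_per_window > 5):
        return TriageAction.BLOCK_IP
    if p.event_type is EventType.MALWARE_EXEC and p.host_role == "workstation":
        return TriageAction.QUARANTINE_HOST
    return TriageAction.DISMISS  # assumed catch-all for this sketch


def defender_reward(p: IncidentParams, action: TriageAction,
                    cited_log_id: str) -> float:
    """Layered reward: match, dismiss-on-malicious, over-react, citation."""
    truth = compute_ground_truth(p)
    reward = 0.0
    if action is truth:
        reward += 1.0
    elif action is TriageAction.DISMISS:
        reward -= 1.0  # ground truth was malicious and we dismissed it
    elif truth is TriageAction.DISMISS:
        reward -= 0.3  # over-reaction on a benign incident
    if cited_log_id == p.key_log_id:
        reward += 0.1  # citation bonus
    return reward
```

Because both functions depend only on `IncidentParams`, the same incident always yields the same label and the same reward, which is the property the RLVR setup relies on.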

	## How we trained

	We fine-tuned Qwen2.5-3B-Instruct with LoRA (rank 16) on a single L4 GPU:

	1. SFT warm-start on 600 (alert, log_window → action+citation+rationale)
	gold examples (~12 minutes). This is just to push P(format-compliant
	response) from 0% to ~95% so GRPO has signal.
	2. GRPO curriculum for 200 steps × 4 stages (~3 hours):
	   - `stage1_basic`: single-event templates (one obvious benign + one
	     obvious malicious per category).
	   - `stage2_multi`: malicious signal spread over a multi-event window.
	   - `stage3_mixed`: benign noise interleaved with malicious events,
	     stress-tests false-positive suppression.
	   - `stage4_adversarial`: attacker-controlled distribution.
	3. Eval on a frozen 200-incident hold-out set (seed bands disjoint
	from training, see `tasks/registry.py`).
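
The reason the SFT warm-start matters for step 2 can be seen in how GRPO turns rewards into a learning signal: each sampled response is scored relative to the other responses sampled for the same prompt. The sketch below shows that normalisation only. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration, not details taken from our training scripts.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise the rewards of one prompt's sampled group.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every sample fails (for example, none is format-compliant)
# has identical rewards, so every advantage is zero and nothing is learned.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with mixed outcomes produces positive and negative advantages.
mixed = group_relative_advantages([1.1, -1.0, 0.0, 1.1])
```

If almost no responses are format-compliant, most groups look like `flat`, which is why pushing compliance up before GRPO gives the curriculum something to optimise.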

	Total compute: ~$3 on HF Jupyter L4. All scripts are turn-key
	(`scripts/run_full_pipeline.sh`) and the same notebook produces a
	`data/demo_examples.json` of 50 before-vs-after pairs that the deployed
	Space serves on the free CPU tier.

	## What's actually new here

	Most "self-play LLM" demos either (a) train a judge LLM and call its
	score the reward, or (b) hand-code the reward but abandon self-play.
	OpenSOC keeps both:

	- Self-play is preserved because the attacker is a real model
	emitting real (structured) parameters.
	- The reward is verifiable because it's computed from the structured
	parameters, not the attacker's narrative — so the attacker cannot
	reward-hack by writing scary text.

	The trick that makes this work is the plausibility check: a separate,
	deterministic gate on whether the attacker's params even look like a
	real incident. This is what stops the attacker from exploring
	adversarial null-spaces. We tested it with 21 anti-hack regression
	tests in `tests/test_rubric.py`.
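
A minimal sketch of how such a gate can sit in front of the attacker's reward is shown below. The field names, the second sanity rule, and the choice to pay the attacker 1.0 only when the defender is wrong are assumptions for illustration. The real `verifier.check_plausibility` and the attacker's actual reward shaping may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentParams:
    """Hypothetical, reduced view of the attacker's structured output."""
    event_type: str
    malicious_indicators: int
    attempts_per_window: int = 0


def check_plausibility(p: IncidentParams) -> bool:
    """Deterministic gate: reject params that cannot describe a real incident."""
    # Example from the text: malware execution with no malicious indicators.
    if p.event_type == "malware_exec" and p.malicious_indicators == 0:
        return False
    # Assumed extra sanity rule: counts cannot be negative.
    if p.attempts_per_window < 0 or p.malicious_indicators < 0:
        return False
    return True


def attacker_reward(p: IncidentParams, defender_was_correct: bool) -> float:
    """The attacker is paid only for plausible incidents that fool the defender."""
    if not check_plausibility(p):
        return 0.0  # noise earns nothing, whatever the defender did
    return 0.0 if defender_was_correct else 1.0
```

With the gate in place, an implausible incident pays nothing even when it confuses the defender, so emitting noise is never the attacker's best strategy.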

	## Headline numbers

	Training was completed using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA (rank 16).
	All adapter checkpoints from each stage are published on HuggingFace:

	- [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) — SFT warm-start
	- [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) through [`stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial)
	- [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) — final combined adapter

	The two reference policies in the table bracket the metric range on the hold-out set: `always_dismiss` is the floor and `verifier_oracle` is the ceiling.

	| Metric | `always_dismiss` (floor) | `verifier_oracle` (ceiling) |
	| ------------------------------------- | -----------------------: | --------------------------: |
	| Accuracy over 200 hold-out | 0.13 | 1.00 |
	| Macro F1 over 200 hold-out | 0.05 | 1.00 |
	| Dismiss-on-malicious rate | 1.00 | 0.00 |
	| Over-react on benign | 0.00 | 0.00 |
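
For readers who want to see how metrics like these are defined, here is a small sketch that computes them from lists of ground-truth and predicted actions. It is not our eval script: the function names are hypothetical, and it assumes that "benign" means a ground-truth label of `dismiss` and "malicious" means any other label. The four-incident data set is a toy example, not the hold-out set.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def dismiss_on_malicious_rate(y_true, y_pred):
    """Share of malicious incidents (truth != dismiss) that were dismissed."""
    malicious = [(t, p) for t, p in zip(y_true, y_pred) if t != "dismiss"]
    if not malicious:
        return 0.0
    return sum(p == "dismiss" for _, p in malicious) / len(malicious)


def over_react_on_benign_rate(y_true, y_pred):
    """Share of benign incidents (truth == dismiss) that got any other action."""
    benign = [(t, p) for t, p in zip(y_true, y_pred) if t == "dismiss"]
    if not benign:
        return 0.0
    return sum(p != "dismiss" for _, p in benign) / len(benign)


# Toy data: one benign incident and three malicious ones.
y_true = ["dismiss", "block_ip", "quarantine_host", "escalate"]
always_dismiss = ["dismiss"] * len(y_true)
oracle = list(y_true)
```

On this toy set the `always_dismiss` policy is right only on the single benign incident and dismisses every malicious one, which mirrors the shape of the floor column: low accuracy, very low macro F1, a dismiss-on-malicious rate of 1.0, and no over-reaction.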

	The four diagnostic plots live in `eval/results/`:

	- `bar_dismiss_on_malicious.png` — the headline plot.
	- `bar_macro_f1.png`
	- `confusion_baseline_zero_shot.png` and `confusion_opensoc_grpo.png`
	- `training_curves.png` — reward across the four curriculum stages.

	## What's next

	- Add a third role (the "investigator") that actively queries log
	sources rather than receiving a pre-baked log window.
	- Train on real SIEM exports (CSE-CIC-IDS, Splunk Boss-of-the-SOC) and
	use the synthetic env only as a curriculum bootstrap.
	- Plug the deployed Space into a live SOAR (Tines / Shuffle) and watch
	it triage real-world tier-1 traffic.

	## Try it

	- HF Space: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
	(`/demo` for the human-readable before-vs-after UI)
	- Trained adapter: [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo)
	- Repo: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
	- Full design notes: see `README.md` in the repo.

	---

	Written for the OpenEnv Hackathon, April 2026. Code under BSD-3.