# OpenSOC: Teaching an LLM to Triage Cyberattacks via RLVR Self-Play
*A submission for the OpenEnv Hackathon, April 2026.*
---
Anthropic's new model, Mythos, is reported to have found a large number of bugs in open source projects. As LLMs get better at this kind of work, the risk that someone turns them to malicious use grows with them. Much of the recent progress in LLMs has also come from reinforcement learning. Our goal was a multi-agent setup in which an attacker and a defender learn from each other, so that defences improve and we stay prepared for a real attack.
## Why we built this
By the time a security operations center (SOC) tier-1 analyst sees an
alert, the attacker may have been inside for hours. Tier-1 triage is
mostly judgement: look at one alert plus the small log window around
it, and decide whether to dismiss, monitor, quarantine, block, or
escalate. It's also where SOCs are chronically understaffed: a
pipeline that quietly skips real attacks because the human is asleep is
not a hypothetical risk.
LLMs *should* be able to help. But training one for this job without
fooling ourselves along the way is hard:
1. SOC datasets are private; published ones get stale within months.
2. Subjective rewards from another LLM let the trainee reward-hack the
judge, not the world.
3. Self-play between two LLMs (one playing attacker, one defender)
collapses to a degenerate equilibrium unless the reward is *grounded*.
OpenSOC is our attempt at all three: a synthetic, self-play SOC triage
environment with a **deterministic, schema-driven verifier** as the only
source of reward signal, that is, true Reinforcement Learning with Verifiable
Rewards (RLVR).
## What's in the env
The environment exposes the OpenEnv API (`/reset`, `/step`, `/state`,
`/grade`) plus a Gradio "before vs after" UI at `/demo`:
- **Schema**: a single `schema.py` defines `IncidentParams`, the only
shape the attacker can produce. Enums for event types and triage
actions, structured fields, no free-text "vibes".
- **Attacker**: must emit a structured `craft_incident` action. Free
text is allowed but ignored by the verifier.
- **Verifier**: `verifier.compute_ground_truth(params)` is a pure
function of the structured fields. Authentication failures from a
single source IP at >5 attempts/window get labeled `block_ip`,
`malware_exec` on a workstation gets `quarantine_host`, etc. This is
the RLVR substrate.
- **Plausibility check**: a separate `verifier.check_plausibility(params)`
rejects gibberish (e.g. malware exec with no malicious indicators),
and the attacker only earns reward when the incident passes
plausibility. This is what neutralises the classic self-play hack
where the attacker just emits noise to confuse the defender.
- **Defender**: chooses one triage action and cites the `log_id` that
drove the decision. Reward layers are: +1.0 for matching the verifier,
−1.0 for dismiss-on-malicious, −0.3 for over-reacting on benign, +0.1
bonus for the correct citation. Dismiss-on-malicious is the cardinal
failure mode in real SOCs and we grade it accordingly.
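
To make the verifier and the defender reward concrete, here is a minimal sketch of the two rules and the reward layers described above. It is not the code in the repo: the field names on `IncidentParams`, the enum members other than `block_ip` and `quarantine_host`, the fallback to `dismiss`, and the way the reward layers combine are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    AUTH_FAILURE = "auth_failure"
    MALWARE_EXEC = "malware_exec"
    BENIGN_LOGIN = "benign_login"  # hypothetical benign category


class TriageAction(Enum):
    DISMISS = "dismiss"
    MONITOR = "monitor"
    QUARANTINE_HOST = "quarantine_host"
    BLOCK_IP = "block_ip"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class IncidentParams:
    # Hypothetical fields; the real schema.py may look different.
    event_type: EventType
    attempts_in_window: int = 0
    single_source_ip: bool = False
    host_is_workstation: bool = False
    key_log_id: str = ""  # the log line that justifies the label


def compute_ground_truth(p: IncidentParams) -> TriageAction:
    """Pure function of the structured fields: same params, same label."""
    if (p.event_type is EventType.AUTH_FAILURE
            and p.single_source_ip
            and p.attempts_in_window > 5):
        return TriageAction.BLOCK_IP
    if p.event_type is EventType.MALWARE_EXEC and p.host_is_workstation:
        return TriageAction.QUARANTINE_HOST
    # Assumption: anything not matched by a rule is treated as benign.
    return TriageAction.DISMISS


def defender_reward(p: IncidentParams, action: TriageAction,
                    cited_log_id: str) -> float:
    """Layered reward; how the layers combine is an assumption."""
    truth = compute_ground_truth(p)
    reward = 0.0
    if action is truth:
        reward += 1.0   # matches the verifier
    elif action is TriageAction.DISMISS:
        reward -= 1.0   # dismissed a malicious incident
    elif truth is TriageAction.DISMISS:
        reward -= 0.3   # over-reacted on a benign incident
    if cited_log_id and cited_log_id == p.key_log_id:
        reward += 0.1   # correct citation bonus
    return reward
```

Because the reward depends only on `IncidentParams`, nothing the attacker writes in free text can move it.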
## How we trained
We fine-tuned Qwen2.5-3B-Instruct with LoRA (rank 16) on a single L4 GPU:
1. **SFT warm-start** on 600 (alert, log_window → action+citation+rationale)
gold examples (~12 minutes). This is just to push P(format-compliant
response) from 0% to ~95% so GRPO has signal.
2. **GRPO curriculum** for 200 steps × 4 stages (~3 hours):
- `stage1_basic`: single-event templates (one obvious benign + one
obvious malicious per category).
- `stage2_multi`: malicious signal spread over a multi-event window.
- `stage3_mixed`: benign noise interleaved with malicious events,
stress-tests false-positive suppression.
- `stage4_adversarial`: attacker-controlled distribution.
3. **Eval** on a frozen 200-incident hold-out set (seed bands disjoint
from training, see `tasks/registry.py`).
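
The schedule above, and the reason the SFT warm-start matters, can be sketched in a few lines. This is an illustration rather than the training script: it assumes the four stages run back to back for 200 steps each, and it uses the textbook GRPO advantage (each reward standardised within its sampled group), which may differ in detail from what the trainer library does. `stage_for_step` and `group_relative_advantages` are hypothetical helper names.

```python
from statistics import mean, pstdev

STAGES = ["stage1_basic", "stage2_multi", "stage3_mixed", "stage4_adversarial"]
STEPS_PER_STAGE = 200


def stage_for_step(global_step: int) -> str:
    """Map a 0-based global step to its curriculum stage."""
    if not 0 <= global_step < STEPS_PER_STAGE * len(STAGES):
        raise ValueError("step is outside the curriculum")
    return STAGES[global_step // STEPS_PER_STAGE]


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# If every completion in a group gets the same reward (for example, all of
# them fail to parse), every advantage is zero and the policy gets no signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

# Once some completions are format-compliant, rewards differ within the
# group and the better ones receive positive advantages.
print(group_relative_advantages([1.1, -1.0, 0.0, 1.0]))
```

A group where all rewards tie contributes nothing to the update, which is why raising format compliance before GRPO is worth the short SFT pass.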
Total compute: ~$3 on HF Jupyter L4. All scripts are turn-key
(`scripts/run_full_pipeline.sh`) and the same notebook produces a
`data/demo_examples.json` of 50 before-vs-after pairs that the deployed
Space serves on the free CPU tier.
## What's actually new here
Most "self-play LLM" demos either (a) train a judge LLM and call its
score the reward, or (b) hand-code the reward but abandon self-play.
OpenSOC keeps both:
- **Self-play** is preserved because the attacker is a real model
emitting real (structured) parameters.
- **The reward is verifiable** because it's computed from the structured
parameters, not the attacker's narrative, so the attacker cannot
reward-hack by writing scary text.
The trick that makes this work is the *plausibility check*: a separate,
deterministic gate on whether the attacker's params even look like a
real incident. This is what stops the attacker from exploring
adversarial null-spaces. We tested it with 21 anti-hack regression
tests in `tests/test_rubric.py`.
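
As a rough sketch of that gate: the attacker's reward is zeroed whenever the incident fails plausibility, so emitting noise never pays. The cut-down `IncidentParams` stand-in, the single plausibility rule (taken from the "malware exec with no malicious indicators" example), and the 1.0 payout for fooling the defender are assumptions for illustration, not the repo's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentParams:
    # Cut-down, hypothetical stand-in for the real schema.
    event_type: str
    malicious_indicators: int


def check_plausibility(p: IncidentParams) -> bool:
    """Deterministic gate: reject incidents that contradict themselves."""
    if p.event_type == "malware_exec" and p.malicious_indicators == 0:
        return False
    return True


def attacker_reward(p: IncidentParams, defender_was_correct: bool) -> float:
    """Attacker is paid only for plausible incidents the defender misjudges."""
    if not check_plausibility(p):
        return 0.0  # noise earns nothing, whatever the defender did
    return 0.0 if defender_was_correct else 1.0  # payout value is assumed
```

With this shape, the only way for the attacker to earn reward is to craft incidents that are both well-formed and hard to triage.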
## Headline numbers
Training was completed using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA (rank 16).
All adapter checkpoints from each stage are published on HuggingFace:
- [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft): SFT warm-start
- [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) through [`stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial)
- [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo): final combined adapter
| Metric | `always_dismiss` (floor) | `verifier_oracle` (ceiling) |
| ------------------------------------- | -----------------------: | --------------------------: |
| Accuracy over 200 hold-out | 0.13 | 1.00 |
| Macro F1 over 200 hold-out | 0.05 | 1.00 |
| Dismiss-on-malicious rate | 1.00 | 0.00 |
| Over-react on benign | 0.00 | 0.00 |
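
For readers who want to reproduce the table's metrics, the sketch below shows one way to compute them from true and predicted actions. It assumes five action labels (only `block_ip` and `quarantine_host` are spelled out in this post; the other strings are guesses), treats `dismiss` as the benign label, counts any non-dismiss action on a benign incident as an over-reaction, and averages F1 over all five labels with unsupported labels scoring 0. The repo's evaluation code may define these differently.

```python
ACTIONS = ("dismiss", "monitor", "quarantine_host", "block_ip", "escalate")
BENIGN = "dismiss"


def triage_metrics(y_true, y_pred):
    """Accuracy, macro F1 and the two SOC-specific error rates."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in ACTIONS:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    malicious = [p for t, p in pairs if t != BENIGN]
    benign = [p for t, p in pairs if t == BENIGN]
    dismiss_on_malicious = (
        sum(p == BENIGN for p in malicious) / len(malicious) if malicious else 0.0
    )
    over_react_on_benign = (
        sum(p != BENIGN for p in benign) / len(benign) if benign else 0.0
    )
    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "dismiss_on_malicious": dismiss_on_malicious,
        "over_react_on_benign": over_react_on_benign,
    }
```

As a sanity check, an always-dismiss policy on a toy set of 26 benign and 174 malicious incidents gives accuracy 0.13, macro F1 of about 0.05, dismiss-on-malicious 1.00 and over-react 0.00 under these definitions, which is consistent with the floor column above. The 26/174 split is inferred from the 0.13 floor accuracy, not taken from the repo.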
The four diagnostic plots live in `eval/results/`:
- `bar_dismiss_on_malicious.png`: the headline plot.
- `bar_macro_f1.png`
- `confusion_baseline_zero_shot.png` and `confusion_opensoc_grpo.png`
- `training_curves.png`: reward across the four curriculum stages.
## What's next
- Add a *third* role (the "investigator") that actively queries log
sources rather than receiving a pre-baked log window.
- Train on real SIEM exports (CSE-CIC-IDS, Splunk Boss-of-the-SOC) and
use the synthetic env only as a curriculum bootstrap.
- Plug the deployed Space into a live SOAR (Tines / Shuffle) and watch
it triage real-world tier-1 traffic.
## Try it
- HF Space: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
(`/demo` for the human-readable before-vs-after UI)
- Trained adapter: [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo)
- Repo: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
- Full design notes: see `README.md` in the repo.
---
*Written for the OpenEnv Hackathon, April 2026. Code under BSD-3.*