| # OpenSOC: Teaching an LLM to Triage Cyberattacks via RLVR Self-Play |
|
|
| *A submission for the OpenEnv Hackathon, April 2026.* |
|
|
| --- |
|
|
| Anthropic has a new model called Mythos, which is claimed to have found many bugs in open-source projects. As LLMs improve, so does the risk that someone uses them for malicious attacks. Many recent LLM improvements have also come from reinforcement learning. Our goal was a multi-agent setup in which an attacker and a defender learn from each other, so that defenses keep improving and stay prepared for an attack. |
|
|
| ## Why we built this |
|
|
| By the time a security operations center (SOC) tier-1 analyst sees an |
| alert, the attacker may have been inside for hours. Tier-1 triage is |
| mostly judgement: look at one alert plus the small log window around |
| it, and decide whether to dismiss, monitor, quarantine, block, or |
| escalate. It's also where SOCs are chronically understaffed: a |
| pipeline that quietly skips real attacks because the human is asleep is |
| not a hypothetical risk. |
|
|
| LLMs *should* be able to help. But training one to do this without |
| poisoning ourselves on the way is hard: |
|
|
| 1. SOC datasets are private; published ones get stale within months. |
| 2. Subjective rewards from another LLM let the trainee reward-hack the |
| judge, not the world. |
| 3. Self-play between two LLMs (one playing attacker, one defender) |
| collapses to a degenerate equilibrium unless the reward is *grounded*. |
|
|
| OpenSOC is our attempt to address all three: a synthetic, self-play SOC triage |
| environment with a **deterministic, schema-driven verifier** as the only |
| source of reward signal, i.e. true Reinforcement Learning with Verifiable |
| Rewards (RLVR). |
|
|
| ## What's in the env |
|
|
| The environment exposes the OpenEnv API (`/reset`, `/step`, `/state`, |
| `/grade`) plus a Gradio "before vs after" UI at `/demo`: |
|
|
| - **Schema**: a single `schema.py` defines `IncidentParams`, the only |
| shape the attacker can produce. Enums for event types and triage |
| actions, structured fields, no free-text "vibes". |
| - **Attacker**: must emit a structured `craft_incident` action. Free |
| text is allowed but ignored by the verifier. |
| - **Verifier**: `verifier.compute_ground_truth(params)` is a pure |
| function of the structured fields. Authentication failures from a |
| single source IP at >5 attempts/window get labeled `block_ip`, |
| `malware_exec` on a workstation gets `quarantine_host`, etc. This is |
| the RLVR substrate. |
| - **Plausibility check**: a separate `verifier.check_plausibility(params)` |
| rejects gibberish (e.g. malware exec with no malicious indicators), |
| and the attacker only earns reward when the incident passes |
| plausibility. This is what neutralises the classic self-play hack |
| where the attacker just emits noise to confuse the defender. |
| - **Defender**: chooses one triage action and cites the `log_id` that |
| drove the decision. The reward layers are: +1.0 for matching the verifier, |
| -1.0 for dismiss-on-malicious, -0.3 for over-reacting on benign, and a +0.1 |
| bonus for the correct citation. Dismiss-on-malicious is the cardinal |
| failure mode in real SOCs and we grade it accordingly. |
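
The verifier and the defender's reward layers described above can be sketched in a few lines. This is a toy illustration, not the code in `verifier.py`: `ToyIncident`, `toy_ground_truth`, `defender_reward`, the field names, and every action name other than `block_ip` and `quarantine_host` are assumptions made for the example. The post also does not say how a wrong non-dismiss action on a malicious incident is scored, or whether the citation bonus stacks with a wrong action; the sketch assumes 0 and "yes" respectively.

```python
from dataclasses import dataclass
from enum import Enum


class Action(str, Enum):
    # Only block_ip and quarantine_host are named in the post; the rest are assumed.
    DISMISS = "dismiss"
    MONITOR = "monitor"
    QUARANTINE_HOST = "quarantine_host"
    BLOCK_IP = "block_ip"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class ToyIncident:
    """Hypothetical, much smaller stand-in for IncidentParams."""
    event_type: str                  # assumed values: "auth_failure", "malware_exec"
    attempts_in_window: int = 0
    asset_type: str = "workstation"
    evidence_log_id: str = ""        # the log line that justifies the label


def toy_ground_truth(p: ToyIncident) -> Action:
    """Pure function of structured fields; covers only the two rules the post names."""
    if p.event_type == "auth_failure" and p.attempts_in_window > 5:
        return Action.BLOCK_IP
    if p.event_type == "malware_exec" and p.asset_type == "workstation":
        return Action.QUARANTINE_HOST
    return Action.DISMISS            # assumption: everything else is benign


def defender_reward(p: ToyIncident, action: Action, cited_log_id: str) -> float:
    truth = toy_ground_truth(p)
    reward = 0.0
    if action == truth:
        reward += 1.0                # matches the verifier
    elif action == Action.DISMISS:
        reward -= 1.0                # dismissed a malicious incident
    elif truth == Action.DISMISS:
        reward -= 0.3                # over-reacted to a benign incident
    # Assumption: any other mismatch (wrong action on a malicious incident) scores 0.
    if cited_log_id == p.evidence_log_id:
        reward += 0.1                # assumption: the citation bonus is additive
    return reward


brute_force = ToyIncident("auth_failure", attempts_in_window=9, evidence_log_id="log-17")
print(defender_reward(brute_force, Action.BLOCK_IP, "log-17"))
print(defender_reward(brute_force, Action.DISMISS, "log-03"))
```

Because the label depends only on structured fields, the same incident always yields the same reward, whatever free text the attacker attaches.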
| |
| ## How we trained |
| |
| We trained Qwen2.5-3B-Instruct with LoRA (rank 16) on a single L4 GPU: |
| |
| 1. **SFT warm-start** on 600 (alert, log_window → action+citation+rationale) |
| gold examples (~12 minutes). This is just to push P(format-compliant |
| response) from 0% to ~95% so GRPO has signal. |
| 2. **GRPO curriculum** for 200 steps × 4 stages (~3 hours): |
| - `stage1_basic`: single-event templates (one obvious benign + one |
| obvious malicious per category). |
| - `stage2_multi`: malicious signal spread over a multi-event window. |
| - `stage3_mixed`: benign noise interleaved with malicious events, |
| stress-tests false-positive suppression. |
| - `stage4_adversarial`: attacker-controlled distribution. |
| 3. **Eval** on a frozen 200-incident hold-out set (seed bands disjoint |
| from training, see `tasks/registry.py`). |
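
To see why the SFT warm-start matters for GRPO, it helps to look at the group-relative advantage used in the usual GRPO formulation: each sampled completion's reward is normalised against the mean and standard deviation of its own group. The sketch below is a minimal pure-Python version; `group_relative_advantages` is a hypothetical helper, the reward of 0.0 for a malformed response is an assumption, and whether the training script uses exactly this normalisation is also an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each completion's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Before SFT: every completion in the group is malformed and scores the same
# (assumed here to be 0.0), so every advantage is zero.
cold_start = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# After SFT: most completions parse, the verifier separates them,
# and the advantages carry a usable signal.
warm_start = group_relative_advantages([1.1, -1.0, 1.0, 0.0])

print(cold_start)
print(warm_start)
```

When every reward in a group is identical, all advantages are zero and the policy update carries no information, which is the situation the warm-start is meant to avoid.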
| |
| Total compute: ~$3 on HF Jupyter L4. All scripts are turn-key |
| (`scripts/run_full_pipeline.sh`) and the same notebook produces a |
| `data/demo_examples.json` of 50 before-vs-after pairs that the deployed |
| Space serves on the free CPU tier. |
|
|
| ## What's actually new here |
|
|
| Most "self-play LLM" demos either (a) train a judge LLM and call its |
| score the reward, or (b) hand-code the reward but abandon self-play. |
| OpenSOC keeps both: |
|
|
| - **Self-play** is preserved because the attacker is a real model |
| emitting real (structured) parameters. |
| - **The reward is verifiable** because it's computed from the structured |
| parameters, not the attacker's narrative, so the attacker cannot |
| reward-hack by writing scary text. |
|
|
| The trick that makes this work is the *plausibility check*: a separate, |
| deterministic gate on whether the attacker's params even look like a |
| real incident. This is what stops the attacker from exploring |
| adversarial null-spaces. We tested it with 21 anti-hack regression |
| tests in `tests/test_rubric.py`. |
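
A minimal sketch of how such a gate can sit in front of the attacker's reward is shown below. It is illustrative only: `ToyIncident`, `toy_check_plausibility`, and `attacker_reward` are hypothetical names, the single rejection rule is the one example the post gives, and the 1.0 / 0.0 payoff for fooling the defender is an assumption (the post does not state the attacker's reward values).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyIncident:
    """Hypothetical stand-in for IncidentParams."""
    event_type: str
    malicious_indicators: tuple = ()   # e.g. ("known_bad_hash",)


def toy_check_plausibility(p: ToyIncident) -> bool:
    # The one rejection rule named in the post: malware exec with no indicators.
    if p.event_type == "malware_exec" and not p.malicious_indicators:
        return False
    return True


def attacker_reward(p: ToyIncident, defender_was_correct: bool) -> float:
    if not toy_check_plausibility(p):
        return 0.0                     # implausible incidents never pay out
    # Assumed payoff: the attacker scores only when the defender gets it wrong.
    return 0.0 if defender_was_correct else 1.0


noise = ToyIncident("malware_exec")
real = ToyIncident("malware_exec", ("known_bad_hash",))
print(attacker_reward(noise, defender_was_correct=False))
print(attacker_reward(real, defender_was_correct=False))
```

The gate runs before the defender's outcome is even considered, so confusing the defender with noise is never a profitable strategy.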
|
|
| ## Headline numbers |
|
|
| Training was completed using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA (rank 16). |
| All adapter checkpoints from each stage are published on HuggingFace: |
|
|
| - [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft): SFT warm-start |
| - [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) through [`stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial) |
| - [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo): final combined adapter |
|
|
| | Metric | `always_dismiss` (floor) | `verifier_oracle` (ceiling) | |
| | ------------------------------------- | -----------------------: | --------------------------: | |
| | Accuracy over 200 hold-out | 0.13 | 1.00 | |
| | Macro F1 over 200 hold-out | 0.05 | 1.00 | |
| | Dismiss-on-malicious rate | 1.00 | 0.00 | |
| | Over-react on benign | 0.00 | 0.00 | |
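
The floor column can be reproduced by hand, which is a useful sanity check on how the four metrics are defined. The sketch below rests on several assumptions: five action classes (taken from the dismiss / monitor / quarantine / block / escalate list earlier in the post, with assumed label names), "benign" meaning a ground truth of `dismiss`, and 26 benign incidents out of 200 (inferred from the 0.13 accuracy of `always_dismiss`). The split of the remaining 174 incidents is made up and does not affect the result. `triage_metrics` is a hypothetical helper, not the repo's eval code.

```python
ACTIONS = ["dismiss", "monitor", "quarantine_host", "block_ip", "escalate"]


def triage_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    # Macro F1: unweighted mean of per-class F1; a class with no hits scores 0.
    f1_scores = []
    for label in ACTIONS:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(ACTIONS)

    malicious = [p for t, p in pairs if t != "dismiss"]
    benign = [p for t, p in pairs if t == "dismiss"]
    dismiss_on_malicious = (
        sum(p == "dismiss" for p in malicious) / len(malicious) if malicious else 0.0
    )
    over_react = sum(p != "dismiss" for p in benign) / len(benign) if benign else 0.0
    return accuracy, macro_f1, dismiss_on_malicious, over_react


# Assumed hold-out composition: 26 benign, 174 malicious (split is arbitrary).
y_true = (["dismiss"] * 26 + ["monitor"] * 44 + ["quarantine_host"] * 44
          + ["block_ip"] * 43 + ["escalate"] * 43)
always_dismiss = ["dismiss"] * len(y_true)

acc, macro_f1, dom, over = triage_metrics(y_true, always_dismiss)
print(round(acc, 2), round(macro_f1, 2), dom, over)
```

Under these assumptions the only non-zero per-class F1 is for `dismiss` (precision 0.13, recall 1.0, F1 ≈ 0.23), and averaging over five classes gives roughly 0.05, in line with the table. Feeding `y_true` in as its own prediction reproduces the ceiling column.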
|
|
| The four diagnostic plots live in `eval/results/`: |
|
|
| - `bar_dismiss_on_malicious.png`: the headline plot. |
| - `bar_macro_f1.png` |
| - `confusion_baseline_zero_shot.png` and `confusion_opensoc_grpo.png` |
| - `training_curves.png`: reward across the four curriculum stages. |
|
|
| ## What's next |
|
|
| - Add a *third* role (the "investigator") that actively queries log |
| sources rather than receiving a pre-baked log window. |
| - Train on real SIEM exports (CSE-CIC-IDS, Splunk Boss-of-the-SOC) and |
| use the synthetic env only as a curriculum bootstrap. |
| - Plug the deployed Space into a live SOAR (Tines / Shuffle) and watch |
| it triage real-world tier-1 traffic. |
|
|
| ## Try it |
|
|
| - HF Space: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) |
| (`/demo` for the human-readable before-vs-after UI) |
| - Trained adapter: [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) |
| - Repo: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) |
| - Full design notes: see `README.md` in the repo. |
|
|
| --- |
|
|
| *Written for the OpenEnv Hackathon, April 2026. Code under BSD-3.* |
|
|