# OpenSOC: Teaching an LLM to Triage Cyberattacks via RLVR Self-Play

*A submission for the OpenEnv Hackathon, April 2026.*

---

Anthropic has a new model called Mythos, which is claimed to have found many bugs in open source projects. As LLMs improve, the threat of someone using them for malicious attacks rises sharply. Many recent improvements in LLMs have also come from Reinforcement Learning. Our goal was a multi-agent setup in which an attacker and a defender learn from each other, so that defenses keep improving and are ready when a real attack arrives.

## Why we built this

By the time a security operations center (SOC) tier-1 analyst sees an alert, the attacker may have been inside for hours. Tier-1 triage is mostly judgement: look at one alert plus the small log window around it, and decide whether to dismiss, monitor, quarantine, block, or escalate. It's also where SOCs are chronically understaffed: a pipeline that quietly skips real attacks because the human is asleep is not a hypothetical risk.

LLMs *should* be able to help. But training one to do this without poisoning ourselves on the way is hard:

1. SOC datasets are private; published ones get stale within months.
2. Subjective rewards from another LLM let the trainee reward-hack the judge, not the world.
3. Self-play between two LLMs (one playing attacker, one defender) collapses to a degenerate equilibrium unless the reward is *grounded*.

OpenSOC is our attempt to address all three: a synthetic, self-play SOC triage environment with a **deterministic, schema-driven verifier** as the only source of reward signal, i.e. true Reinforcement Learning with Verifiable Rewards (RLVR).

## What's in the env

The environment exposes the OpenEnv API (`/reset`, `/step`, `/state`, `/grade`) plus a Gradio "before vs after" UI at `/demo`:

- **Schema**: a single `schema.py` defines `IncidentParams`, the only shape the attacker can produce.
  Enums for event types and triage actions, structured fields, no free-text "vibes".
- **Attacker**: must emit a structured `craft_incident` action. Free text is allowed but ignored by the verifier.
- **Verifier**: `verifier.compute_ground_truth(params)` is a pure function of the structured fields. Authentication failures from a single source IP at >5 attempts/window are labeled `block_ip`, `malware_exec` on a workstation is labeled `quarantine_host`, and so on. This is the RLVR substrate.
- **Plausibility check**: a separate `verifier.check_plausibility(params)` rejects gibberish (e.g. malware exec with no malicious indicators), and the attacker only earns reward when the incident passes plausibility. This neutralises the classic self-play hack where the attacker just emits noise to confuse the defender.
- **Defender**: chooses one triage action and cites the `log_id` that drove the decision. The reward layers are: +1.0 for matching the verifier, −1.0 for dismiss-on-malicious, −0.3 for over-reacting on benign, and a +0.1 bonus for the correct citation. Dismiss-on-malicious is the cardinal failure mode in real SOCs and we grade it accordingly.

## How we trained

We trained Qwen2.5-3B-Instruct with LoRA (rank 16) on a single L4:

1. **SFT warm-start** on 600 (alert, log_window → action+citation+rationale) gold examples (~12 minutes). Its only purpose is to push P(format-compliant response) from 0% to ~95% so that GRPO has signal.
2. **GRPO curriculum** for 200 steps × 4 stages (~3 hours):
   - `stage1_basic`: single-event templates (one obvious benign + one obvious malicious per category).
   - `stage2_multi`: malicious signal spread over a multi-event window.
   - `stage3_mixed`: benign noise interleaved with malicious events, which stress-tests false-positive suppression.
   - `stage4_adversarial`: attacker-controlled distribution.
3. **Eval** on a frozen 200-incident hold-out set (seed bands disjoint from training, see `tasks/registry.py`).

Total compute: ~$3 on HF Jupyter L4.
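
The reward fed to GRPO in the stages above is the layered defender reward listed under "What's in the env". The following is a minimal sketch of that layering, not the repository's actual code. The `TriageAction` enum, the `defender_reward` function, and its argument names are hypothetical. Three behaviours that the text does not specify are assumptions: `dismiss` is taken to be the verifier's label for benign incidents, the citation bonus is granted only when the action also matches, and a wrong non-dismiss action on a malicious incident scores 0.

```python
from enum import Enum
from typing import Optional


class TriageAction(str, Enum):
    """Hypothetical action enum; the real one lives in the project's schema."""
    DISMISS = "dismiss"
    MONITOR = "monitor"
    QUARANTINE_HOST = "quarantine_host"
    BLOCK_IP = "block_ip"
    ESCALATE = "escalate"


def defender_reward(
    action: TriageAction,
    cited_log_id: Optional[str],
    truth_action: TriageAction,
    truth_log_id: Optional[str],
) -> float:
    """Layered reward: +1.0 match, -1.0 dismiss-on-malicious,
    -0.3 over-react on benign, +0.1 citation bonus."""
    if action == truth_action:
        reward = 1.0
        # Assumption: the bonus is only available on a correct action.
        if truth_log_id is not None and cited_log_id == truth_log_id:
            reward += 0.1
        return reward

    # From here on the defender disagreed with the verifier.
    if action == TriageAction.DISMISS:
        # Truth cannot be DISMISS here, so the incident is malicious.
        return -1.0
    if truth_action == TriageAction.DISMISS:
        # Assumption: any non-dismiss action on a benign incident is an over-reaction.
        return -0.3
    # Assumption: wrong, but non-dismiss, response to a malicious incident.
    return 0.0
```

Under these assumptions, dismissing an incident that the verifier labels `block_ip` returns −1.0, while escalating an incident that the verifier labels `dismiss` returns −0.3. The asymmetry reflects the grading priority stated above: missing a real attack costs more than a false alarm.
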
All scripts are turn-key (`scripts/run_full_pipeline.sh`), and the same notebook produces a `data/demo_examples.json` file of 50 before-vs-after pairs that the deployed Space serves on the free CPU tier.

## What's actually new here

Most "self-play LLM" demos either (a) train a judge LLM and call its score the reward, or (b) hand-code the reward but abandon self-play. OpenSOC keeps both self-play and a hand-coded reward:

- **Self-play** is preserved because the attacker is a real model emitting real (structured) parameters.
- **The reward is verifiable** because it's computed from the structured parameters, not the attacker's narrative, so the attacker cannot reward-hack by writing scary text.

The trick that makes this work is the *plausibility check*: a separate, deterministic gate on whether the attacker's params even look like a real incident. This is what stops the attacker from exploring adversarial null-spaces. We cover it with 21 anti-hack regression tests in `tests/test_rubric.py`.

## Headline numbers

Training was completed using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA (rank 16).
All adapter checkpoints from each stage are published on Hugging Face:

- [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft): SFT warm-start
- [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) through [`stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial)
- [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo): final combined adapter

| Metric                     | `always_dismiss` (floor) | `verifier_oracle` (ceiling) |
| -------------------------- | -----------------------: | --------------------------: |
| Accuracy over 200 hold-out |                     0.13 |                        1.00 |
| Macro F1 over 200 hold-out |                     0.05 |                        1.00 |
| Dismiss-on-malicious rate  |                     1.00 |                        0.00 |
| Over-react on benign       |                     0.00 |                        0.00 |

The four diagnostic plots live in `eval/results/`:

- `bar_dismiss_on_malicious.png`: the headline plot.
- `bar_macro_f1.png`
- `confusion_baseline_zero_shot.png` and `confusion_opensoc_grpo.png`
- `training_curves.png`: reward across the four curriculum stages.

## What's next

- Add a *third* role (the "investigator") that actively queries log sources rather than receiving a pre-baked log window.
- Train on real SIEM exports (CSE-CIC-IDS, Splunk Boss-of-the-SOC) and use the synthetic env only as a curriculum bootstrap.
- Plug the deployed Space into a live SOAR (Tines / Shuffle) and watch it triage real-world tier-1 traffic.

## Try it

- HF Space: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) (`/demo` for the human-readable before-vs-after UI)
- Trained adapter: [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo)
- Repo: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
- Full design notes: see `README.md` in the repo.

---

*Written for the OpenEnv Hackathon, April 2026. Code under BSD-3.*