# OpenSOC: Teaching an LLM to Triage Cyberattacks via RLVR Self-Play

*A submission for the OpenEnv Hackathon, April 2026.*

---

Anthropic's new model, Mythos, has reportedly found many bugs in open-source projects. As LLMs improve, so does the risk that someone uses them for malicious attacks. Much of the recent progress in LLMs has come from reinforcement learning. Our goal was a multi-agent setup in which an attacker and a defender learn from each other, so that defenses keep improving and are ready when a real attack arrives.

## Why we built this

By the time a security operations center (SOC) tier-1 analyst sees an
alert, the attacker may have been inside for hours.  Tier-1 triage is
mostly judgement: look at one alert plus the small log window around
it, and decide whether to dismiss, monitor, quarantine, block, or
escalate.  It's also where SOCs are chronically understaffed — a
pipeline that quietly skips real attacks because the human is asleep is
not a hypothetical risk.

LLMs *should* be able to help.  But training one to do this without
poisoning ourselves on the way is hard:

1. SOC datasets are private; published ones get stale within months.
2. Subjective rewards from another LLM let the trainee reward-hack the
   judge, not the world.
3. Self-play between two LLMs (one playing attacker, one defender)
   collapses to a degenerate equilibrium unless the reward is *grounded*.

OpenSOC is our attempt at all three: a synthetic, self-play SOC triage
environment with a **deterministic, schema-driven verifier** as the only
source of reward signal — true Reinforcement Learning with Verifiable
Rewards (RLVR).

## What's in the env

The environment exposes the OpenEnv API (`/reset`, `/step`, `/state`,
`/grade`) plus a Gradio "before vs after" UI at `/demo`:

- **Schema**: a single `schema.py` defines `IncidentParams`, the only
  shape the attacker can produce.  Enums for event types and triage
  actions, structured fields, no free-text "vibes".
- **Attacker**: must emit a structured `craft_incident` action.  Free
  text is allowed but ignored by the verifier.
- **Verifier**: `verifier.compute_ground_truth(params)` is a pure
  function of the structured fields.  Authentication failures from a
  single source IP at >5 attempts/window get labeled `block_ip`,
  malware_exec on a workstation gets `quarantine_host`, etc.  This is
  the RLVR substrate.
- **Plausibility check**: a separate `verifier.check_plausibility(params)`
  rejects gibberish (e.g. malware exec with no malicious indicators),
  and the attacker only earns reward when the incident passes
  plausibility.  This is what neutralises the classic self-play hack
  where the attacker just emits noise to confuse the defender.
- **Defender**: chooses one triage action and cites the `log_id` that
  drove the decision.  The reward layers are: +1.0 for matching the
  verifier, −1.0 for dismiss-on-malicious, −0.3 for over-reacting on
  benign, and a +0.1 bonus for the correct citation.  Dismiss-on-malicious
  is the cardinal failure mode in real SOCs and we grade it accordingly.
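
To make the verifier and the reward layers concrete, here is a minimal sketch, not the code in `schema.py` or `verifier.py`. The field names, the enum values other than `block_ip` and `quarantine_host`, and the fallback to `dismiss` are assumptions made for illustration. The source also does not say how the layers combine, so the sketch assumes the citation bonus is simply added, and that a wrong non-dismiss action on a malicious incident scores 0.0.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(str, Enum):          # hypothetical subset of event types
    AUTH_FAILURE = "auth_failure"
    MALWARE_EXEC = "malware_exec"
    BENIGN_LOGIN = "benign_login"


class Action(str, Enum):             # only block_ip / quarantine_host are named in the post
    DISMISS = "dismiss"
    MONITOR = "monitor"
    QUARANTINE_HOST = "quarantine_host"
    BLOCK_IP = "block_ip"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class IncidentParams:                # hypothetical fields, structured only
    event_type: EventType
    attempts_in_window: int = 0
    distinct_source_ips: int = 1
    host_role: str = "workstation"
    key_log_id: str = "log-0"        # the log line a correct citation must name


def compute_ground_truth(p: IncidentParams) -> Action:
    """Pure function of the structured fields; free text is never consulted."""
    if (p.event_type is EventType.AUTH_FAILURE
            and p.distinct_source_ips == 1
            and p.attempts_in_window > 5):
        return Action.BLOCK_IP
    if p.event_type is EventType.MALWARE_EXEC and p.host_role == "workstation":
        return Action.QUARANTINE_HOST
    return Action.DISMISS            # assumption: everything else is benign


def defender_reward(p: IncidentParams, action: Action, cited_log_id: str) -> float:
    truth = compute_ground_truth(p)
    if action is truth:
        reward = 1.0
    elif action is Action.DISMISS:   # truth was malicious, defender dismissed it
        reward = -1.0
    elif truth is Action.DISMISS:    # truth was benign, defender over-reacted
        reward = -0.3
    else:
        reward = 0.0                 # assumption: wrong action on a malicious incident
    if cited_log_id == p.key_log_id:
        reward += 0.1                # assumption: bonus is additive in every case
    return reward


brute_force = IncidentParams(EventType.AUTH_FAILURE, attempts_in_window=9,
                             key_log_id="log-7")
good = defender_reward(brute_force, Action.BLOCK_IP, "log-7")
missed = defender_reward(brute_force, Action.DISMISS, "log-1")
```

Because `compute_ground_truth` reads nothing but the dataclass fields, the same parameters always produce the same label, which is what lets the reward be replayed and audited.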

## How we trained

We fine-tuned Qwen2.5-3B-Instruct with LoRA (rank 16) on a single NVIDIA L4 GPU:

1. **SFT warm-start** on 600 (alert, log_window → action+citation+rationale)
   gold examples (~12 minutes).  This is just to push P(format-compliant
   response) from 0% to ~95% so GRPO has signal.
2. **GRPO curriculum** for 200 steps × 4 stages (~3 hours):
   - `stage1_basic`: single-event templates (one obvious benign + one
     obvious malicious per category).
   - `stage2_multi`: malicious signal spread over a multi-event window.
   - `stage3_mixed`: benign noise interleaved with malicious events,
     stress-tests false-positive suppression.
   - `stage4_adversarial`: attacker-controlled distribution.
3. **Eval** on a frozen 200-incident hold-out set (seed bands disjoint
   from training, see `tasks/registry.py`).
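
The reason the SFT warm-start matters for step 2 can be seen in GRPO's group-relative advantage, sketched below. This is the generic normalisation (reward minus group mean, divided by group standard deviation), not our training script. The reward values, the `eps` term, the use of the population standard deviation, and the 0.0 score for malformed responses are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each sampled completion's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Cold start: every completion is malformed and (by assumption) scores 0.0,
# so every advantage is zero and the policy gradient carries no signal.
cold = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# After the warm-start: completions parse, rewards differ within the group,
# and the better answers get positive advantages.
warm = group_relative_advantages([1.1, -1.0, 1.0, -0.3])
```

With identical rewards inside a group the update vanishes, which is why format compliance has to be in place before GRPO starts.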

Total compute: ~$3 on HF Jupyter L4.  All scripts are turn-key
(`scripts/run_full_pipeline.sh`) and the same notebook produces a
`data/demo_examples.json` of 50 before-vs-after pairs that the deployed
Space serves on the free CPU tier.

## What's actually new here

Most "self-play LLM" demos either (a) train a judge LLM and call its
score the reward, or (b) hand-code the reward but abandon self-play.
OpenSOC keeps both self-play and a hand-coded, verifiable reward:

- **Self-play** is preserved because the attacker is a real model
  emitting real (structured) parameters.
- **The reward is verifiable** because it's computed from the structured
  parameters, not the attacker's narrative — so the attacker cannot
  reward-hack by writing scary text.

The trick that makes this work is the *plausibility check*: a separate,
deterministic gate on whether the attacker's params even look like a
real incident.  This is what stops the attacker from exploring
adversarial null-spaces.  We tested it with 21 anti-hack regression
tests in `tests/test_rubric.py`.
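
A minimal sketch of how such a gate can sit in front of the attacker's reward follows. It is not the code in `verifier.py`: the pared-down stand-in for `IncidentParams`, the specific rules, and the 1.0 / 0.0 attacker payoffs are assumptions. The only behaviour taken from the description above is that implausible incidents earn the attacker nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentParams:                # hypothetical, trimmed to what the gate needs
    event_type: str
    malicious_indicators: int = 0
    attempts_in_window: int = 0


def check_plausibility(p: IncidentParams) -> bool:
    """Deterministic gate: does this look like an incident that could occur?"""
    if p.event_type == "malware_exec" and p.malicious_indicators == 0:
        return False                 # malware exec with no malicious indicators
    if p.attempts_in_window < 0:
        return False                 # assumed sanity rule for illustration
    return True


def attacker_reward(p: IncidentParams, defender_was_correct: bool) -> float:
    if not check_plausibility(p):
        return 0.0                   # noise never pays, even if the defender is fooled
    return 0.0 if defender_was_correct else 1.0   # assumed payoff values


noise = IncidentParams("malware_exec", malicious_indicators=0)
real = IncidentParams("malware_exec", malicious_indicators=2)
```

Under these assumptions `attacker_reward(noise, defender_was_correct=False)` is 0.0, so confusing the defender with gibberish is never better than crafting a plausible incident.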

## Headline numbers

Training ran GRPO over the 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA (rank 16).
Adapter checkpoints from every stage are published on Hugging Face:

- [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) — SFT warm-start
- [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) through [`stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial)
- [`opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo) — final combined adapter

| Metric                                | `always_dismiss` (floor) | `verifier_oracle` (ceiling) |
| ------------------------------------- | -----------------------: | --------------------------: |
| Accuracy over 200 hold-out            |                    0.13  |                       1.00  |
| Macro F1 over 200 hold-out            |                    0.05  |                       1.00  |
| Dismiss-on-malicious rate             |                    1.00  |                       0.00  |
| Over-react on benign                  |                    0.00  |                       0.00  |
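
The floor column can be sanity-checked by hand. The sketch below computes the four metrics for an always-dismiss policy on a toy hold-out set. The class mix (26 benign incidents out of 200) and the five-way macro average are assumptions chosen to be consistent with the table, not the actual contents of the hold-out set, and the action names other than `block_ip` and `quarantine_host` are hypothetical.

```python
ACTIONS = ["dismiss", "monitor", "quarantine_host", "block_ip", "escalate"]


def evaluate(truth, pred):
    pairs = list(zip(truth, pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1s = []
    for a in ACTIONS:
        tp = sum(t == a and p == a for t, p in pairs)
        fp = sum(t != a and p == a for t, p in pairs)
        fn = sum(t == a and p != a for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)   # empty class scores 0
    macro_f1 = sum(f1s) / len(ACTIONS)

    malicious = [p for t, p in pairs if t != "dismiss"]
    benign = [p for t, p in pairs if t == "dismiss"]
    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "dismiss_on_malicious": sum(p == "dismiss" for p in malicious) / max(len(malicious), 1),
        "over_react_on_benign": sum(p != "dismiss" for p in benign) / max(len(benign), 1),
    }


# Toy hold-out: 26 benign + 174 malicious incidents (assumed mix).
truth = ["dismiss"] * 26 + ["block_ip"] * 58 + ["quarantine_host"] * 58 + ["escalate"] * 58
always_dismiss = ["dismiss"] * len(truth)
floor = evaluate(truth, always_dismiss)
# accuracy 0.13, macro F1 about 0.05, dismiss-on-malicious 1.0, over-react 0.0
```

Only the `dismiss` class has a non-zero F1 here (52 / 226, about 0.23), and averaging it with four zeros gives roughly 0.05, which is why a policy that never raises an alarm still scores above zero on macro F1.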

The diagnostic plots live in `eval/results/`:

- `bar_dismiss_on_malicious.png` — the headline plot.
- `bar_macro_f1.png`
- `confusion_baseline_zero_shot.png` and `confusion_opensoc_grpo.png`
- `training_curves.png` — reward across the four curriculum stages.

## What's next

- Add a *third* role (the "investigator") that actively queries log
  sources rather than receiving a pre-baked log window.
- Train on real SIEM exports (CSE-CIC-IDS, Splunk Boss-of-the-SOC) and
  use the synthetic env only as a curriculum bootstrap.
- Plug the deployed Space into a live SOAR (Tines / Shuffle) and watch
  it triage real-world tier-1 traffic.

## Try it

- HF Space: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
  (`/demo` for the human-readable before-vs-after UI)
- Trained adapter: [`shivam2k3/opensoc-defender-grpo`](https://huggingface.co/shivam2k3/opensoc-defender-grpo)
- Repo: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env)
- Full design notes: see `README.md` in the repo.

---

*Written for the OpenEnv Hackathon, April 2026.  Code under BSD-3.*