OpenSOC: Teaching an LLM to Triage Cyberattacks via RLVR Self-Play

A submission for the OpenEnv Hackathon, April 2026.

Anthropic has a new model called Mythos, which is claimed to have found a large number of bugs in open source projects. As LLMs improve, so does the threat of someone using them for malicious attacks. Much of the recent improvement in LLMs has come from reinforcement learning, so our goal was a multi-agent setup in which an attacker and a defender learn from each other. The aim is to improve security and stay prepared in case of an attack.

Why we built this

By the time a security operations center (SOC) tier-1 analyst sees an alert, the attacker may have been inside for hours. Tier-1 triage is mostly judgement: look at one alert plus the small log window around it, and decide whether to dismiss, monitor, quarantine, block, or escalate. It's also where SOCs are chronically understaffed — a pipeline that quietly skips real attacks because the human is asleep is not a hypothetical risk.

LLMs should be able to help. But training one to do this without poisoning ourselves along the way is hard, for three reasons:

SOC datasets are private; published ones get stale within months.
Subjective rewards from another LLM let the trainee reward-hack the judge, not the world.
Self-play between two LLMs (one playing attacker, one defender) collapses to a degenerate equilibrium unless the reward is grounded.

OpenSOC is our attempt to address all three: a synthetic, self-play SOC triage environment with a deterministic, schema-driven verifier as the only source of reward signal. That makes it true Reinforcement Learning with Verifiable Rewards (RLVR).

What's in the env

The environment exposes the OpenEnv API (/reset, /step, /state, /grade) plus a Gradio "before vs after" UI at /demo:

Schema: a single schema.py defines IncidentParams, the only shape the attacker can produce. Enums for event types and triage actions, structured fields, no free-text "vibes".
Attacker: must emit a structured craft_incident action. Free text is allowed but ignored by the verifier.
Verifier: verifier.compute_ground_truth(params) is a pure function of the structured fields. For example, authentication failures from a single source IP at more than 5 attempts per window are labeled block_ip, and malware_exec on a workstation is labeled quarantine_host. This is the RLVR substrate.
Plausibility check: a separate verifier.check_plausibility(params) rejects gibberish (e.g. malware exec with no malicious indicators), and the attacker only earns reward when the incident passes plausibility. This is what neutralises the classic self-play hack where the attacker just emits noise to confuse the defender.
Defender: chooses one triage action and cites the log_id that drove the decision. Reward layers are: +1.0 for matching the verifier, −1.0 for dismiss-on-malicious, −0.3 for over-react on benign, +0.1 bonus for the correct citation. Dismiss-on-malicious is the cardinal failure mode in real SOCs and we grade it accordingly.
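
The pieces above can be sketched in a few dozen lines. This is a minimal illustration, not the repo's actual schema.py or verifier.py: the field names, the EventType members other than malware_exec, and the 0.0 reward for a wrong action that is neither a missed attack nor an over-reaction are all assumptions. It also assumes the citation bonus is simply added on top of the action reward.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    AUTH_FAILURE = "auth_failure"   # hypothetical name
    MALWARE_EXEC = "malware_exec"
    BENIGN_LOGIN = "benign_login"   # hypothetical name


class Action(Enum):
    DISMISS = "dismiss"
    MONITOR = "monitor"
    QUARANTINE_HOST = "quarantine_host"
    BLOCK_IP = "block_ip"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class IncidentParams:
    # Hypothetical field set; the real schema is not shown in this post.
    event_type: EventType
    attempts_in_window: int = 0
    host_role: str = "workstation"
    malicious_indicators: int = 0
    key_log_id: str = "log-0"


def compute_ground_truth(p: IncidentParams) -> Action:
    """Pure function of the structured fields: same params, same label."""
    if p.event_type is EventType.AUTH_FAILURE and p.attempts_in_window > 5:
        return Action.BLOCK_IP
    if p.event_type is EventType.MALWARE_EXEC and p.host_role == "workstation":
        return Action.QUARANTINE_HOST
    return Action.DISMISS


def check_plausibility(p: IncidentParams) -> bool:
    """Reject incoherent incidents, e.g. malware exec with no indicators."""
    if p.event_type is EventType.MALWARE_EXEC and p.malicious_indicators == 0:
        return False
    return p.attempts_in_window >= 0


def defender_reward(p: IncidentParams, action: Action, cited_log_id: str) -> float:
    truth = compute_ground_truth(p)
    if action is truth:
        reward = 1.0
    elif action is Action.DISMISS:      # truth was not dismiss: missed attack
        reward = -1.0
    elif truth is Action.DISMISS:       # acted on a benign incident
        reward = -0.3
    else:
        reward = 0.0                    # assumption: not specified in the post
    if cited_log_id == p.key_log_id:    # assumption: bonus is purely additive
        reward += 0.1
    return reward


brute_force = IncidentParams(EventType.AUTH_FAILURE, attempts_in_window=9,
                             key_log_id="log-7")
best_case = defender_reward(brute_force, Action.BLOCK_IP, "log-7")
worst_case = defender_reward(brute_force, Action.DISMISS, "log-1")
```

Because the label depends only on the structured fields, any free text the attacker attaches has no path into the reward.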

How we trained

We trained Qwen2.5-3B-Instruct with LoRA (rank 16) on a single NVIDIA L4 GPU:

SFT warm-start on 600 (alert, log_window → action+citation+rationale) gold examples (~12 minutes). This is just to push P(format-compliant response) from 0% to ~95% so GRPO has signal.
GRPO curriculum for 200 steps × 4 stages (~3 hours):
- stage1_basic: single-event templates (one obvious benign + one obvious malicious per category).
- stage2_multi: malicious signal spread over a multi-event window.
- stage3_mixed: benign noise interleaved with malicious events, stress-tests false-positive suppression.
- stage4_adversarial: attacker-controlled distribution.
Eval on a frozen 200-incident hold-out set (seed bands disjoint from training, see tasks/registry.py).
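
To see why the SFT warm-start matters for GRPO, it helps to look at how group-relative advantages are usually computed. The sketch below uses the common formulation (reward minus group mean, divided by group standard deviation); the exact normalisation in our trainer may differ, and the reward values, including the constant score for malformed responses, are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some trainers use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Cold start: every sampled response is malformed and, by assumption,
# receives the same score, so every advantage is zero and nothing is learned.
cold_start = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Warm start: responses parse, rewards differ, and the group has signal.
warm_start = group_relative_advantages([1.1, -1.0, 1.0, -0.3])
```

If every response in a group scores the same, the policy gradient for that prompt vanishes, which is the situation a 0% format-compliance rate would produce.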

Total compute: ~$3 on HF Jupyter L4. All scripts are turn-key (scripts/run_full_pipeline.sh) and the same notebook produces a data/demo_examples.json of 50 before-vs-after pairs that the deployed Space serves on the free CPU tier.

What's actually new here

Most "self-play LLM" demos either (a) train a judge LLM and call its score the reward, or (b) hand-code the reward but abandon self-play. OpenSOC keeps both self-play and a hand-coded, verifiable reward:

Self-play is preserved because the attacker is a real model emitting real (structured) parameters.
The reward is verifiable because it's computed from the structured parameters, not the attacker's narrative — so the attacker cannot reward-hack by writing scary text.

The trick that makes this work is the plausibility check: a separate, deterministic gate on whether the attacker's params even look like a real incident. This is what stops the attacker from exploring adversarial null-spaces. We tested it with 21 anti-hack regression tests in tests/test_rubric.py.
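
One way to picture the gate is as a hard precondition on the attacker's payoff. The function below is a hypothetical sketch: the post only states that the attacker earns reward when the incident passes plausibility, so the payoff of 1.0 for fooling the defender and 0.0 otherwise is an assumption, as is the episode format.

```python
def attacker_reward(plausible: bool, defender_correct: bool) -> float:
    """Hypothetical attacker payoff, gated on the plausibility check."""
    if not plausible:
        return 0.0  # gate: implausible incidents never pay
    return 0.0 if defender_correct else 1.0  # assumed payoff values


# Each episode is (passed_plausibility, defender_was_correct).
noise_strategy = [(False, False)] * 10              # gibberish that confuses the defender
realistic_strategy = [(True, False)] * 3 + [(True, True)] * 7

noise_total = sum(attacker_reward(p, d) for p, d in noise_strategy)
realistic_total = sum(attacker_reward(p, d) for p, d in realistic_strategy)
```

Under these assumptions, emitting noise earns nothing even when it fools the defender every time, so the only profitable direction for the attacker is harder but still plausible incidents.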

Headline numbers

Training was completed using GRPO with a 4-stage curriculum on Qwen2.5-3B-Instruct with LoRA (rank 16). All adapter checkpoints from each stage are published on HuggingFace:

opensoc-defender-grpo-sft — SFT warm-start
opensoc-defender-grpo-stage1_basic through stage4_adversarial
opensoc-defender-grpo — final combined adapter

Metric	`always_dismiss` (floor)	`verifier_oracle` (ceiling)
Accuracy over 200 hold-out	0.13	1.00
Macro F1 over 200 hold-out	0.05	1.00
Dismiss-on-malicious rate	1.00	0.00
Over-react on benign	0.00	0.00
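
As a sanity check on the floor and ceiling rows, the four metrics can be recomputed from label lists. This is a sketch under stated assumptions: a benign incident is one whose gold label is dismiss, 26 of the 200 hold-out incidents are benign (inferred from the 0.13 accuracy, not taken from the eval code), the split of the remaining 174 across the other four actions is invented, and macro F1 averages over all five actions with F1 = 0 for a class that is never predicted.

```python
ACTIONS = ["dismiss", "monitor", "quarantine_host", "block_ip", "escalate"]


def f1_for(label, gold, pred):
    """Per-class F1, defined as 0.0 when the class has no support or predictions."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def triage_metrics(gold, pred):
    pairs = list(zip(gold, pred))
    malicious = [p for g, p in pairs if g != "dismiss"]
    benign = [p for g, p in pairs if g == "dismiss"]
    return {
        "accuracy": sum(g == p for g, p in pairs) / len(pairs),
        "macro_f1": sum(f1_for(a, gold, pred) for a in ACTIONS) / len(ACTIONS),
        "dismiss_on_malicious": sum(p == "dismiss" for p in malicious) / max(len(malicious), 1),
        "over_react_on_benign": sum(p != "dismiss" for p in benign) / max(len(benign), 1),
    }


# Invented label split: only the 26 benign incidents are inferred from the table.
gold = (["dismiss"] * 26 + ["monitor"] * 44 + ["quarantine_host"] * 43
        + ["block_ip"] * 44 + ["escalate"] * 43)

always_dismiss = triage_metrics(gold, ["dismiss"] * len(gold))
verifier_oracle = triage_metrics(gold, list(gold))
```

With this split, always_dismiss lands at accuracy 0.13 and macro F1 of about 0.046, which rounds to the 0.05 in the table, and the oracle scores 1.00 on both.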

The diagnostic plots live in eval/results/:

bar_dismiss_on_malicious.png — the headline plot.
bar_macro_f1.png
confusion_baseline_zero_shot.png and confusion_opensoc_grpo.png
training_curves.png — reward across the four curriculum stages.

What's next

Add a third role (the "investigator") that actively queries log sources rather than receiving a pre-baked log window.
Train on real SIEM exports (CSE-CIC-IDS, Splunk Boss-of-the-SOC) and use the synthetic env only as a curriculum bootstrap.
Plug the deployed Space into a live SOAR (Tines / Shuffle) and watch it triage real-world tier-1 traffic.

Try it

HF Space: shivam2k3/opensoc-env (/demo for the human-readable before-vs-after UI)
Trained adapter: shivam2k3/opensoc-defender-grpo
Repo: shivam2k3/opensoc-env
Full design notes: see README.md in the repo.

Code is released under the BSD-3 license.