	---
	language:
	- en
	license: bsd-3-clause
	library_name: peft
	tags:
	- grpo
	- lora
	- trl
	- unsloth
	- openenv
	- cybersecurity
	- soc
	- rlvr
	- self-play
	base_model: unsloth/Qwen2.5-3B-Instruct
	pipeline_tag: text-generation
	---

	# OpenSOC Defender — GRPO-trained LoRA adapter

	A Qwen2.5-3B-Instruct LoRA adapter (rank 16) trained via GRPO to triage Security Operations Center (SOC) alerts. Built for the [OpenEnv Hackathon, April 2026](https://huggingface.co/spaces/shivam2k3/opensoc-env).

	## Model Description

	- Developed by: Shivam Sharma
	- Model type: LoRA adapter (PEFT) for causal language model
	- Language: English
	- License: BSD-3-Clause
	- Finetuned from: [`unsloth/Qwen2.5-3B-Instruct`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct)

	## What it does

	Given a SIEM alert and a window of structured log events, the model chooses one of five SOC triage actions:

	| Action | Meaning |
	|---|---|
	| `dismiss` | Benign noise, no action needed |
	| `monitor` | Suspicious but not actionable yet |
	| `quarantine_host` | Isolate the endpoint |
	| `block_ip` | Block the external IP |
	| `escalate` | Wake a human: blast-radius event |

	The model also cites the specific `log_id` that drove its decision. The environment checks this citation against its ground truth and awards a +0.1 bonus reward when it matches.
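
A minimal sketch of how a caller might sanity-check a decision before acting on it. The `TriageDecision` class, its field names, and the sample log IDs are hypothetical and are not the project's schema; only the five action names and the `log_id` citation come from the description above.

```python
from dataclasses import dataclass

# The five triage actions listed in the table above.
ACTIONS = ("dismiss", "monitor", "quarantine_host", "block_ip", "escalate")


@dataclass(frozen=True)
class TriageDecision:
    """Hypothetical container for one model decision."""
    action: str
    cited_log_id: str

    def validate(self, window_log_ids):
        """Return a list of problems; an empty list means the decision is well-formed."""
        problems = []
        if self.action not in ACTIONS:
            problems.append(f"unknown action: {self.action!r}")
        if self.cited_log_id not in window_log_ids:
            problems.append(f"cited log_id not in window: {self.cited_log_id!r}")
        return problems


# Example: a decision that cites an event present in the log window.
decision = TriageDecision(action="block_ip", cited_log_id="evt-003")
print(decision.validate({"evt-001", "evt-002", "evt-003"}))
```

A check like this only confirms that the output is well-formed. Whether the action and citation are correct is decided by the environment's verifier.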

	## Training

	### Training Data

	- SFT warm-start: 600 (alert, log_window → action + citation + rationale) gold examples generated by the OpenSOC environment's deterministic generator across all 4 curriculum stages.
	- GRPO curriculum: Online rollouts against the OpenSOC environment using verifier-grounded rewards.

	### Training Procedure

	1. SFT warm-start (~12 min on L4): Pushes P(format-compliant response) from ~0% to ~95%.
	2. GRPO curriculum (4 stages × 200 steps, ~3h on L4):
	   - `stage1_basic`: single-event, unambiguous templates
	   - `stage2_multi`: multi-event log windows, 1 decoy
	   - `stage3_mixed`: benign decoys interleaved with malicious events, 2 decoys
	   - `stage4_adversarial`: attacker-controlled distribution, 3 decoys
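
The curriculum above can be sketched as a small schedule. This is an illustration, not the training notebook's code: the `Stage` class, `run_curriculum`, and the `train_stage` callback are hypothetical, and the sketch assumes each stage continues from the previous stage's adapter and that `stage1_basic` uses no decoys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    decoys: int       # decoy events per log window
    steps: int = 200  # GRPO steps per stage


# Names, decoy counts, and step counts follow the list above.
# Assumption: stage1_basic has 0 decoys because its windows are single-event.
CURRICULUM = [
    Stage("stage1_basic", decoys=0),
    Stage("stage2_multi", decoys=1),
    Stage("stage3_mixed", decoys=2),
    Stage("stage4_adversarial", decoys=3),
]


def run_curriculum(stages, train_stage, start="sft-warm-start"):
    """Train the stages in order, handing each stage's result to the next."""
    checkpoint = start
    for stage in stages:
        checkpoint = train_stage(stage, checkpoint)
    return checkpoint


# Stand-in trainer that only records the order in which stages ran.
final = run_curriculum(CURRICULUM, lambda stage, ckpt: f"{ckpt} -> {stage.name}")
print(final)
print(sum(stage.steps for stage in CURRICULUM))  # 800 GRPO steps in total
```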

	### Training Hyperparameters

	- LoRA rank: 16
	- Learning rate (SFT): 2e-4
	- Learning rate (GRPO): 5e-6
	- GRPO group size (`num_generations`): 8
	- Batch size: 2, with gradient accumulation over 4 steps (effective batch size 8)
	- Steps per stage: 200
	- Framework: Unsloth + HuggingFace TRL
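
To make the role of the group size concrete, here is a sketch of the group-relative advantage used in the standard GRPO formulation: each of the 8 completions sampled for one prompt is scored, and its advantage is its reward normalised against the rest of its group. The function name, the `eps` value, and the sample rewards are illustrative, and the choice of population standard deviation is an assumption; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some libraries use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 completions for the same prompt (made-up rewards).
group = [1.1, 1.0, -1.0, -0.3, 1.0, 0.0, -0.05, 1.1]
advantages = group_relative_advantages(group)
for reward, advantage in zip(group, advantages):
    print(f"reward={reward:+.2f}  advantage={advantage:+.3f}")
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal. This is one reason a warm-start that already produces some correct answers can help.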

	### Reward Design (RLVR)

	The reward is computed by a deterministic verifier — the ground-truth triage action is derived purely from the structured event parameters, never from any free text. This makes the reward verifiable and reproducible.
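
The property described above can be illustrated with a toy verifier. Everything in this sketch is invented for illustration: the `Event` fields, the rules, and the sample events are hypothetical and do not reproduce `verifier.py`. The point is only that the label depends on structured fields and never reads the free-text message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    log_id: str
    event_type: str    # hypothetical structured field
    external_ip: bool  # hypothetical structured field
    message: str       # free text; the verifier never reads it


def toy_verifier(event):
    """Map structured fields to an action. The rules are made up for this example."""
    if event.event_type == "malware_exec":
        return "quarantine_host"
    if event.event_type == "brute_force" and event.external_ip:
        return "block_ip"
    return "dismiss"


# Two events that differ only in their free-text message get the same label.
a = Event("evt-1", "brute_force", True, "50 failed logins from 203.0.113.7")
b = Event("evt-1", "brute_force", True, "routine maintenance, please ignore")
print(toy_verifier(a), toy_verifier(b))
```

Because the label ignores the message, misleading text placed in a log line cannot change the ground truth that the reward is computed against.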

	Defender reward components:
	- +1.0 for matching the verifier's ground-truth action
	- −1.0 for dismiss-on-malicious (the cardinal SOC failure mode)
	- −0.3 for over-reacting on benign (containment on noise)
	- −0.05 for unnecessary escalation
	- +0.1 bonus for citing the correct triggering log_id

	Full rubric: [`rubric.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/rubric.py)
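
The components listed above could combine as in the following sketch. The linked `rubric.py` is authoritative; this function is a hypothetical reading of the list. In particular, it assumes the penalties are mutually exclusive, that the citation bonus is additive regardless of the action, that "containment" means `quarantine_host` or `block_ip`, and that any other mismatch scores 0.

```python
CONTAINMENT = {"quarantine_host", "block_ip"}


def defender_reward(action, truth, is_malicious, cited_log_id=None, trigger_log_id=None):
    """Sketch of the defender reward; how the terms combine is an assumption."""
    if action == truth:
        reward = 1.0    # matches the verifier's ground-truth action
    elif action == "dismiss" and is_malicious:
        reward = -1.0   # dismiss-on-malicious
    elif action in CONTAINMENT and not is_malicious:
        reward = -0.3   # containment on benign noise
    elif action == "escalate":
        reward = -0.05  # unnecessary escalation
    else:
        reward = 0.0    # assumed: other mismatches carry no penalty
    if trigger_log_id is not None and cited_log_id == trigger_log_id:
        reward += 0.1   # correct citation bonus
    return round(reward, 2)


print(defender_reward("block_ip", "block_ip", True, "evt-7", "evt-7"))
print(defender_reward("dismiss", "block_ip", True))
print(defender_reward("quarantine_host", "dismiss", False))
```

The asymmetry is deliberate in the listed values: missing a real attack costs far more than over-reacting to noise, and escalating unnecessarily is penalised only lightly.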

	## Stage Adapters

	Each curriculum stage's adapter is published separately:

	| Stage | Repo |
	|---|---|
	| SFT warm-start | [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) |
	| Stage 1 (easy) | [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) |
	| Stage 2 (medium) | [`opensoc-defender-grpo-stage2_multi`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) |
	| Stage 3 (hard) | [`opensoc-defender-grpo-stage3_mixed`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) |
	| Stage 4 (adversarial) | [`opensoc-defender-grpo-stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial) |

	## Model Sources

	- Environment: [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) (HF Space — running)
	- Training notebook: [`train_grpo.ipynb`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/train_grpo.ipynb)
	- Verifier source: [`verifier.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/verifier.py)
	- Rubric source: [`rubric.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/rubric.py)
	- Live demo: [`/demo`](https://shivam2k3-opensoc-env.hf.space/demo)

	## How to Use

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	BASE_ID = "unsloth/Qwen2.5-3B-Instruct"

	# Load the base model and its tokenizer, then attach the LoRA adapter.
	tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
	base = AutoModelForCausalLM.from_pretrained(BASE_ID)
	model = PeftModel.from_pretrained(base, "shivam2k3/opensoc-defender-grpo")
	model.eval()  # switch to inference mode
	```

	## Compute Infrastructure

	- Hardware: NVIDIA L4 (24 GB) via Hugging Face Jupyter Notebooks
	- Training time: ~3.5 hours total (SFT + GRPO + eval)
	- Cost: ~$3 of HF compute credits

	## Framework Versions

	- PEFT 0.19.1
	- Transformers (latest)
	- TRL (latest)
	- Unsloth (latest)