| --- |
| language: |
| - en |
| license: bsd-3-clause |
| library_name: peft |
| tags: |
| - grpo |
| - lora |
| - trl |
| - unsloth |
| - openenv |
| - cybersecurity |
| - soc |
| - rlvr |
| - self-play |
| base_model: unsloth/Qwen2.5-3B-Instruct |
| pipeline_tag: text-generation |
| --- |
| |
| # OpenSOC Defender β GRPO-trained LoRA adapter |
|
|
| A **Qwen2.5-3B-Instruct** LoRA adapter (rank 16) trained via GRPO to triage Security Operations Center (SOC) alerts. Built for the [OpenEnv Hackathon, April 2026](https://huggingface.co/spaces/shivam2k3/opensoc-env). |
|
|
| ## Model Description |
|
|
| - **Developed by:** Shivam Sharma |
| - **Model type:** LoRA adapter (PEFT) for causal language model |
| - **Language:** English |
| - **License:** BSD-3-Clause |
| - **Finetuned from:** [`unsloth/Qwen2.5-3B-Instruct`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct) |
|
|
| ## What it does |
|
|
| Given a SIEM alert and a window of structured log events, the model chooses one of five SOC triage actions: |
|
|
| | Action | Meaning | |
| |---|---| |
| | `dismiss` | Benign noise, no action needed | |
| | `monitor` | Suspicious but not actionable yet | |
| | `quarantine_host` | Isolate the endpoint | |
| | `block_ip` | Block the external IP | |
| | `escalate` | Wake a human β blast-radius event | |
|
|
| The model also cites the specific `log_id` that drove its decision, which is verified against the env's ground truth for a +0.1 bonus reward. |
|
|
| ## Training |
|
|
| ### Training Data |
|
|
| - **SFT warm-start:** 600 (alert, log_window β action + citation + rationale) gold examples generated by the OpenSOC environment's deterministic generator across all 4 curriculum stages. |
| - **GRPO curriculum:** Online rollouts against the OpenSOC environment using verifier-grounded rewards. |
| |
| ### Training Procedure |
| |
| 1. **SFT warm-start** (~12 min on L4): Pushes P(format-compliant response) from ~0% to ~95%. |
| 2. **GRPO curriculum** (4 stages Γ 200 steps, ~3h on L4): |
| - `stage1_basic` β single-event, unambiguous templates |
| - `stage2_multi` β multi-event log windows, 1 decoy |
| - `stage3_mixed` β benign decoys interleaved with malicious events, 2 decoys |
| - `stage4_adversarial` β attacker-controlled distribution, 3 decoys |
|
|
| ### Training Hyperparameters |
|
|
| - LoRA rank: 16 |
| - Learning rate (SFT): 2e-4 |
| - Learning rate (GRPO): 5e-6 |
| - GRPO group size (`num_generations`): 8 |
| - Batch size: 2 (with grad_accum=4) |
| - Steps per stage: 200 |
| - Framework: Unsloth + HuggingFace TRL |
| |
| ### Reward Design (RLVR) |
| |
| The reward is computed by a **deterministic verifier** β the ground-truth triage action is derived purely from the structured event parameters, never from any free text. This makes the reward verifiable and reproducible. |
| |
| **Defender reward components:** |
| - +1.0 for matching the verifier's ground-truth action |
| - β1.0 for dismiss-on-malicious (the cardinal SOC failure mode) |
| - β0.3 for over-reacting on benign (containment on noise) |
| - β0.05 for unnecessary escalation |
| - +0.1 bonus for citing the correct triggering log_id |
|
|
| Full rubric: [`rubric.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/rubric.py) |
|
|
| ## Stage Adapters |
|
|
| Each curriculum stage's adapter is published separately: |
|
|
| | Stage | Repo | |
| |---|---| |
| | SFT warm-start | [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) | |
| | Stage 1 (easy) | [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) | |
| | Stage 2 (medium) | [`opensoc-defender-grpo-stage2_multi`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) | |
| | Stage 3 (hard) | [`opensoc-defender-grpo-stage3_mixed`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) | |
| | Stage 4 (adversarial) | [`opensoc-defender-grpo-stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial) | |
|
|
| ## Model Sources |
|
|
| - **Environment:** [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) (HF Space β running) |
| - **Training notebook:** [`train_grpo.ipynb`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/train_grpo.ipynb) |
| - **Verifier source:** [`verifier.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/verifier.py) |
| - **Rubric source:** [`rubric.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/rubric.py) |
| - **Live demo:** [`/demo`](https://shivam2k3-opensoc-env.hf.space/demo) |
|
|
| ## How to Use |
|
|
| ```python |
| from peft import PeftModel |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-3B-Instruct") |
| model = PeftModel.from_pretrained(base, "shivam2k3/opensoc-defender-grpo") |
| tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-Instruct") |
| ``` |
|
|
| ## Compute Infrastructure |
|
|
| - **Hardware:** NVIDIA L4 (24GB) via HuggingFace Jupyter Notebooks |
| - **Training time:** ~3.5 hours total (SFT + GRPO + eval) |
| - **Cost:** ~$3 of HF compute credits |
|
|
| ## Framework Versions |
|
|
| - PEFT 0.19.1 |
| - Transformers (latest) |
| - TRL (latest) |
| - Unsloth (latest) |
|
|