---
language:
- en
license: bsd-3-clause
library_name: peft
tags:
- grpo
- lora
- trl
- unsloth
- openenv
- cybersecurity
- soc
- rlvr
- self-play
base_model: unsloth/Qwen2.5-3B-Instruct
pipeline_tag: text-generation
---

# OpenSOC Defender: GRPO-trained LoRA adapter

A **Qwen2.5-3B-Instruct** LoRA adapter (rank 16) trained via GRPO to triage Security Operations Center (SOC) alerts. Built for the [OpenEnv Hackathon, April 2026](https://huggingface.co/spaces/shivam2k3/opensoc-env).

## Model Description

- **Developed by:** Shivam Sharma
- **Model type:** LoRA adapter (PEFT) for causal language model
- **Language:** English
- **License:** BSD-3-Clause
- **Finetuned from:** [`unsloth/Qwen2.5-3B-Instruct`](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct)

## What it does

Given a SIEM alert and a window of structured log events, the model chooses one of five SOC triage actions:

| Action | Meaning |
|---|---|
| `dismiss` | Benign noise, no action needed |
| `monitor` | Suspicious but not actionable yet |
| `quarantine_host` | Isolate the endpoint |
| `block_ip` | Block the external IP |
| `escalate` | Wake a human: blast-radius event |

The model also cites the specific `log_id` that drove its decision, which is verified against the env's ground truth for a +0.1 bonus reward.
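
As a rough illustration of how a response could be checked before it is scored, the sketch below validates an action against the five listed above and extracts the cited `log_id`. It assumes the response is a JSON object with `action` and `log_id` keys; that format, and the names `TriageDecision` and `parse_decision`, are hypothetical. The real output format is defined by the OpenSOC environment and is not shown in this card.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The five triage actions listed in the table above.
VALID_ACTIONS = {"dismiss", "monitor", "quarantine_host", "block_ip", "escalate"}


@dataclass
class TriageDecision:
    action: str
    log_id: Optional[str]  # the cited log event, if the model gave one


def parse_decision(raw: str) -> Optional[TriageDecision]:
    """Return a TriageDecision, or None if the response is not usable.

    ASSUMPTION: the response is a JSON object with "action" and "log_id"
    keys. This is a hypothetical format chosen for illustration only.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    action = data.get("action") if isinstance(data, dict) else None
    if not isinstance(action, str) or action not in VALID_ACTIONS:
        return None
    log_id = data.get("log_id")
    return TriageDecision(action, str(log_id) if log_id is not None else None)
```

Rejecting anything outside the five known actions keeps malformed responses from being scored as if they were valid decisions.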

## Training

### Training Data

- **SFT warm-start:** 600 (alert, log_window → action + citation + rationale) gold examples generated by the OpenSOC environment's deterministic generator across all 4 curriculum stages.
- **GRPO curriculum:** Online rollouts against the OpenSOC environment using verifier-grounded rewards.

### Training Procedure

1. **SFT warm-start** (~12 min on L4): Pushes P(format-compliant response) from ~0% to ~95%.
2. **GRPO curriculum** (4 stages × 200 steps, ~3h on L4):
   - `stage1_basic`: single-event, unambiguous templates
   - `stage2_multi`: multi-event log windows, 1 decoy
   - `stage3_mixed`: benign decoys interleaved with malicious events, 2 decoys
   - `stage4_adversarial`: attacker-controlled distribution, 3 decoys
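
The curriculum above can be pictured as a simple ordered schedule. The following is a minimal sketch, not the training notebook's code: the `Stage` class and `schedule` function are hypothetical names, and the decoy count of 0 for `stage1_basic` is an assumption inferred from "single-event".

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    decoys: int
    steps: int = 200  # 200 GRPO steps per stage, as stated above


CURRICULUM = (
    Stage("stage1_basic", decoys=0),  # ASSUMPTION: single-event means no decoys
    Stage("stage2_multi", decoys=1),
    Stage("stage3_mixed", decoys=2),
    Stage("stage4_adversarial", decoys=3),
)


def schedule(curriculum: Tuple[Stage, ...] = CURRICULUM) -> Iterator[Tuple[int, Stage]]:
    """Yield (global_step, stage) pairs, easiest stage first."""
    step = 0
    for stage in curriculum:
        for _ in range(stage.steps):
            yield step, stage
            step += 1
```

Iterating over `schedule()` gives 800 steps in total, with the stage changing every 200 steps.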

### Training Hyperparameters

- LoRA rank: 16
- Learning rate (SFT): 2e-4
- Learning rate (GRPO): 5e-6
- GRPO group size (`num_generations`): 8
- Batch size: 2 (with grad_accum=4)
- Steps per stage: 200
- Framework: Unsloth + HuggingFace TRL
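
To make the group size concrete: GRPO scores each completion relative to the other completions sampled for the same prompt, so with `num_generations` set to 8 each prompt yields 8 rewards that are normalized within the group. The sketch below shows that normalization in plain Python. It is illustrative only; the function name, the `eps` value, and the use of the population standard deviation are choices made for this sketch, and TRL's implementation may differ in those details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize one prompt's rewards to zero mean and roughly unit scale.

    `rewards` holds one scalar reward per sampled completion (8 per prompt
    with the settings above). `eps` avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; libraries differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group in which all 8 completions earn the same reward produces all-zero advantages, so that prompt contributes no learning signal for the step.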

### Reward Design (RLVR)

The reward is computed by a **deterministic verifier**: the ground-truth triage action is derived purely from the structured event parameters, never from any free text. This makes the reward verifiable and reproducible.

**Defender reward components:**
- +1.0 for matching the verifier's ground-truth action
- −1.0 for dismiss-on-malicious (the cardinal SOC failure mode)
- −0.3 for over-reacting on benign (containment on noise)
- −0.05 for unnecessary escalation
- +0.1 bonus for citing the correct triggering log_id

Full rubric: [`rubric.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/rubric.py)
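
One way the listed components might combine is sketched below. This is not the contents of `rubric.py`. The function name, the grouping of actions into "malicious" and "containment", the rule that at most one penalty applies, and awarding the citation bonus independently of the action are all assumptions made for illustration.

```python
from typing import Optional

# ASSUMPTION: these groupings are guesses, not taken from rubric.py.
MALICIOUS_TRUTHS = {"quarantine_host", "block_ip", "escalate"}
CONTAINMENT = {"quarantine_host", "block_ip"}


def defender_reward(
    action: str,
    truth: str,
    cited_log_id: Optional[str] = None,
    trigger_log_id: Optional[str] = None,
) -> float:
    """Score one triage decision against the verifier's ground truth."""
    reward = 0.0
    if action == truth:
        reward += 1.0   # matched the ground-truth action
    elif action == "dismiss" and truth in MALICIOUS_TRUTHS:
        reward -= 1.0   # dismissed a malicious event
    elif action in CONTAINMENT and truth == "dismiss":
        reward -= 0.3   # containment applied to benign noise
    elif action == "escalate":
        reward -= 0.05  # escalated when it was not required
    # ASSUMPTION: the citation bonus is independent of the action outcome.
    if trigger_log_id is not None and cited_log_id == trigger_log_id:
        reward += 0.1
    return reward
```

Under these assumptions a correct action with a correct citation scores 1.1, while dismissing a malicious event without a valid citation scores -1.0.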

## Stage Adapters

Each curriculum stage's adapter is published separately:

| Stage | Repo |
|---|---|
| SFT warm-start | [`opensoc-defender-grpo-sft`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-sft) |
| Stage 1 (easy) | [`opensoc-defender-grpo-stage1_basic`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage1_basic) |
| Stage 2 (medium) | [`opensoc-defender-grpo-stage2_multi`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage2_multi) |
| Stage 3 (hard) | [`opensoc-defender-grpo-stage3_mixed`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage3_mixed) |
| Stage 4 (adversarial) | [`opensoc-defender-grpo-stage4_adversarial`](https://huggingface.co/shivam2k3/opensoc-defender-grpo-stage4_adversarial) |

## Model Sources

- **Environment:** [`shivam2k3/opensoc-env`](https://huggingface.co/spaces/shivam2k3/opensoc-env) (HF Space, running)
- **Training notebook:** [`train_grpo.ipynb`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/train_grpo.ipynb)
- **Verifier source:** [`verifier.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/verifier.py)
- **Rubric source:** [`rubric.py`](https://huggingface.co/spaces/shivam2k3/opensoc-env/blob/main/rubric.py)
- **Live demo:** [`/demo`](https://shivam2k3-opensoc-env.hf.space/demo)

## How to Use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and its tokenizer.
base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-3B-Instruct")

# Attach the LoRA adapter on top of the base weights and switch to inference mode.
model = PeftModel.from_pretrained(base, "shivam2k3/opensoc-defender-grpo")
model.eval()
```

## Compute Infrastructure

- **Hardware:** NVIDIA L4 (24GB) via HuggingFace Jupyter Notebooks
- **Training time:** ~3.5 hours total (SFT + GRPO + eval)
- **Cost:** ~$3 of HF compute credits

## Framework Versions

- PEFT 0.19.1
- Transformers (latest)
- TRL (latest)
- Unsloth (latest)