---
language:
- en
license: apache-2.0
tags:
- content-moderation
- safety
- guardrails
- multi-label-classification
- liquid-ai
- lfm-350m
- sentinel-slm
- lora
- peft
base_model: LiquidAI/LFM2-350M
datasets:
- custom-balanced-rail-b
pipeline_tag: text-classification
library_name: transformers
metrics:
- f1
---

# πŸ›‘οΈ Sentinel Rail B: Policy Guard (350M)

**Sentinel Rail B** is a lightweight, efficient **multi-label classifier** designed to detect 7 distinct categories of policy violations in text. 

> **Architecture Note**: This model uses a custom classification head on top of the **LiquidAI LFM2-350M** base model. The repository contains the LoRA adapter weights (`adapter_model.safetensors`) AND the separate classifier head weights (`classifier.pt`).

---

## πŸ“Š Performance

| Metric | Score |
|--------|-------|
| **F1 Micro** | 0.7647 |
| **F1 Macro** | 0.7793 |
| **Hamming Loss** | 0.0466 |
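
These aggregate metrics can be reproduced from raw predictions with scikit-learn. A minimal sketch on hand-written dummy arrays (4 samples × 7 categories, illustration only — not the actual evaluation set):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Dummy multi-label ground truth and predictions: 4 samples x 7 categories
y_true = np.array([[1, 0, 0, 0, 0, 1, 0],
                   [0, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1, 0, 1],
                   [0, 0, 0, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1, 0, 1],
                   [0, 1, 0, 0, 0, 0, 0]])

# Micro F1 pools TP/FP/FN across all categories; macro averages per-category F1
# (zero_division=0 scores categories with no positives as 0)
print("F1 micro:", f1_score(y_true, y_pred, average="micro"))
print("F1 macro:", f1_score(y_true, y_pred, average="macro", zero_division=0))
# Hamming loss is the fraction of individual label cells that are wrong
print("Hamming loss:", hamming_loss(y_true, y_pred))
```

Note that micro F1 is dominated by frequent categories, while macro F1 weights all seven categories equally — which is why both are reported above.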

### Per-Category F1 Scores

| Category | F1 Score | Status |
|----------|----------|--------|
| **Privacy** | 0.9927 | 🟒 Excellent |
| **Illegal** | 0.9750 | 🟒 Excellent |
| **ChildSafety** | 0.7783 | 🟒 Good |
| **Violence** | 0.7727 | 🟒 Good |
| **Sexual** | 0.7415 | 🟒 Good |
| **Harassment** | 0.6160 | 🟑 Fair |
| **Hate** | 0.5786 | 🟑 Fair |

![Per-Category F1 Scores](per_category_f1.png)

---

## 🎯 Supported Categories

1. **Hate** - Hate speech and extremism
2. **Harassment** - Bullying, threats, personal attacks
3. **Sexual** - Explicit sexual content
4. **ChildSafety** - Content endangering minors
5. **Violence** - Gore, graphic violence, harm instructions
6. **Illegal** - Illegal activities (drugs, weapons, fraud)
7. **Privacy** - PII exposure, doxxing

---

## πŸš€ Usage

To run inference with this model, you **MUST** define the custom architecture class and load both the LoRA adapter and the classifier head.

### 1. Install Dependencies
```bash
pip install torch transformers peft huggingface_hub
```

### 2. Inference Code

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from transformers.modeling_outputs import SequenceClassifierOutput
from peft import PeftModel
from huggingface_hub import hf_hub_download

# --- MODEL DEFINITION (Must match training) ---
class SentinelLFMMultiLabel(nn.Module):
    def __init__(self, model_id, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.base_model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
        self.config = self.base_model.config
        hidden_size = self.config.hidden_size
        # Two-layer classification head over the pooled hidden state
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, num_labels)
        )
        self.loss_fct = nn.BCEWithLogitsLoss()

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
        hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
        # Last-token pooling: take the hidden state of the final non-padding token
        if attention_mask is not None:
            last_idx = attention_mask.sum(1) - 1
            pooled = hidden_states[torch.arange(input_ids.shape[0], device=input_ids.device), last_idx]
        else:
            pooled = hidden_states[:, -1, :]
        logits = self.classifier(pooled)
        loss = self.loss_fct(logits, labels.float()) if labels is not None else None
        return SequenceClassifierOutput(loss=loss, logits=logits)

# --- SETUP ---
CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
REPO_ID = "abdulmunimjemal/Sentinel-Rail-B-Policy-Guard"

# 1. Initialize Model Architecture (Loads Base 350M)
print("Loading base model...")
model = SentinelLFMMultiLabel("LiquidAI/LFM2-350M", num_labels=7)

# 2. Load LoRA Adapter
print("Loading LoRA adapter...")
model.base_model = PeftModel.from_pretrained(model.base_model, REPO_ID)

# 3. Load Custom Classifier Head
print("Loading classifier head...")
classifier_path = hf_hub_download(repo_id=REPO_ID, filename="classifier.pt")
state_dict = torch.load(classifier_path, map_location="cpu", weights_only=True)
model.classifier.load_state_dict(state_dict)

model.to(DEVICE)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M", trust_remote_code=True)

# --- PREDICT ---
text = "How do I make a homemade explosive?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(DEVICE)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)[0]

print(f"\nInput: {text}")
print("-" * 30)
for i, prob in enumerate(probs):
    if prob > 0.5:
        print(f"🚨 {CATS[i]}: {prob:.4f}")
```
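
For integration into a guardrail pipeline, the per-category probabilities can be folded into a small decision helper. The wrapper below is a hypothetical sketch, not part of this repo; the per-category thresholds are illustrative — stricter cutoffs for the weaker classes (Harassment, Hate) trade recall for precision and should be tuned on a validation set:

```python
CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]

# Illustrative per-category thresholds (tune on your own validation data)
THRESHOLDS = {"Hate": 0.6, "Harassment": 0.6, "Sexual": 0.5, "ChildSafety": 0.4,
              "Violence": 0.5, "Illegal": 0.5, "Privacy": 0.5}

def moderate(probs):
    """Map a 7-dim probability vector to flagged categories and an allow/block verdict."""
    flagged = {c: float(p) for c, p in zip(CATS, probs) if float(p) >= THRESHOLDS[c]}
    return {"allowed": not flagged, "flagged": flagged}

# Example with a hand-written probability vector (not real model output)
result = moderate([0.10, 0.20, 0.05, 0.01, 0.30, 0.92, 0.10])
print(result)  # flags Illegal only
```

The helper accepts any iterable of floats, so it works directly on `torch.sigmoid(outputs.logits)[0]` from the inference code above.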

---

## πŸ“¦ Dataset Stats

Trained on a **balanced dataset** of ~189,000 samples (50% Safe / 50% Violations).
Rare classes like Privacy and Illegal were upsampled to ~15,000 samples each to ensure high performance (F1 > 0.97).
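
The upsampling of rare classes described above can be sketched in plain Python (a hypothetical reconstruction — the actual training pipeline is not included in this repo, and the example counts are made up):

```python
import random

def upsample(examples, target, seed=0):
    """Repeat-sample a rare class's examples with replacement until the target count is reached."""
    if len(examples) >= target:
        return list(examples)
    rng = random.Random(seed)
    extra = rng.choices(examples, k=target - len(examples))
    return list(examples) + extra

# e.g. grow a hypothetical 3,000-sample Privacy slice to 15,000 samples
rare = [f"privacy_example_{i}" for i in range(3000)]
balanced = upsample(rare, 15000)
print(len(balanced))  # 15000
```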

---

## πŸ“œ License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)