---
language:
- en
license: apache-2.0
tags:
- content-moderation
- safety
- guardrails
- multi-label-classification
- liquid-ai
- lfm-350m
- sentinel-slm
- lora
- peft
base_model: LiquidAI/LFM2-350M
datasets:
- custom-balanced-rail-b
pipeline_tag: text-classification
library_name: transformers
metrics:
- f1
---
# 🛡️ Sentinel Rail B: Policy Guard (350M)
**Sentinel Rail B** is a lightweight, efficient **multi-label classifier** designed to detect 7 distinct categories of policy violations in text.
> **Architecture Note**: This model uses a custom classification head on top of the **LiquidAI LFM2-350M** base model. The repository contains the LoRA adapter weights (`adapter_model.safetensors`) AND the separate classifier head weights (`classifier.pt`).
---
## 📊 Performance
| Metric | Score |
|--------|-------|
| **F1 Micro** | 0.7647 |
| **F1 Macro** | 0.7793 |
| **Hamming Loss** | 0.0466 |
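
For readers unfamiliar with the three aggregate metrics, the following is a minimal pure-Python sketch of how they are conventionally defined for multi-label outputs. The toy label matrices are hypothetical, the helper name `multilabel_metrics` is introduced here for illustration, and returning 0.0 for an empty denominator is an assumed convention; the card does not state which evaluation library or settings produced the scores above.

```python
def multilabel_metrics(y_true, y_pred):
    """Return (micro_f1, macro_f1, hamming_loss) for 0/1 label matrices."""
    num_labels = len(y_true[0])
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    mismatches = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t == 1 and p == 1:
                tp[j] += 1
            elif t == 0 and p == 1:
                fp[j] += 1
            elif t == 1 and p == 0:
                fn[j] += 1
            if t != p:
                mismatches += 1

    def f1(tp_, fp_, fn_):
        # Assumed convention: F1 is 0.0 when there is nothing to count.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all labels, then compute one F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(num_labels)) / num_labels
    # Hamming loss: fraction of individual label slots predicted wrongly.
    hamming = mismatches / (len(y_true) * num_labels)
    return micro, macro, hamming


# Hypothetical toy data: 4 samples, 3 labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
micro, macro, hamming = multilabel_metrics(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f} hamming={hamming:.4f}")
# micro=0.8000 macro=0.7778 hamming=0.1667
```

Because macro F1 weights every category equally, a weak category pulls it down regardless of how often that category occurs, whereas micro F1 is dominated by the most frequent labels.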
### Per-Category F1 Scores
| Category | F1 Score | Status |
|----------|----------|--------|
| **Privacy** | 0.9927 | 🟢 Excellent |
| **Illegal** | 0.9750 | 🟢 Excellent |
| **ChildSafety** | 0.7783 | 🟢 Good |
| **Violence** | 0.7727 | 🟢 Good |
| **Sexual** | 0.7415 | 🟢 Good |
| **Harassment** | 0.6160 | 🟡 Fair |
| **Hate** | 0.5786 | 🟡 Fair |
![Per-Category F1 Scores](per_category_f1.png)
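
As a quick consistency check, the reported F1 Macro should equal the unweighted mean of the seven per-category scores. The short sketch below recomputes it from the table; only the variable names are introduced here.

```python
# Per-category F1 scores copied from the table above.
per_category_f1 = {
    "Privacy": 0.9927,
    "Illegal": 0.9750,
    "ChildSafety": 0.7783,
    "Violence": 0.7727,
    "Sexual": 0.7415,
    "Harassment": 0.6160,
    "Hate": 0.5786,
}

# Macro F1 is the plain average over categories.
macro_f1 = sum(per_category_f1.values()) / len(per_category_f1)
weakest = min(per_category_f1, key=per_category_f1.get)

print(round(macro_f1, 4))  # 0.7793
print(weakest)             # Hate
```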
---
## 🎯 Supported Categories
1. **Hate** - Hate speech and extremism
2. **Harassment** - Bullying, threats, personal attacks
3. **Sexual** - Explicit sexual content
4. **ChildSafety** - Content endangering minors
5. **Violence** - Gore, graphic violence, harm instructions
6. **Illegal** - Illegal activities (drugs, weapons, fraud)
7. **Privacy** - PII exposure, doxxing
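
The list order above matches the `CATS` list in the usage code, so position `i` of the model's output corresponds to category `i`. A minimal sketch of turning a probability vector into named flags might look like this; the `decode` helper and the example probabilities are hypothetical, and the 0.5 threshold simply mirrors the one used in the inference snippet.

```python
CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]


def decode(probs, threshold=0.5):
    """Map a length-7 probability vector to {category: probability} for flagged labels."""
    if len(probs) != len(CATS):
        raise ValueError(f"expected {len(CATS)} probabilities, got {len(probs)}")
    return {cat: p for cat, p in zip(CATS, probs) if p > threshold}


# Hypothetical sigmoid outputs for one input text.
example_probs = [0.02, 0.10, 0.01, 0.03, 0.81, 0.93, 0.04]
flagged = decode(example_probs)
print(flagged)  # {'Violence': 0.81, 'Illegal': 0.93}
```

Since this is a multi-label classifier, several categories can be flagged for the same text, and an empty result means no category crossed the threshold.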
---
## 🚀 Usage
To run inference with this model, you **MUST** define the custom architecture class and load both the LoRA adapter and the classifier head.
### 1. Install Dependencies
```bash
pip install torch transformers peft huggingface_hub
```
### 2. Inference Code
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from transformers.modeling_outputs import SequenceClassifierOutput
from peft import PeftModel
from huggingface_hub import hf_hub_download


# --- MODEL DEFINITION (Must match training) ---
class SentinelLFMMultiLabel(nn.Module):
    def __init__(self, model_id, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.base_model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
        self.config = self.base_model.config
        hidden_size = self.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, num_labels)
        )
        self.loss_fct = nn.BCEWithLogitsLoss()

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
        hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
        if attention_mask is not None:
            # Pool the hidden state of the last non-padding token (assumes right padding)
            last_idx = attention_mask.sum(1) - 1
            pooled = hidden_states[torch.arange(input_ids.shape[0], device=input_ids.device), last_idx]
        else:
            pooled = hidden_states[:, -1, :]
        logits = self.classifier(pooled)
        loss = self.loss_fct(logits, labels.float()) if labels is not None else None
        return SequenceClassifierOutput(loss=loss, logits=logits)


# --- SETUP ---
CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
REPO_ID = "abdulmunimjemal/Sentinel-Rail-B-Policy-Guard"

# 1. Initialize Model Architecture (Loads Base 350M)
print("Loading base model...")
model = SentinelLFMMultiLabel("LiquidAI/LFM2-350M", num_labels=7)

# 2. Load LoRA Adapter
print("Loading LoRA adapter...")
model.base_model = PeftModel.from_pretrained(model.base_model, REPO_ID)

# 3. Load Custom Classifier Head
print("Loading classifier head...")
classifier_path = hf_hub_download(repo_id=REPO_ID, filename="classifier.pt")
state_dict = torch.load(classifier_path, map_location="cpu")
model.classifier.load_state_dict(state_dict)

model.to(DEVICE)
model.eval()  # disables dropout in the classifier head
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M", trust_remote_code=True)

# --- PREDICT ---
text = "How do I make a homemade explosive?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(DEVICE)
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)[0]

print(f"\nInput: {text}")
print("-" * 30)
for i, prob in enumerate(probs):
    if prob > 0.5:
        print(f"🚨 {CATS[i]}: {prob:.4f}")
```
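
One detail of the forward pass matters when you batch several texts: the pooling index is computed as `attention_mask.sum(1) - 1`, which points at the last real token only when padding is on the right. The pure-Python sketch below illustrates the arithmetic on hypothetical masks; the helper names are introduced here. The card does not state which padding side the tokenizer uses by default, so checking `tokenizer.padding_side` before batching with `padding=True` is a reasonable precaution.

```python
def pooled_index(mask_row):
    """Index the forward pass would pool from: number of real tokens minus one."""
    return sum(mask_row) - 1


def last_real_index(mask_row):
    """Position of the last token whose attention mask is 1."""
    return max(i for i, m in enumerate(mask_row) if m == 1)


# Hypothetical attention masks for a 3-token text padded to length 5.
right_padded = [1, 1, 1, 0, 0]
left_padded = [0, 0, 1, 1, 1]

print(pooled_index(right_padded), last_real_index(right_padded))  # 2 2
print(pooled_index(left_padded), last_real_index(left_padded))    # 2 4
```

With left padding the two indices disagree, so the classifier would read the hidden state of the wrong token. For single, unpadded inputs as in the snippet above, the mask is all ones and the issue does not arise.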
---
## 📦 Dataset Stats
Trained on a **balanced dataset** of ~189,000 samples (50% Safe / 50% Violations).
Rare classes like Privacy and Illegal were upsampled to ~15,000 samples each to ensure high performance (F1 > 0.97).
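
The card does not describe how the upsampling was performed. One common approach is to draw additional examples with replacement until a class reaches the target size; the sketch below shows that approach under this assumption, with a hypothetical `upsample` helper and made-up example data.

```python
import random


def upsample(samples, target_size, seed=0):
    """Pad a class up to target_size by re-drawing existing samples with replacement."""
    if not samples:
        raise ValueError("cannot upsample an empty class")
    if len(samples) >= target_size:
        return list(samples)
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extra = rng.choices(samples, k=target_size - len(samples))
    return list(samples) + extra


# Hypothetical rare class with 40 examples, upsampled to 100.
rare = [f"privacy_example_{i}" for i in range(40)]
balanced = upsample(rare, target_size=100)
print(len(balanced))       # 100
print(len(set(balanced)))  # 40
```

Upsampling of this kind duplicates existing texts rather than adding new ones, so splitting into train and evaluation sets before upsampling keeps duplicated examples out of the evaluation set.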
---
## 📜 License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)