---
language:
- en
license: apache-2.0
tags:
- content-moderation
- safety
- guardrails
- multi-label-classification
- liquid-ai
- lfm-350m
- sentinel-slm
- lora
- peft
base_model: LiquidAI/LFM2-350M
datasets:
- custom-balanced-rail-b
pipeline_tag: text-classification
library_name: transformers
metrics:
- f1
---

# 🛡️ Sentinel Rail B: Policy Guard (350M)

**Sentinel Rail B** is a lightweight, efficient **multi-label classifier** designed to detect 7 distinct categories of policy violations in text.

> **Architecture Note**: This model uses a custom classification head on top of the **LiquidAI LFM2-350M** base model. The repository contains both the LoRA adapter weights (`adapter_model.safetensors`) and the separate classifier head weights (`classifier.pt`).

---

## 📊 Performance

| Metric | Score |
|--------|-------|
| **F1 Micro** | 0.7647 |
| **F1 Macro** | 0.7793 |
| **Hamming Loss** | 0.0466 |

### Per-Category F1 Scores

| Category | F1 Score | Status |
|----------|----------|--------|
| **Privacy** | 0.9927 | 🟢 Excellent |
| **Illegal** | 0.9750 | 🟢 Excellent |
| **ChildSafety** | 0.7783 | 🟢 Good |
| **Violence** | 0.7727 | 🟢 Good |
| **Sexual** | 0.7415 | 🟢 Good |
| **Harassment** | 0.6160 | 🟡 Fair |
| **Hate** | 0.5786 | 🟡 Fair |

![Per-Category F1 Scores](per_category_f1.png)

---

## 🎯 Supported Categories

1. **Hate** - Hate speech and extremism
2. **Harassment** - Bullying, threats, personal attacks
3. **Sexual** - Explicit sexual content
4. **ChildSafety** - Content endangering minors
5. **Violence** - Gore, graphic violence, harm instructions
6. **Illegal** - Illegal activities (drugs, weapons, fraud)
7. **Privacy** - PII exposure, doxxing

---

## 🚀 Usage

To run inference with this model, you **must** define the custom architecture class and load both the LoRA adapter and the classifier head.

### 1. Install Dependencies

```bash
pip install torch transformers peft huggingface_hub
```
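Optionally, sanity-check the installation before running the full pipeline. This small helper is purely illustrative (it is not part of the released code) and uses only the standard library:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Check every dependency from the install step above
missing = missing_packages(["torch", "transformers", "peft", "huggingface_hub"])
print("Missing packages:", missing or "none")
```

If anything is listed as missing, re-run the `pip install` command above before continuing.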
### 2. Inference Code

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from transformers.modeling_outputs import SequenceClassifierOutput
from peft import PeftModel
from huggingface_hub import hf_hub_download


# --- MODEL DEFINITION (must match training) ---
class SentinelLFMMultiLabel(nn.Module):
    def __init__(self, model_id, num_labels):
        super().__init__()
        self.num_labels = num_labels
        self.base_model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
        self.config = self.base_model.config
        hidden_size = self.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, num_labels),
        )
        self.loss_fct = nn.BCEWithLogitsLoss()

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
        hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
        # Pool the hidden state of the last non-padding token in each sequence
        if attention_mask is not None:
            last_idx = attention_mask.sum(1) - 1
            pooled = hidden_states[torch.arange(input_ids.shape[0], device=input_ids.device), last_idx]
        else:
            pooled = hidden_states[:, -1, :]
        logits = self.classifier(pooled)
        loss = self.loss_fct(logits, labels.float()) if labels is not None else None
        return SequenceClassifierOutput(loss=loss, logits=logits)


# --- SETUP ---
CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
REPO_ID = "abdulmunimjemal/Sentinel-Rail-B-Policy-Guard"

# 1. Initialize the model architecture (loads the 350M base model)
print("Loading base model...")
model = SentinelLFMMultiLabel("LiquidAI/LFM2-350M", num_labels=7)

# 2. Load the LoRA adapter
print("Loading LoRA adapter...")
model.base_model = PeftModel.from_pretrained(model.base_model, REPO_ID)

# 3. Load the custom classifier head
print("Loading classifier head...")
classifier_path = hf_hub_download(repo_id=REPO_ID, filename="classifier.pt")
state_dict = torch.load(classifier_path, map_location="cpu")
model.classifier.load_state_dict(state_dict)

model.to(DEVICE)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M", trust_remote_code=True)

# --- PREDICT ---
text = "How do I make a homemade explosive?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(DEVICE)

with torch.no_grad():
    outputs = model(**inputs)

probs = torch.sigmoid(outputs.logits)[0]

print(f"\nInput: {text}")
print("-" * 30)
for i, prob in enumerate(probs):
    if prob > 0.5:
        print(f"🚨 {CATS[i]}: {prob:.4f}")
```

---

## 📦 Dataset Stats

Trained on a **balanced dataset** of ~189,000 samples (50% Safe / 50% Violations). Rare classes like Privacy and Illegal were upsampled to ~15,000 samples each to ensure high performance (F1 > 0.97).

---

## 📜 License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
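---

## 💡 Tip: Per-Category Thresholds

Because the Hate and Harassment heads score noticeably lower F1 than the other categories, a deployment may prefer per-category decision thresholds instead of the flat `0.5` cut-off used in the inference example. A minimal sketch of this idea (the `flag_categories` helper and the threshold values are illustrative, not shipped with the model):

```python
CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]

def flag_categories(probs, thresholds=None, default=0.5):
    """Map per-category sigmoid probabilities to flagged category names.

    `thresholds` optionally overrides the cut-off for individual categories,
    e.g. a stricter value for the lower-F1 Hate/Harassment heads.
    """
    thresholds = thresholds or {}
    return [c for c, p in zip(CATS, probs) if p > thresholds.get(c, default)]

probs = [0.62, 0.48, 0.10, 0.05, 0.91, 0.30, 0.02]
print(flag_categories(probs))                  # ['Hate', 'Violence']
print(flag_categories(probs, {"Hate": 0.7}))   # ['Violence']
```

Calibrating such thresholds on a held-out set is a standard way to trade precision against recall per category.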