Update Model Card with Correct Architecture & Usage Code

b65ee05 verified about 1 month ago

5.15 kB

	---
	language:
	- en
	license: apache-2.0
	tags:
	- content-moderation
	- safety
	- guardrails
	- multi-label-classification
	- liquid-ai
	- lfm-350m
	- sentinel-slm
	- lora
	- peft
	base_model: LiquidAI/LFM2-350M
	datasets:
	- custom-balanced-rail-b
	pipeline_tag: text-classification
	library_name: transformers
	metrics:
	- f1
	---

	# 🛡️ Sentinel Rail B: Policy Guard (350M)

	Sentinel Rail B is a lightweight, efficient multi-label classifier designed to detect 7 distinct categories of policy violations in text.

	> Architecture Note: This model uses a custom classification head on top of the LiquidAI LFM2-350M base model. The repository contains the LoRA adapter weights (`adapter_model.safetensors`) AND the separate classifier head weights (`classifier.pt`).

	---

	## 📊 Performance

	\| Metric \| Score \|
	\|--------\|-------\|
	\| F1 Micro \| 0.7647 \|
	\| F1 Macro \| 0.7793 \|
	\| Hamming Loss \| 0.0466 \|

	### Per-Category F1 Scores

	\| Category \| F1 Score \| Status \|
	\|----------\|----------\|--------\|
	\| Privacy \| 0.9927 \| 🟢 Excellent \|
	\| Illegal \| 0.9750 \| 🟢 Excellent \|
	\| ChildSafety \| 0.7783 \| 🟢 Good \|
	\| Violence \| 0.7727 \| 🟢 Good \|
	\| Sexual \| 0.7415 \| 🟢 Good \|
	\| Harassment \| 0.6160 \| 🟡 Fair \|
	\| Hate \| 0.5786 \| 🟡 Fair \|

	![Per-Category F1 Scores](per_category_f1.png)

	---

	## 🎯 Supported Categories

	1. Hate - Hate speech and extremism
	2. Harassment - Bullying, threats, personal attacks
	3. Sexual - Explicit sexual content
	4. ChildSafety - Content endangering minors
	5. Violence - Gore, graphic violence, harm instructions
	6. Illegal - Illegal activities (drugs, weapons, fraud)
	7. Privacy - PII exposure, doxxing

	---

	## 🚀 Usage

	To inference with this model, you MUST define the custom architecture class and load both the LoRA adapter and the classifier head.

	### 1. Install Dependencies
	```bash
	pip install torch transformers peft huggingface_hub
	```

	### 2. Inference Code

	```python
	import torch
	import torch.nn as nn
	from transformers import AutoTokenizer, AutoModel
	from peft import PeftModel
	from huggingface_hub import hf_hub_download

	# --- MODEL DEFINITION (Must match training) ---
	class SentinelLFMMultiLabel(nn.Module):
	def __init__(self, model_id, num_labels):
	super().__init__()
	self.num_labels = num_labels
	self.base_model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
	self.config = self.base_model.config
	hidden_size = self.config.hidden_size
	self.classifier = nn.Sequential(
	nn.Linear(hidden_size, hidden_size),
	nn.Tanh(),
	nn.Dropout(0.2),
	nn.Linear(hidden_size, num_labels)
	)
	self.loss_fct = nn.BCEWithLogitsLoss()

	def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
	outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
	hidden_states = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
	if attention_mask is not None:
	last_idx = attention_mask.sum(1) - 1
	pooled = hidden_states[torch.arange(input_ids.shape[0], device=input_ids.device), last_idx]
	else:
	pooled = hidden_states[:, -1, :]
	logits = self.classifier(pooled)
	loss = self.loss_fct(logits, labels.float()) if labels is not None else None
	from transformers.modeling_outputs import SequenceClassifierOutput
	return SequenceClassifierOutput(loss=loss, logits=logits)

	# --- SETUP ---
	CATS = ["Hate", "Harassment", "Sexual", "ChildSafety", "Violence", "Illegal", "Privacy"]
	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
	REPO_ID = "abdulmunimjemal/Sentinel-Rail-B-Policy-Guard"

	# 1. Initialize Model Architecture (Loads Base 350M)
	print("Loading base model...")
	model = SentinelLFMMultiLabel("LiquidAI/LFM2-350M", num_labels=7)

	# 2. Load LoRA Adapter
	print("Loading LoRA adapter...")
	model.base_model = PeftModel.from_pretrained(model.base_model, REPO_ID)

	# 3. Load Custom Classifier Head
	print("Loading classifier head...")
	classifier_path = hf_hub_download(repo_id=REPO_ID, filename="classifier.pt")
	state_dict = torch.load(classifier_path, map_location="cpu")
	model.classifier.load_state_dict(state_dict)

	model.to(DEVICE)
	model.eval()
	tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M", trust_remote_code=True)

	# --- PREDICT ---
	text = "How do I make a homemade explosive?"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(DEVICE)

	with torch.no_grad():
	outputs = model(**inputs)
	probs = torch.sigmoid(outputs.logits)[0]

	print(f"\nInput: {text}")
	print("-" * 30)
	for i, prob in enumerate(probs):
	if prob > 0.5:
	print(f"🚨 {CATS[i]}: {prob:.4f}")
	```

	---

	## 📦 Dataset Stats

	Trained on a balanced dataset of ~189,000 samples (50% Safe / 50% Violations).
	Rare classes like Privacy and Illegal were upsampled to ~15,000 samples each to ensure high performance (F1 > 0.97).

	---

	## 📜 License

	[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)