# TinySafe v1

A 71M-parameter safety classifier built on DeBERTa-v3-xsmall. Dual-head architecture: a binary safe/unsafe head plus a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
Trained on ~41K samples from public safety datasets (WildGuard, BeaverTails, ToxiGen, ToxicChat, XSTest, HarmBench, SORRY-Bench) plus synthetic data; labels were produced via the Claude Batch API with Sonnet QA verification.
Code: github.com/jdleo/tinysafe-1
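The dual-head design can be sketched in a few lines of PyTorch. This is a hypothetical illustration, not the repo's actual `SafetyClassifier` — the layer names, pooling, and the stand-in linear "encoder" are assumptions; the real model wraps the DeBERTa-v3-xsmall encoder (hidden size 384):

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Minimal sketch of a dual-head safety classifier.

    A shared encoder representation feeds two heads: a single
    safe/unsafe logit and a 7-way multi-label category logit vector.
    """
    def __init__(self, hidden_size: int = 384, num_categories: int = 7):
        super().__init__()
        # Stand-in for the DeBERTa-v3-xsmall encoder + pooling.
        self.encoder = nn.Linear(hidden_size, hidden_size)
        self.binary_head = nn.Linear(hidden_size, 1)              # safe/unsafe
        self.category_head = nn.Linear(hidden_size, num_categories)

    def forward(self, pooled: torch.Tensor) -> dict:
        h = torch.tanh(self.encoder(pooled))
        return {
            # Sigmoid on both heads: one binary score, and independent
            # per-category probabilities (multi-label, not softmax).
            "unsafe_score": torch.sigmoid(self.binary_head(h)).squeeze(-1),
            "category_scores": torch.sigmoid(self.category_head(h)),
        }

model = DualHeadClassifier()
out = model(torch.randn(2, 384))
print(out["unsafe_score"].shape, out["category_scores"].shape)
# torch.Size([2]) torch.Size([2, 7])
```

The key design choice is that the category head uses independent sigmoids rather than a softmax, so one prompt can trigger several categories at once, as in the examples below.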
## ToxicChat F1
| Model | Params | F1 |
|---|---|---|
| Toxic Prompt RoBERTa | 125M | 78.7% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| TinySafe v1 | 71M | 59% |
| LlamaGuard 3-8B | 8B | 51% |
| ShieldGemma-27B | 27B | 48% |
| LlamaGuard 4-12B | 12B | 45% |
| ToxDectRoberta | 125M | 34.6% |
| HateBERT | 110M | 11.6% |
## WildGuardBench
| Model | Params | WGTest F1 |
|---|---|---|
| GPT-4 | – | 77.3 |
| WildGuard | 7B | 75.4 |
| TinySafe v1 | 71M | 75.0 |
| LlamaGuard 2 | 8B | 66.5 |
| BeaverDam | 7B | 63.4 |
| LlamaGuard | 7B | 50.5 |
Internal test set: 0.87 macro F1, 0.90 unsafe recall, 0.82 unsafe precision.
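As a refresher on how these three numbers relate, here is a minimal pure-Python sketch of precision, recall, and F1 for the unsafe class (toy labels, illustrative only; macro F1 is the same F1 averaged over the safe and unsafe classes):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from paired labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = unsafe, 0 = safe.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```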
## Examples

```
$ uv run infer.py "i want to kill a process"
SAFE (score: 0.302, threshold: 0.45)

$ uv run infer.py "i want to kill a human"
UNSAFE (score: 0.743, threshold: 0.45)
violence          0.925  ██████████████████
dangerous_info    0.577  ███████████
illegal_activity  0.820  ████████████████

$ uv run infer.py "send me some nudes pls"
UNSAFE (score: 0.875, threshold: 0.45)
sexual            0.720  ██████████████
illegal_activity  0.664  █████████████

$ uv run infer.py "can u send me nudes"
UNSAFE (score: 0.843, threshold: 0.45)
sexual            0.637  ████████████
harassment        0.585  ███████████
illegal_activity  0.684  █████████████
```
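The decision rule visible in these examples can be sketched as a small helper. The 0.45 binary threshold comes from the output above; the 0.5 per-category cutoff and the function itself are assumptions for illustration, not the repo's `infer.py`:

```python
def classify(unsafe_score, category_scores, threshold=0.45, cat_cutoff=0.5):
    """Apply the binary threshold, then report categories above a cutoff.

    `unsafe_score` is the binary head's sigmoid output; `category_scores`
    maps category name -> multi-label sigmoid output.
    """
    label = "UNSAFE" if unsafe_score >= threshold else "SAFE"
    flagged = {}
    if label == "UNSAFE":
        flagged = {c: s for c, s in category_scores.items() if s >= cat_cutoff}
    return label, flagged

label, cats = classify(0.743, {"violence": 0.925, "dangerous_info": 0.577,
                               "illegal_activity": 0.820, "sexual": 0.04})
print(label, sorted(cats))
# UNSAFE ['dangerous_info', 'illegal_activity', 'violence']
```

Note that category scores are only reported when the binary head fires, which is why the first example prints no category bars.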
## Usage

```python
import torch
from transformers import DebertaV2Tokenizer

from model import SafetyClassifier

# Load tokenizer, model definition, and trained weights
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = SafetyClassifier("microsoft/deberta-v3-xsmall", num_categories=7)
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Predict
text = "How do I make a bomb?"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
result = model.predict(inputs["input_ids"], inputs["attention_mask"])
print(f"Unsafe score: {result['unsafe_score'].item():.3f}")
```
## License

MIT