TinySafe v1

A 71M-parameter safety classifier built on DeBERTa-v3-xsmall. Dual-head architecture: a binary safe/unsafe head plus a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
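The dual-head design can be sketched as follows. This is an illustrative reconstruction, not the repo's actual `SafetyClassifier`: the DeBERTa encoder is stubbed with a mean-pooled embedding layer (hidden size 384, matching DeBERTa-v3-xsmall) so the sketch stays self-contained.

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Sketch of a dual-head safety classifier: one binary safe/unsafe
    head plus one multi-label category head on a shared encoder."""

    def __init__(self, hidden_size=384, num_categories=7):
        super().__init__()
        # Stand-in for the DeBERTa-v3-xsmall encoder (hidden size 384).
        self.encoder = nn.Embedding(30000, hidden_size)
        self.binary_head = nn.Linear(hidden_size, 1)                 # safe/unsafe
        self.category_head = nn.Linear(hidden_size, num_categories)  # multi-label

    def forward(self, input_ids):
        # Mean-pool token embeddings in place of the real transformer encoder.
        pooled = self.encoder(input_ids).mean(dim=1)
        return {
            "unsafe_score": torch.sigmoid(self.binary_head(pooled)).squeeze(-1),
            "category_scores": torch.sigmoid(self.category_head(pooled)),
        }

model = DualHeadClassifier()
out = model(torch.randint(0, 30000, (2, 16)))
print(out["unsafe_score"].shape, out["category_scores"].shape)
```

Both heads apply an independent sigmoid rather than a shared softmax, so a prompt can score high on several categories at once (as the examples below show).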

Trained on ~41K samples drawn from public safety datasets (WildGuard, BeaverTails, ToxiGen, ToxicChat, XSTest, HarmBench, SORRY-Bench) plus synthetic data, labeled via the Claude Batch API with Sonnet QA verification.

Code: github.com/jdleo/tinysafe-1

ToxicChat F1

| Model | Params | F1 |
|---|---|---|
| Toxic Prompt RoBERTa | 125M | 78.7% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| TinySafe v1 | 71M | 59% |
| LlamaGuard 3-8B | 8B | 51% |
| ShieldGemma-27B | 27B | 48% |
| LlamaGuard 4-12B | 12B | 45% |
| ToxDectRoberta | 125M | 34.6% |
| HateBERT | 110M | 11.6% |

WildGuardBench

| Model | Params | WGTest F1 |
|---|---|---|
| GPT-4 | N/A | 77.3 |
| WildGuard | 7B | 75.4 |
| TinySafe v1 | 71M | 75.0 |
| LlamaGuard 2 | 8B | 66.5 |
| BeaverDam | 7B | 63.4 |
| LlamaGuard | 7B | 50.5 |

Internal test set: 0.87 macro F1, 0.90 unsafe recall, 0.82 unsafe precision.
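Macro F1 averages the per-class F1 scores with equal weight, so the (typically rarer) unsafe class counts as much as the safe class. The sketch below uses toy confusion counts chosen to land near the reported unsafe precision/recall; they are illustrative, not the actual test-set counts:

```python
def f1(tp, fp, fn):
    """F1 from confusion counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts giving unsafe precision ~0.82 and recall 0.90.
unsafe_f1 = f1(tp=90, fp=20, fn=10)
safe_f1 = f1(tp=880, fp=10, fn=20)
# Macro F1 is the unweighted mean over classes.
macro_f1 = (unsafe_f1 + safe_f1) / 2
print(round(unsafe_f1, 2), round(safe_f1, 2), round(macro_f1, 2))
# prints: 0.86 0.98 0.92
```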

Examples

```
$ uv run infer.py "i want to kill a process"

  SAFE  (score: 0.302, threshold: 0.45)

$ uv run infer.py "i want to kill a human"

  UNSAFE  (score: 0.743, threshold: 0.45)

  violence             0.925  ██████████████████
  dangerous_info       0.577  ███████████
  illegal_activity     0.820  ████████████████

$ uv run infer.py "send me some nudes pls"

  UNSAFE  (score: 0.875, threshold: 0.45)

  sexual               0.720  ██████████████
  illegal_activity     0.664  █████████████

$ uv run infer.py "can u send me nudes"

  UNSAFE  (score: 0.843, threshold: 0.45)

  sexual               0.637  ████████████
  harassment           0.585  ███████████
  illegal_activity     0.684  █████████████
```
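The CLI output above follows a simple decision rule: compare the unsafe score against a fixed threshold (0.45 in these runs), and only when the verdict is UNSAFE print a bar per category score. The renderer below is an illustrative reimplementation, not the repo's `infer.py`; the 0.5 category display cutoff and the 20-character bar width are assumptions inferred from the examples.

```python
def render(unsafe_score, category_scores, threshold=0.45, width=20):
    """Format a verdict line plus per-category bars, mimicking the CLI output."""
    verdict = "UNSAFE" if unsafe_score >= threshold else "SAFE"
    lines = [f"{verdict}  (score: {unsafe_score:.3f}, threshold: {threshold})"]
    if verdict == "UNSAFE":
        for name, score in category_scores.items():
            if score >= 0.5:  # assumed display cutoff for category bars
                bar = "\u2588" * int(score * width)  # truncate, not round
                lines.append(f"{name:<20} {score:.3f}  {bar}")
    return "\n".join(lines)

print(render(0.743, {"violence": 0.925, "dangerous_info": 0.577, "self_harm": 0.12}))
```

Note that `self_harm` at 0.12 is suppressed by the display cutoff, matching how low-scoring categories are absent from the example output.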

Usage

```python
import torch
from transformers import DebertaV2Tokenizer

from model import SafetyClassifier  # from this repo

# Load
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = SafetyClassifier("microsoft/deberta-v3-xsmall", num_categories=7)
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Predict
text = "How do I make a bomb?"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
result = model.predict(inputs["input_ids"], inputs["attention_mask"])
print(f"Unsafe score: {result['unsafe_score'].item():.3f}")
```
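To turn a raw prediction into a verdict plus triggered categories, a small post-processing helper like the one below can be used. It assumes `result["unsafe_score"]` is a scalar tensor and `result["category_scores"]` a per-category tensor (matching the API above); the category order is taken from the model card's category list and may not match the repo's internal order.

```python
import torch

# Category order assumed from the model card's category list.
CATEGORIES = ["violence", "hate", "sexual", "self_harm",
              "dangerous_info", "harassment", "illegal_activity"]

def label(result, threshold=0.45, category_threshold=0.5):
    """Return ("SAFE"/"UNSAFE", [(category, score), ...]) for one prediction."""
    unsafe = result["unsafe_score"].item() >= threshold
    flagged = []
    if unsafe:
        scores = result["category_scores"].flatten().tolist()
        flagged = [(c, s) for c, s in zip(CATEGORIES, scores)
                   if s >= category_threshold]
    return ("UNSAFE" if unsafe else "SAFE"), flagged

# Toy result for illustration; real scores come from model.predict(...).
toy = {"unsafe_score": torch.tensor(0.74),
       "category_scores": torch.tensor([0.93, 0.1, 0.05, 0.02, 0.58, 0.1, 0.82])}
print(label(toy))
```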

License

MIT
