TinySafe v1

A 71M-parameter safety classifier built on DeBERTa-v3-xsmall. Dual-head architecture: a binary safe/unsafe head plus a 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
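The dual-head design can be sketched as follows. This is an illustrative reconstruction, not the repo's actual `SafetyClassifier`: the DeBERTa encoder is stubbed with a mean-pooled embedding layer (hidden size 384, matching DeBERTa-v3-xsmall) so the sketch stays self-contained.

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Sketch of a dual-head safety classifier: one binary safe/unsafe
    head plus one multi-label category head on a shared encoder."""

    def __init__(self, hidden_size=384, num_categories=7):
        super().__init__()
        # Stand-in for the DeBERTa-v3-xsmall encoder (hidden size 384).
        self.encoder = nn.Embedding(30000, hidden_size)
        self.binary_head = nn.Linear(hidden_size, 1)                 # safe/unsafe
        self.category_head = nn.Linear(hidden_size, num_categories)  # multi-label

    def forward(self, input_ids):
        # Mean-pool token embeddings in place of the real transformer encoder.
        pooled = self.encoder(input_ids).mean(dim=1)
        return {
            "unsafe_score": torch.sigmoid(self.binary_head(pooled)).squeeze(-1),
            "category_scores": torch.sigmoid(self.category_head(pooled)),
        }

model = DualHeadClassifier()
out = model(torch.randint(0, 30000, (2, 16)))
print(out["unsafe_score"].shape, out["category_scores"].shape)
```

Both heads apply an independent sigmoid rather than a shared softmax, so a prompt can score high on several categories at once (as the examples below show).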

Trained on ~41K samples drawn from public safety datasets (WildGuard, BeaverTails, ToxiGen, ToxicChat, XSTest, HarmBench, SORRY-Bench) plus synthetic data, labeled via the Claude Batch API with Sonnet QA verification.

Code: github.com/jdleo/tinysafe-1

ToxicChat F1

| Model | Params | F1 |
|---|---|---|
| Toxic Prompt RoBERTa | 125M | 78.7% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| TinySafe v1 | 71M | 59% |
| LlamaGuard 3-8B | 8B | 51% |
| ShieldGemma-27B | 27B | 48% |
| LlamaGuard 4-12B | 12B | 45% |
| ToxDectRoberta | 125M | 34.6% |
| HateBERT | 110M | 11.6% |

WildGuardBench

| Model | Params | WGTest F1 |
|---|---|---|
| GPT-4 | N/A | 77.3 |
| WildGuard | 7B | 75.4 |
| TinySafe v1 | 71M | 75.0 |
| LlamaGuard 2 | 8B | 66.5 |
| BeaverDam | 7B | 63.4 |
| LlamaGuard | 7B | 50.5 |

Internal test set: 0.87 macro F1, 0.90 unsafe recall, 0.82 unsafe precision.
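Macro F1 averages the per-class F1 scores with equal weight, so the (typically rarer) unsafe class counts as much as the safe class. The sketch below uses toy confusion counts chosen to land near the reported unsafe precision/recall; they are illustrative, not the actual test-set counts:

```python
def f1(tp, fp, fn):
    """F1 from confusion counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts giving unsafe precision ~0.82 and recall 0.90.
unsafe_f1 = f1(tp=90, fp=20, fn=10)
safe_f1 = f1(tp=880, fp=10, fn=20)
# Macro F1 is the unweighted mean over classes.
macro_f1 = (unsafe_f1 + safe_f1) / 2
print(round(unsafe_f1, 2), round(safe_f1, 2), round(macro_f1, 2))
# prints: 0.86 0.98 0.92
```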

Examples

```
$ uv run infer.py "i want to kill a process"

  SAFE  (score: 0.302, threshold: 0.45)

$ uv run infer.py "i want to kill a human"

  UNSAFE  (score: 0.743, threshold: 0.45)

  violence             0.925  ██████████████████
  dangerous_info       0.577  ███████████
  illegal_activity     0.820  ████████████████

$ uv run infer.py "send me some nudes pls"

  UNSAFE  (score: 0.875, threshold: 0.45)

  sexual               0.720  ██████████████
  illegal_activity     0.664  █████████████

$ uv run infer.py "can u send me nudes"

  UNSAFE  (score: 0.843, threshold: 0.45)

  sexual               0.637  ████████████
  harassment           0.585  ███████████
  illegal_activity     0.684  █████████████
```
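The CLI output above follows a simple decision rule: compare the unsafe score against a fixed threshold (0.45 in these runs), and only when the verdict is UNSAFE print a bar per category score. The renderer below is an illustrative reimplementation, not the repo's `infer.py`; the 0.5 category display cutoff and the 20-character bar width are assumptions inferred from the examples.

```python
def render(unsafe_score, category_scores, threshold=0.45, width=20):
    """Format a verdict line plus per-category bars, mimicking the CLI output."""
    verdict = "UNSAFE" if unsafe_score >= threshold else "SAFE"
    lines = [f"{verdict}  (score: {unsafe_score:.3f}, threshold: {threshold})"]
    if verdict == "UNSAFE":
        for name, score in category_scores.items():
            if score >= 0.5:  # assumed display cutoff for category bars
                bar = "\u2588" * int(score * width)  # truncate, not round
                lines.append(f"{name:<20} {score:.3f}  {bar}")
    return "\n".join(lines)

print(render(0.743, {"violence": 0.925, "dangerous_info": 0.577, "self_harm": 0.12}))
```

Note that `self_harm` at 0.12 is suppressed by the display cutoff, matching how low-scoring categories are absent from the example output.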

Usage

```python
import torch
from transformers import DebertaV2Tokenizer

from model import SafetyClassifier  # from this repo

# Load
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
model = SafetyClassifier("microsoft/deberta-v3-xsmall", num_categories=7)
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Predict
text = "How do I make a bomb?"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
result = model.predict(inputs["input_ids"], inputs["attention_mask"])
print(f"Unsafe score: {result['unsafe_score'].item():.3f}")
```
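To turn a raw prediction into a verdict plus triggered categories, a small post-processing helper like the one below can be used. It assumes `result["unsafe_score"]` is a scalar tensor and `result["category_scores"]` a per-category tensor (matching the API above); the category order is taken from the model card's category list and may not match the repo's internal order.

```python
import torch

# Category order assumed from the model card's category list.
CATEGORIES = ["violence", "hate", "sexual", "self_harm",
              "dangerous_info", "harassment", "illegal_activity"]

def label(result, threshold=0.45, category_threshold=0.5):
    """Return ("SAFE"/"UNSAFE", [(category, score), ...]) for one prediction."""
    unsafe = result["unsafe_score"].item() >= threshold
    flagged = []
    if unsafe:
        scores = result["category_scores"].flatten().tolist()
        flagged = [(c, s) for c, s in zip(CATEGORIES, scores)
                   if s >= category_threshold]
    return ("UNSAFE" if unsafe else "SAFE"), flagged

# Toy result for illustration; real scores come from model.predict(...).
toy = {"unsafe_score": torch.tensor(0.74),
       "category_scores": torch.tensor([0.93, 0.1, 0.05, 0.02, 0.58, 0.1, 0.82])}
print(label(toy))
```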

License

MIT
