TinySafe v2

141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Successor to TinySafe v1 (71M params, 59% TC F1). v2 improves ToxicChat F1 by +19 points while cutting OR-Bench false positive rate from 18.9% to 3.8%.

GitHub: jdleo/tinysafe-2

Benchmarks

Benchmark	TinySafe v2	TinySafe v1
ToxicChat F1	78.2%	59.2%
OR-Bench FPR	3.8%	18.9%
WildGuardBench F1	62.7%	75.0%

ToxicChat Leaderboard

Model	Params	F1
internal-safety-reasoner (unreleased)	unknown	81.3%
gpt-5-thinking (unreleased)	unknown	81.0%
gpt-oss-safeguard-20b (unreleased)	21B (3.6B)*	79.9%
gpt-oss-safeguard-120b	117B (5.1B*)	79.3%
Toxic Prompt RoBERTa	125M	78.7%
TinySafe v2	141M	78.2%
Qwen3Guard-8B	8B	73%
AprielGuard-8B	8B	72%
Granite Guardian-8B	8B	71%
WildGuard	7B	70.8%
Granite Guardian-3B	3B	68%
ShieldGemma-2B	2B	67%
Qwen3Guard-0.6B	0.6B	63%
TinySafe v1	71M	59%

* = active params (MoE)

OR-Bench (Over-Refusal)

Model	FPR
TinySafe v2	3.8%
WildGuard-7B	~10%
TinySafe v1	18.9%

Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.

Quickstart

import torch
from transformers import DebertaV2Tokenizer

# Load
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")
model = torch.load("model.pt", map_location="cpu")  # or load from checkpoint

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
    unsafe_score = torch.sigmoid(binary_logits).item()
    print(f"Unsafe: {unsafe_score:.3f}")  # 0.998

Architecture

DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:

Binary head: single logit (safe/unsafe)
Category head: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)

Training enhancements:

FGM adversarial training (epsilon=0.3): perturbs embeddings for robustness
EMA (decay=0.999): smoothed weight averaging for stable eval
Multi-sample dropout (5 masks): averaged logits across dropout samples
DualHeadLossV2: focal loss (binary) + asymmetric class-balanced loss (categories)

Training

Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:

Source	Weight	Samples	Purpose
ToxicChat	4.0x	~4K	Anchor benchmark signal
WildGuardTrain	1.0x	~10K	Adversarial/jailbreak coverage
Jigsaw Civil Comments	0.5x	~7K	General toxicity diversity
BeaverTails	1.5x	~2.2K	Behavior-value alignment
Hard negatives (Claude)	1.2x	~10K	FPR control

Model selection on val F1 only (no test set leakage).

Limitations

Low-resource categories (violence, hate, sexual) have 0 F1 -- <200 training samples per category is insufficient even with class-balanced loss
WildGuardBench generalization is weak -- encoder-only models struggle with adversarial jailbreak rephrasing
Conservative on out-of-distribution inputs -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning

These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.

License

MIT

Downloads last month: 8

Datasets used to train jdleo1/tinysafe-2

Evaluation results

F1 (Binary) on ToxicChat
test set self-reported

0.782
Unsafe Recall on ToxicChat
test set self-reported

0.798
Unsafe Precision on ToxicChat
test set self-reported

0.767
Safe Accuracy (1 - FPR) on OR-Bench
self-reported

0.962