TinySafe v2


141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).

Successor to TinySafe v1 (71M params, 59% ToxicChat F1). v2 improves ToxicChat F1 by 19 points (59.2% to 78.2%) while cutting the OR-Bench false positive rate from 18.9% to 3.8%.

GitHub: jdleo/tinysafe-2

Benchmarks

| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| ToxicChat F1 | 78.2% | 59.2% |
| OR-Bench FPR | 3.8% | 18.9% |
| WildGuardBench F1 | 62.7% | 75.0% |

ToxicChat Leaderboard

| Model | Params | F1 |
|---|---|---|
| internal-safety-reasoner (unreleased) | unknown | 81.3% |
| gpt-5-thinking (unreleased) | unknown | 81.0% |
| gpt-oss-safeguard-20b (unreleased) | 21B (3.6B*) | 79.9% |
| gpt-oss-safeguard-120b | 117B (5.1B*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| TinySafe v2 | 141M | 78.2% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| TinySafe v1 | 71M | 59% |

* = active params (MoE)

OR-Bench (Over-Refusal)

| Model | FPR |
|---|---|
| TinySafe v2 | 3.8% |
| WildGuard-7B | ~10% |
| TinySafe v1 | 18.9% |

Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
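For readers less familiar with the two metrics reported in this section, the sketch below shows how F1 and false positive rate are computed from binary predictions. The function names and the toy labels are illustrative assumptions and are not taken from the TinySafe evaluation code.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = unsafe)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def false_positive_rate(y_true, y_pred):
    """Share of safe examples (0) that were flagged as unsafe (1)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels for illustration only (1 = unsafe, 0 = safe).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
toy_f1 = f1_score(y_true, y_pred)
toy_fpr = false_positive_rate(y_true, y_pred)

# A 3.8% FPR on 80K safe prompts corresponds to roughly this many flagged prompts.
flagged_prompts = round(0.038 * 80_000)
```

On the toy data this gives an F1 of 2/3 and an FPR of 1/3. Applied to the reported figure, 3.8% of 80K safe prompts is about 3,040 incorrectly flagged prompts.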

Quickstart

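import torch
from transformers import DebertaV2Tokenizer

# Load the tokenizer from the Hub
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")

# Load the pickled model object (or load from checkpoint).
# The model class definition must be importable for unpickling to work.
# weights_only=False is needed on PyTorch >= 2.6, where the default is True.
model = torch.load("model.pt", map_location="cpu", weights_only=False)
model.eval()  # disable dropout for deterministic inference

# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
    unsafe_score = torch.sigmoid(binary_logits).item()
    print(f"Unsafe: {unsafe_score:.3f}")  # 0.998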
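The Quickstart only prints the binary score. A minimal sketch of how the second output, `category_logits`, might be decoded is shown below. The helper `decode_categories`, the 0.5 threshold, and the assumption that the head's outputs follow the category order listed in the Architecture section are all illustrative and not confirmed by the source. The logits are made up so the snippet runs without the model.

```python
import torch

# Category names as listed in the Architecture section. That the head's
# outputs follow this exact order is an assumption.
CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]


def decode_categories(category_logits, threshold=0.5):
    """Return {category: probability} for categories at or above threshold."""
    probs = torch.sigmoid(category_logits).flatten()
    return {
        name: round(p.item(), 3)
        for name, p in zip(CATEGORIES, probs)
        if p.item() >= threshold
    }


# Made-up logits standing in for the model's second output (batch of 1).
example_logits = torch.tensor([[-4.0, -3.5, -5.0, -4.2, 2.0, -3.0, 1.0]])
flagged = decode_categories(example_logits)
```

Because the category head is multi-label, each category gets its own sigmoid rather than a shared softmax. Note that the Limitations section reports 0 F1 for violence, hate, and sexual, so category outputs for those labels should not be relied on.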

Architecture

DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:

  • Binary head: single logit (safe/unsafe)
  • Category head: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
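The dual-head layout described above can be sketched roughly as follows. This is not the released implementation: the class names, the first-token pooling, and the `ToyEncoder` stand-in (used so the snippet runs without downloading DeBERTa-v3-small) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DualHeadClassifier(nn.Module):
    """Shared encoder with a binary head and a 7-way multi-label head."""

    def __init__(self, encoder, hidden_dim=768, num_categories=7):
        super().__init__()
        self.encoder = encoder
        self.binary_head = nn.Linear(hidden_dim, 1)
        self.category_head = nn.Linear(hidden_dim, num_categories)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)  # (batch, seq, hidden)
        pooled = hidden[:, 0]  # first-token pooling (an assumption)
        binary_logits = self.binary_head(pooled).squeeze(-1)  # (batch,)
        category_logits = self.category_head(pooled)  # (batch, 7)
        return binary_logits, category_logits


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the real encoder; returns hidden states."""

    def __init__(self, vocab_size=100, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_ids, attention_mask):
        return self.embed(input_ids) * attention_mask.unsqueeze(-1)


model = DualHeadClassifier(ToyEncoder())
ids = torch.randint(0, 100, (2, 8))
mask = torch.ones(2, 8)
binary_logits, category_logits = model(ids, mask)
```

Both heads read the same pooled representation, so the category head adds only one extra linear layer on top of the binary classifier.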

Training enhancements:

  • FGM adversarial training (epsilon=0.3): perturbs embeddings for robustness
  • EMA (decay=0.999): smoothed weight averaging for stable eval
  • Multi-sample dropout (5 masks): averaged logits across dropout samples
  • DualHeadLossV2: focal loss (binary) + asymmetric class-balanced loss (categories)
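To make the list above concrete, here is a minimal sketch of one training step that combines FGM, EMA, multi-sample dropout, and a binary focal loss on a toy model. Only epsilon=0.3, decay=0.999, and the 5 dropout masks come from this card. The `ToyClassifier`, the dropout rate, the focal `gamma`, and the structure of `train_step` are assumptions, and the asymmetric class-balanced category loss is omitted.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyClassifier(nn.Module):
    """Tiny embedding + linear model standing in for the real encoder."""

    def __init__(self, vocab_size=50, dim=16, num_dropout_samples=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.dropouts = nn.ModuleList([nn.Dropout(0.1) for _ in range(num_dropout_samples)])
        self.head = nn.Linear(dim, 1)

    def forward(self, input_ids):
        pooled = self.embed(input_ids).mean(dim=1)
        # Multi-sample dropout: average the logits over several dropout masks.
        logits = torch.stack([self.head(d(pooled)) for d in self.dropouts]).mean(dim=0)
        return logits.squeeze(-1)


def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss; gamma=2.0 is a common default, not from the card."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    return ((1 - p_t) ** gamma * bce).mean()


def train_step(model, ema_model, optimizer, input_ids, targets, epsilon=0.3, decay=0.999):
    optimizer.zero_grad()
    loss = focal_loss(model(input_ids), targets)
    loss.backward()  # gradients on the clean batch

    # FGM: shift the embedding matrix by epsilon along its normalized gradient.
    emb = model.embed.weight
    backup = emb.data.clone()
    norm = emb.grad.norm()
    if norm > 0:
        emb.data.add_(epsilon * emb.grad / norm)
    focal_loss(model(input_ids), targets).backward()  # accumulate adversarial gradients
    emb.data.copy_(backup)  # restore the original embeddings
    optimizer.step()

    # EMA: shadow = decay * shadow + (1 - decay) * current weights
    with torch.no_grad():
        for shadow, param in zip(ema_model.parameters(), model.parameters()):
            shadow.mul_(decay).add_(param, alpha=1 - decay)
    return loss.item()


torch.manual_seed(0)
model = ToyClassifier()
ema_model = copy.deepcopy(model).eval()  # EMA copy is the one used for evaluation
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
input_ids = torch.randint(0, 50, (4, 6))
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss_value = train_step(model, ema_model, optimizer, input_ids, targets)
```

In this sketch the optimizer step uses the sum of clean and adversarial gradients, which is the usual FGM recipe. Evaluation would then run on `ema_model` rather than on the raw weights.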

Training

Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:

| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |
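One plausible way to realize source-weighted sampling with the weights in the table above is PyTorch's `WeightedRandomSampler`. The sketch below assumes each weight acts as a per-example sampling multiplier and uses the approximate sample counts from the table. The actual training pipeline may implement the weighting differently.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# (weight, approximate sample count) per source, from the table above.
SOURCES = {
    "ToxicChat": (4.0, 4_000),
    "WildGuardTrain": (1.0, 10_000),
    "Jigsaw Civil Comments": (0.5, 7_000),
    "BeaverTails": (1.5, 2_200),
    "Hard negatives (Claude)": (1.2, 10_000),
}

# Assumption: each weight is a per-example sampling multiplier, so a source's
# expected share of draws is weight * count divided by the total.
total_mass = sum(w * n for w, n in SOURCES.values())
expected_share = {name: w * n / total_mass for name, (w, n) in SOURCES.items()}

# One weight per training example, laid out in source order.
example_weights = torch.cat(
    [torch.full((n,), w, dtype=torch.double) for w, n in SOURCES.values()]
)
sampler = WeightedRandomSampler(
    example_weights, num_samples=len(example_weights), replacement=True
)
```

Under this assumption ToxicChat would account for about 36% of sampled examples while making up only about 12% of the pool, which matches its role as the anchor benchmark signal. The `sampler` can be passed to a `DataLoader` over a dataset concatenated in the same source order.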

Checkpoints are selected on validation F1 only, so no test data leaks into model selection.

Limitations

  • Low-resource categories (violence, hate, sexual) have 0 F1: fewer than 200 training samples per category is insufficient, even with class-balanced loss
  • WildGuardBench generalization is weak: encoder-only models struggle with adversarial jailbreak rephrasing
  • Conservative on out-of-distribution inputs: high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning

These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.

License

MIT
