TinySafe v2
141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
Successor to TinySafe v1 (71M params, 59% TC F1). v2 improves ToxicChat F1 by +19 points while cutting OR-Bench false positive rate from 18.9% to 3.8%.
GitHub: jdleo/tinysafe-2
Benchmarks
| Benchmark | TinySafe v2 | TinySafe v1 |
|---|---|---|
| ToxicChat F1 | 78.2% | 59.2% |
| OR-Bench FPR | 3.8% | 18.9% |
| WildGuardBench F1 | 62.7% | 75.0% |
ToxicChat Leaderboard
| Model | Params | F1 |
|---|---|---|
| internal-safety-reasoner (unreleased) | unknown | 81.3% |
| gpt-5-thinking (unreleased) | unknown | 81.0% |
| gpt-oss-safeguard-20b (unreleased) | 21B (3.6B*) | 79.9% |
| gpt-oss-safeguard-120b | 117B (5.1B*) | 79.3% |
| Toxic Prompt RoBERTa | 125M | 78.7% |
| TinySafe v2 | 141M | 78.2% |
| Qwen3Guard-8B | 8B | 73% |
| AprielGuard-8B | 8B | 72% |
| Granite Guardian-8B | 8B | 71% |
| WildGuard | 7B | 70.8% |
| Granite Guardian-3B | 3B | 68% |
| ShieldGemma-2B | 2B | 67% |
| Qwen3Guard-0.6B | 0.6B | 63% |
| TinySafe v1 | 71M | 59% |
* = active params (MoE)
OR-Bench (Over-Refusal)
| Model | FPR |
|---|---|
| TinySafe v2 | 3.8% |
| WildGuard-7B | ~10% |
| TinySafe v1 | 18.9% |
Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
Quickstart
import torch
from transformers import DebertaV2Tokenizer
# Load
tokenizer = DebertaV2Tokenizer.from_pretrained("jdleo1/tinysafe-2")
model = torch.load("model.pt", map_location="cpu") # or load from checkpoint
# Inference
text = "how do i make a bomb"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
binary_logits, category_logits = model(inputs["input_ids"], inputs["attention_mask"])
unsafe_score = torch.sigmoid(binary_logits).item()
print(f"Unsafe: {unsafe_score:.3f}") # 0.998
Architecture
DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification heads:
- Binary head: single logit (safe/unsafe)
- Category head: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
Training enhancements:
- FGM adversarial training (epsilon=0.3): perturbs embeddings for robustness
- EMA (decay=0.999): smoothed weight averaging for stable eval
- Multi-sample dropout (5 masks): averaged logits across dropout samples
- DualHeadLossV2: focal loss (binary) + asymmetric class-balanced loss (categories)
Training
Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:
| Source | Weight | Samples | Purpose |
|---|---|---|---|
| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
| Hard negatives (Claude) | 1.2x | ~10K | FPR control |
Model selection on val F1 only (no test set leakage).
Limitations
- Low-resource categories (violence, hate, sexual) have 0 F1 -- <200 training samples per category is insufficient even with class-balanced loss
- WildGuardBench generalization is weak -- encoder-only models struggle with adversarial jailbreak rephrasing
- Conservative on out-of-distribution inputs -- high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning
These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.
License
MIT
- Downloads last month
- 37
Datasets used to train jdleo1/tinysafe-2
Evaluation results
- F1 (Binary) on ToxicChattest set self-reported0.782
- Unsafe Recall on ToxicChattest set self-reported0.798
- Unsafe Precision on ToxicChattest set self-reported0.767
- Safe Accuracy (1 - FPR) on OR-Benchself-reported0.962