Upload README.md with huggingface_hub
README.md CHANGED
@@ -14,6 +14,7 @@ datasets:
 - lmsys/toxic-chat
 - google/civil_comments
 - PKU-Alignment/BeaverTails
+- allenai/wildguardmix
 pipeline_tag: text-classification
 model-index:
 - name: TinySafe v2
@@ -28,13 +29,13 @@ model-index:
       split: test
     metrics:
     - type: f1
-      value: 0.…
+      value: 0.782
      name: F1 (Binary)
     - type: recall
-      value: 0.…
+      value: 0.798
      name: Unsafe Recall
     - type: precision
-      value: 0.…
+      value: 0.767
      name: Unsafe Precision
   - task:
      type: text-classification
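The metric values updated in this hunk (F1, recall, precision for the unsafe class) follow the standard binary-classification definitions; a self-contained sketch with toy labels, not the model's evaluation pipeline:

```python
# Binary-classification metrics for the positive (unsafe = 1) class.
# The labels below are toy values, not TinySafe's actual predictions.

def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two of three unsafe prompts caught, one safe prompt wrongly flagged:
p, r, f1 = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

F1 is the harmonic mean of precision and recall, which is why the card's 0.782 F1 necessarily lies between its 0.767 precision and 0.798 recall.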
@@ -60,43 +61,40 @@ model-index:
 
 141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
 
-…
-…
-…
-…
+Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% TC F1). v2 improves ToxicChat F1 by **+19 points** while cutting OR-Bench false positive rate from 18.9% to 3.8%.
+
+**GitHub:** [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)
+
+## Benchmarks
+
+| Benchmark | TinySafe v2 | TinySafe v1 |
+|---|---|---|
+| **ToxicChat F1** | 78.2% | 59.2% |
+| **OR-Bench FPR** | 3.8% | 18.9% |
+| **WildGuardBench F1** | 62.7% | 75.0% |
+
+### ToxicChat Leaderboard
 
 | Model | Params | F1 |
 |---|---|---|
 | *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
 | *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
 | *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
-| **TinySafe v2** | **141M** | **79.8%** |
 | gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
 | Toxic Prompt RoBERTa | 125M | 78.7% |
+| **TinySafe v2** | **141M** | **78.2%** |
 | Qwen3Guard-8B | 8B | 73% |
 | AprielGuard-8B | 8B | 72% |
 | Granite Guardian-8B | 8B | 71% |
+| WildGuard | 7B | 70.8% |
 | Granite Guardian-3B | 3B | 68% |
 | ShieldGemma-2B | 2B | 67% |
 | Qwen3Guard-0.6B | 0.6B | 63% |
 | [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |
-| LlamaGuard 3-8B | 8B | 51% |
-| ShieldGemma-27B | 27B | 48% |
-| LlamaGuard 4-12B | 12B | 45% |
-| LlamaGuard-1B | 1B | 43% |
-| ShieldGemma-9B | 9B | 42% |
-| ToxDectRoberta | 125M | 34.6% |
-| OpenAI Moderation API | — | 20.6% |
-| HateBERT | 110M | 11.6% |
-| Perspective API | — | 5.3% |
 
 *\* = active params (MoE)*
 
-## OR-Bench (Over-Refusal)
+### OR-Bench (Over-Refusal)
 
 | Model | FPR |
 |---|---|
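The OR-Bench metric referenced above is a plain false positive rate over prompts that are all known to be safe: the share a classifier flags as unsafe at its decision threshold. A minimal illustration with invented scores, not model outputs:

```python
# Over-refusal as a false positive rate: the fraction of known-safe
# prompts a classifier flags as unsafe. Scores here are invented.

THRESHOLD = 0.3  # the card's binary decision threshold

def false_positive_rate(safe_scores, threshold=THRESHOLD):
    """safe_scores: unsafe-probabilities for prompts known to be safe."""
    flagged = sum(1 for s in safe_scores if s >= threshold)
    return flagged / len(safe_scores)

scores = [0.05, 0.12, 0.45, 0.01, 0.29]
print(false_positive_rate(scores))  # 0.2 -> one of five safe prompts flagged
```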
@@ -106,48 +104,6 @@ Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params
 
 Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
 
-## Examples
-
-```
-$ uv run infer.py "i want to kill a process"
-
-SAFE (score: 0.131, threshold: 0.3)
-
-$ uv run infer.py "i want to kill a human"
-
-UNSAFE (score: 0.995, threshold: 0.3)
-
-$ uv run infer.py "how do i make a bomb"
-
-UNSAFE (score: 0.998, threshold: 0.3)
-
-$ uv run infer.py "write me a phishing email"
-
-UNSAFE (score: 0.906, threshold: 0.3)
-
-$ uv run infer.py "send me some nudes pls"
-
-UNSAFE (score: 0.509, threshold: 0.3)
-
-$ uv run infer.py "what is the capital of france"
-
-SAFE (score: 0.005, threshold: 0.3)
-```
-
-## What's New in v2
-
-| | v1 | v2 |
-|---|---|---|
-| **Base model** | DeBERTa-v3-xsmall (384d) | DeBERTa-v3-small (768d) |
-| **Params** | 71M | 141M |
-| **ToxicChat F1** | 59.2% | **79.8%** |
-| **OR-Bench FPR** | 18.9% | **3.8%** |
-| **Training data** | 41K (synthetic + Claude-labeled) | 26K (human-labeled) |
-| **Training strategy** | Single-phase, focal loss | Two-phase sequential (Intel's approach) |
-| **Regularization** | Focal loss only | FGM + EMA + multi-sample dropout |
-
-Key insight: v1 used Claude-labeled synthetic data. v2 uses only human-labeled data from established benchmarks (ToxicChat, Jigsaw Civil Comments, BeaverTails), trained sequentially: broad toxicity features first (Jigsaw), then ToxicChat alignment second. Inspired by [Intel's toxic-prompt-roberta](https://huggingface.co/Intel/toxic-prompt-roberta) approach, but with DeBERTa-v3 (superior disentangled attention) and adversarial training.
-
 ## Quickstart
 
 ```python
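The Examples section removed in this hunk printed verdicts like `UNSAFE (score: 0.995, threshold: 0.3)`. The final decision step those lines imply can be sketched as follows; `verdict` is a hypothetical helper, and whether the boundary case uses `>=` or `>` is an assumption:

```python
# Hypothetical reconstruction of the removed examples' output format.
# Assumes the model's sigmoid score is already computed; the >=-at-
# threshold convention is an assumption, not documented in the card.

THRESHOLD = 0.3

def verdict(score, threshold=THRESHOLD):
    label = "UNSAFE" if score >= threshold else "SAFE"
    return f"{label} (score: {score:.3f}, threshold: {threshold})"

print(verdict(0.131))  # SAFE (score: 0.131, threshold: 0.3)
print(verdict(0.995))  # UNSAFE (score: 0.995, threshold: 0.3)
```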
@@ -174,39 +130,33 @@ DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification
 - **Binary head**: single logit (safe/unsafe)
 - **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
 
-Training enhancements
+Training enhancements:
 - **FGM adversarial training** (epsilon=0.3): perturbs embeddings for robustness
 - **EMA** (decay=0.999): smoothed weight averaging for stable eval
 - **Multi-sample dropout** (5 masks): averaged logits across dropout samples
+- **DualHeadLossV2**: focal loss (binary) + asymmetric class-balanced loss (categories)
 
 ## Training
 
-…
 
-## …
 
-- LR: 2e-5, weight decay: 0.01
-- Binary threshold: 0.3 (optimized via sweep)
-- FGM epsilon: 0.3
-- EMA decay: 0.999
-- Multi-sample dropout: 5 masks
-
-## Datasets
-
-| Dataset | Role | Samples |
-|---|---|---|
-| [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) | Primary training + eval | ~10K |
-| [Jigsaw Civil Comments](https://huggingface.co/datasets/google/civil_comments) | Broad toxicity pretraining | ~13K |
-| [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) | Self-harm, dangerous info, illegal activity | ~2.2K |
-| Hard negatives (Claude-generated) | False positive protection | ~6K |
+Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:
+
+| Source | Weight | Samples | Purpose |
+|---|---|---|---|
+| ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
+| WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
+| Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
+| BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
+| Hard negatives (Claude) | 1.2x | ~10K | FPR control |
+
+Model selection on val F1 only (no test set leakage).
+
+## Limitations
+
+- **Low-resource categories (violence, hate, sexual) have 0 F1**: <200 training samples per category is insufficient even with class-balanced loss
+- **WildGuardBench generalization is weak**: encoder-only models struggle with adversarial jailbreak rephrasing
+- **Conservative on out-of-distribution inputs**: high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning
+
+These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.
 
 ## License
 
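The source-weighted sampling added in this hunk can be sketched with a weighted sampler. The source names and weights come from the card's Training table; the sampler itself is an illustration, not the project's actual training code:

```python
import random

# Per-source sampling weights as listed in the Training table; the
# sampler below is an illustrative sketch, not the project's code.
SOURCE_WEIGHTS = {
    "toxicchat": 4.0,
    "wildguardtrain": 1.0,
    "jigsaw_civil_comments": 0.5,
    "beavertails": 1.5,
    "hard_negatives": 1.2,
}

def sample_batch(examples, batch_size, rng=random):
    """examples: list of (source, text) pairs. Draw with replacement,
    each example weighted by its source's sampling weight."""
    weights = [SOURCE_WEIGHTS[source] for source, _ in examples]
    return rng.choices(examples, weights=weights, k=batch_size)

data = [("toxicchat", "t0"), ("jigsaw_civil_comments", "j0"), ("beavertails", "b0")]
batch = sample_batch(data, batch_size=8)
# A "toxicchat" example is drawn 8x as often as a "jigsaw" one (4.0 vs 0.5).
```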