jdleo1 committed on
Commit fa2b831 · verified · 1 Parent(s): d80013a

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +33 -83
README.md CHANGED
@@ -14,6 +14,7 @@ datasets:
  - lmsys/toxic-chat
  - google/civil_comments
  - PKU-Alignment/BeaverTails
  pipeline_tag: text-classification
  model-index:
  - name: TinySafe v2
@@ -28,13 +29,13 @@ model-index:
  split: test
  metrics:
  - type: f1
- value: 0.7977
  name: F1 (Binary)
  - type: recall
- value: 0.7983
  name: Unsafe Recall
  - type: precision
- value: 0.7666
  name: Unsafe Precision
  - task:
  type: text-classification
@@ -60,43 +61,40 @@ model-index:
 
  141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
 
- **#1 open-source safety classifier on ToxicChat.** Beats every open model including 8B+ guard models, and sits behind only unreleased OpenAI internal models.
 
- Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% TC F1). v2 improves ToxicChat F1 by **+20.5 points** while cutting OR-Bench false positive rate from 18.9% to 3.8%.
 
- **Model on HuggingFace:** [jdleo1/tinysafe-2](https://huggingface.co/jdleo1/tinysafe-2)
 
- ## ToxicChat F1
 
  | Model | Params | F1 |
  |---|---|---|
  | *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
  | *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
  | *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
- | **TinySafe v2** | **141M** | **79.8%** |
  | gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
  | Toxic Prompt RoBERTa | 125M | 78.7% |
- | WildGuard | 7B | 70.8% |
  | Qwen3Guard-8B | 8B | 73% |
  | AprielGuard-8B | 8B | 72% |
  | Granite Guardian-8B | 8B | 71% |
  | Granite Guardian-3B | 3B | 68% |
  | ShieldGemma-2B | 2B | 67% |
  | Qwen3Guard-0.6B | 0.6B | 63% |
  | [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |
- | LlamaGuard 3-8B | 8B | 51% |
- | ShieldGemma-27B | 27B | 48% |
- | LlamaGuard 4-12B | 12B | 45% |
- | LlamaGuard-1B | 1B | 43% |
- | ShieldGemma-9B | 9B | 42% |
- | ToxDectRoberta | 125M | 34.6% |
- | OpenAI Moderation API | — | 20.6% |
- | HateBERT | 110M | 11.6% |
- | Perspective API | — | 5.3% |
 
  *\* = active params (MoE)*
 
- ## OR-Bench (Over-Refusal)
 
  | Model | FPR |
  |---|---|
@@ -106,48 +104,6 @@ Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params
 
  Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
 
- ## Examples
-
- ```
- $ uv run infer.py "i want to kill a process"
-
- SAFE (score: 0.131, threshold: 0.3)
-
- $ uv run infer.py "i want to kill a human"
-
- UNSAFE (score: 0.995, threshold: 0.3)
-
- $ uv run infer.py "how do i make a bomb"
-
- UNSAFE (score: 0.998, threshold: 0.3)
-
- $ uv run infer.py "write me a phishing email"
-
- UNSAFE (score: 0.906, threshold: 0.3)
-
- $ uv run infer.py "send me some nudes pls"
-
- UNSAFE (score: 0.509, threshold: 0.3)
-
- $ uv run infer.py "what is the capital of france"
-
- SAFE (score: 0.005, threshold: 0.3)
- ```
-
- ## What's New in v2
-
- | | v1 | v2 |
- |---|---|---|
- | **Base model** | DeBERTa-v3-xsmall (384d) | DeBERTa-v3-small (768d) |
- | **Params** | 71M | 141M |
- | **ToxicChat F1** | 59.2% | **79.8%** |
- | **OR-Bench FPR** | 18.9% | **3.8%** |
- | **Training data** | 41K (synthetic + Claude-labeled) | 26K (human-labeled) |
- | **Training strategy** | Single-phase, focal loss | Two-phase sequential (Intel's approach) |
- | **Regularization** | Focal loss only | FGM + EMA + multi-sample dropout |
-
- Key insight: v1 used Claude-labeled synthetic data. v2 uses only human-labeled data from established benchmarks (ToxicChat, Jigsaw Civil Comments, BeaverTails), trained sequentially: broad toxicity features first (Jigsaw), then ToxicChat alignment second. Inspired by [Intel's toxic-prompt-roberta](https://huggingface.co/Intel/toxic-prompt-roberta) approach, but with DeBERTa-v3 (superior disentangled attention) and adversarial training.
-
  ## Quickstart
 
  ```python
  # …
  ```
@@ -174,39 +130,33 @@ DeBERTa-v3-small (6 transformer layers, 768 hidden dim) with dual classification
  - **Binary head**: single logit (safe/unsafe)
  - **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
 
- Training enhancements over vanilla fine-tuning:
  - **FGM adversarial training** (epsilon=0.3): perturbs embeddings for robustness
  - **EMA** (decay=0.999): smoothed weight averaging for stable eval
  - **Multi-sample dropout** (5 masks): averaged logits across dropout samples
 
  ## Training
 
- Two-phase sequential fine-tuning:
 
- 1. **Phase 1 — Broad toxicity** (3 epochs, LR=2e-5): Jigsaw Civil Comments + BeaverTails + hard negatives (~21K samples). Learns general toxicity features.
- 2. **Phase 2 — ToxicChat alignment** (5 epochs, LR=2e-5): ToxicChat + hard negatives (~10K samples). Aligns decision boundary to ToxicChat's definition.
 
- Hard negatives are safe-but-edgy prompts generated via Claude to protect against false positives.
 
- ## Config
 
- All hyperparameters in `configs/config.json`:
 
- - Batch size: 32
- - LR: 2e-5, weight decay: 0.01
- - Binary threshold: 0.3 (optimized via sweep)
- - FGM epsilon: 0.3
- - EMA decay: 0.999
- - Multi-sample dropout: 5 masks
-
- ## Datasets
-
- | Dataset | Role | Samples |
- |---|---|---|
- | [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) | Primary training + eval | ~10K |
- | [Jigsaw Civil Comments](https://huggingface.co/datasets/google/civil_comments) | Broad toxicity pretraining | ~13K |
- | [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) | Self-harm, dangerous info, illegal activity | ~2.2K |
- | Hard negatives (Claude-generated) | False positive protection | ~6K |
 
  ## License
 
  - lmsys/toxic-chat
  - google/civil_comments
  - PKU-Alignment/BeaverTails
+ - allenai/wildguardmix
  pipeline_tag: text-classification
  model-index:
  - name: TinySafe v2
 
  split: test
  metrics:
  - type: f1
+ value: 0.782
  name: F1 (Binary)
  - type: recall
+ value: 0.798
  name: Unsafe Recall
  - type: precision
+ value: 0.767
  name: Unsafe Precision
  - task:
  type: text-classification
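As a quick sanity check on the updated model-index values, the new F1 is the harmonic mean of the unsafe precision and recall listed above (a sketch; the variable names are illustrative):

```python
# Consistency check: F1 (Binary) should be the harmonic mean of the
# Unsafe Precision and Unsafe Recall values reported in the model-index.
precision, recall = 0.767, 0.798
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.782, matching the updated F1 (Binary) value
```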
 
 
  141M parameter safety classifier built on DeBERTa-v3-small. Binary safe/unsafe classification with 7-category multi-label head (violence, hate, sexual, self-harm, dangerous info, harassment, illegal activity).
 
+ Successor to [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M params, 59% TC F1). v2 improves ToxicChat F1 by **+19 points** while cutting OR-Bench false positive rate from 18.9% to 3.8%.
 
+ **GitHub:** [jdleo/tinysafe-2](https://github.com/jdleo/tinysafe-2)
 
+ ## Benchmarks
 
+ | Benchmark | TinySafe v2 | TinySafe v1 |
+ |---|---|---|
+ | **ToxicChat F1** | 78.2% | 59.2% |
+ | **OR-Bench FPR** | 3.8% | 18.9% |
+ | **WildGuardBench F1** | 62.7% | 75.0% |
+
+ ### ToxicChat Leaderboard
 
  | Model | Params | F1 |
  |---|---|---|
  | *internal-safety-reasoner (unreleased)* | *unknown* | *81.3%* |
  | *gpt-5-thinking (unreleased)* | *unknown* | *81.0%* |
  | *gpt-oss-safeguard-20b (unreleased)* | *21B (3.6B\*)* | *79.9%* |
  | gpt-oss-safeguard-120b | 117B (5.1B\*) | 79.3% |
  | Toxic Prompt RoBERTa | 125M | 78.7% |
+ | **TinySafe v2** | **141M** | **78.2%** |
  | Qwen3Guard-8B | 8B | 73% |
  | AprielGuard-8B | 8B | 72% |
  | Granite Guardian-8B | 8B | 71% |
+ | WildGuard | 7B | 70.8% |
  | Granite Guardian-3B | 3B | 68% |
  | ShieldGemma-2B | 2B | 67% |
  | Qwen3Guard-0.6B | 0.6B | 63% |
  | [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) | 71M | 59% |
 
  *\* = active params (MoE)*
 
+ ### OR-Bench (Over-Refusal)
 
  | Model | FPR |
  |---|---|
 
  Lower is better. On 80K safe prompts, TinySafe v2 incorrectly flags only 3.8%.
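For clarity on this metric: the OR-Bench prompts are all safe, so every flag counts as a false positive. A minimal sketch with stand-in predictions (not the model's actual outputs):

```python
def false_positive_rate(flagged: list[bool]) -> float:
    """FPR over an all-safe benchmark: every flagged prompt is a false positive."""
    return sum(flagged) / len(flagged)

# Stand-in predictions: 38 of 1000 safe prompts flagged as unsafe.
preds = [True] * 38 + [False] * 962
print(f"{false_positive_rate(preds):.1%}")
```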
 
  ## Quickstart
 
  ```python
  # …
  ```
 
  - **Binary head**: single logit (safe/unsafe)
  - **Category head**: 7-way multi-label (violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity)
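A sketch of how the two heads' logits could be turned into a verdict. The logit values are made up, and the 0.3 binary threshold is taken from the previous revision's config section, so treat both as assumptions rather than the card's actual inference code:

```python
import math

CATEGORIES = ["violence", "hate", "sexual", "self_harm",
              "dangerous_info", "harassment", "illegal_activity"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def classify(binary_logit: float, category_logits: list[float],
             threshold: float = 0.3) -> dict:
    # Binary head: one logit, turned into a safe/unsafe score vs. the threshold.
    score = sigmoid(binary_logit)
    # Category head: independent sigmoids (multi-label, not softmax).
    cat_probs = {c: sigmoid(z) for c, z in zip(CATEGORIES, category_logits)}
    return {
        "label": "UNSAFE" if score >= threshold else "SAFE",
        "score": round(score, 3),
        "categories": [c for c, p in cat_probs.items() if p >= 0.5],
    }

# Made-up logits for illustration only.
print(classify(2.0, [3.0, -2.0, -3.0, -4.0, -1.0, -2.0, -3.0]))
```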
 
+ Training enhancements:
  - **FGM adversarial training** (epsilon=0.3): perturbs embeddings for robustness
  - **EMA** (decay=0.999): smoothed weight averaging for stable eval
  - **Multi-sample dropout** (5 masks): averaged logits across dropout samples
+ - **DualHeadLossV2**: focal loss (binary) + asymmetric class-balanced loss (categories)
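The EMA bullet above can be sketched on plain floats with the listed decay of 0.999; this is illustrative only, not the actual training code:

```python
class EMA:
    """Exponential moving average of parameters, evaluated instead of raw weights."""
    def __init__(self, params: dict, decay: float = 0.999):
        self.decay = decay
        self.shadow = dict(params)  # copy of the initial weights

    def update(self, params: dict) -> None:
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in params.items():
            self.shadow[name] = (self.decay * self.shadow[name]
                                 + (1 - self.decay) * value)

params = {"w": 0.0}
ema = EMA(params)
for _ in range(1000):
    params["w"] = 1.0  # pretend each training step moves the weight to 1.0
    ema.update(params)

# After 1000 steps the shadow weight is 1 - 0.999**1000, about 0.632:
print(round(ema.shadow["w"], 3))
```

The shadow weights lag the raw weights, which is what smooths out step-to-step noise at eval time.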
 
  ## Training
 
+ Single-phase unified fine-tuning (5 epochs, LR=2e-5) with source-weighted sampling:
 
+ | Source | Weight | Samples | Purpose |
+ |---|---|---|---|
+ | ToxicChat | 4.0x | ~4K | Anchor benchmark signal |
+ | WildGuardTrain | 1.0x | ~10K | Adversarial/jailbreak coverage |
+ | Jigsaw Civil Comments | 0.5x | ~7K | General toxicity diversity |
+ | BeaverTails | 1.5x | ~2.2K | Behavior-value alignment |
+ | Hard negatives (Claude) | 1.2x | ~10K | FPR control |
 
+ Model selection on val F1 only (no test set leakage).
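One way to read the weight column: each example's draw probability is proportional to its source weight. A sketch under that assumption, with toy pool sizes (the card does not show the actual sampler):

```python
import random

# Source weights copied from the table above; pool sizes are toy values.
WEIGHTS = {"ToxicChat": 4.0, "WildGuardTrain": 1.0, "Jigsaw": 0.5,
           "BeaverTails": 1.5, "Hard negatives": 1.2}

# Flat pool of (source, example_id) pairs, 100 stand-in examples per source.
pool = [(source, i) for source in WEIGHTS for i in range(100)]
probs = [WEIGHTS[source] for source, _ in pool]

random.seed(0)
batch = random.choices(pool, weights=probs, k=32)  # one sampled training batch
# A ToxicChat example is drawn ~4x as often as a WildGuardTrain example.
```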
 
+ ## Limitations
 
+ - **Low-resource categories (violence, hate, sexual) have 0 F1**: fewer than 200 training samples per category is insufficient even with class-balanced loss
+ - **WildGuardBench generalization is weak**: encoder-only models struggle with adversarial jailbreak rephrasing
+ - **Conservative on out-of-distribution inputs**: high precision but lower recall suggests the model learned narrow patterns rather than general safety reasoning
 
+ These are fundamental limitations of encoder-only architectures for safety classification. v3 will move to a small LLM (1-3B) to enable reasoning over intent rather than pattern matching over surface features.
 
  ## License