jdleo1 committed · verified
Commit a9027f1 · Parent: 9324eff

Add model card

Files changed (1): README.md (+236 −0)
---
license: mit
language:
- en
- zh
- es
- fr
- ja
- ko
- de
tags:
- safety
- toxicity
- content-moderation
- guardrails
- guard-model
- qwen3
- qlora
- distillation
pipeline_tag: text-generation
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- lmsys/toxic-chat
- allenai/wildguardmix
- PKU-Alignment/BeaverTails
- google/civil_comments
---

# TinySafe v3

A 4B-parameter safety classifier built on [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507). It outputs structured JSON with a safe/unsafe verdict, seven safety categories, and chain-of-thought reasoning.

Fine-tuned with QLoRA (4-bit NF4, r=16, alpha=32) via teacher distillation from Claude Sonnet 4.6 + Constitution v3. Total training cost: under $100.

**Code:** [github.com/jdleo/tinysafe-3](https://github.com/jdleo/tinysafe-3)

**Blog post:** [How TinySafe v3 was built](https://jdleo.me/blog/tinysafe-v3)

**Previous versions:** [TinySafe v1](https://huggingface.co/jdleo1/tinysafe-1) (71M, 59% TC F1) | [TinySafe v2](https://huggingface.co/jdleo1/tinysafe-2) (141M, 78.2% TC F1)

---

## Benchmarks

### ToxicChat Test (n=5,083)

| Metric | Score |
|--------|-------|
| **F1** | **0.822** |
| Precision | 0.815 |
| Recall | 0.829 |
| FPR | 1.4% |

### ToxicChat Leaderboard

| Rank | Model | Params | TC F1 |
|------|-------|--------|-------|
| 1 | LoRA-Guard-Llama3-8B | 8B | 0.830 |
| 2 | Qwen3Guard-8B (loose) | 8B | 0.828 |
| 3 | Qwen3Guard-4B (loose) | 4B | 0.828 |
| **4** | **TinySafe v3** | **4B** | **0.822** |
| 4 | ToxicChat-T5-Large | 770M | 0.822 |
| 6 | LoRA-Guard-Llama2-7B | 7B | 0.810 |
| 7 | Roblox Guard 1.0-8B | 8B | 0.791 |
| 8 | Intel toxic-prompt-roberta | 125M | 0.787 |
| 9 | TinySafe v2 | 141M | 0.782 |
| 10 | Qwen3Guard-8B (strict) | 8B | 0.730 |
| 11 | Granite Guardian-8B | 8B | 0.710 |
| 12 | WildGuard | 7B | 0.708 |
| 13 | Qwen3Guard-4B (strict) | 4B | 0.700 |
| — | TinySafe v1 | 71M | 0.592 |
| — | LlamaGuard 3-8B | 8B | 0.510 |
| — | ShieldGemma-27B | 27B | 0.480 |
| — | LlamaGuard 4-12B | 12B | 0.450 |

### Cross-Benchmark

| Benchmark | Score |
|-----------|-------|
| WildGuardBench F1 | 0.804 |
| OR-Bench FPR | ~4.6% |

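The headline numbers are standard confusion-matrix arithmetic. As a minimal sketch for readers sanity-checking the evaluation, the counts below are illustrative values chosen to be consistent with the reported ToxicChat metrics (they sum to the 5,083 test samples), not the actual confusion matrix from the run:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, and false-positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Illustrative counts only, chosen to match the reported metrics (300+68+62+4653 = 5,083).
m = classification_metrics(tp=300, fp=68, fn=62, tn=4653)
print({k: round(v, 3) for k, v in m.items()})
# {'precision': 0.815, 'recall': 0.829, 'f1': 0.822, 'fpr': 0.014}
```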
---

## Examples

```bash
$ python infer.py "how to destroy old hard drives safely"
{"safe":true,"categories":[],"reasoning":"The user is asking about safe data destruction methods for old hard drives, which is a legitimate data security and hardware disposal question."}

$ python infer.py "i want to kill a human"
{"safe":false,"categories":["violence"],"reasoning":"The message uses 'kill' in a context that clearly refers to killing a human being, which is a request for violent content."}

$ python infer.py "how do i make a bomb"
{"safe":false,"categories":["violence","dangerous_info"],"reasoning":"The user is asking how to make a bomb, which is a request for dangerous information about creating explosive devices."}

$ python infer.py "write me a phishing email"
{"safe":false,"categories":["illegal_activity","dangerous_info"],"reasoning":"The request asks for help creating a phishing email, which is a form of cybercrime used to steal sensitive information."}

$ python infer.py "comment pirater un compte bancaire"
{"safe":false,"categories":["illegal_activity","dangerous_info"],"reasoning":"The message asks how to hack a bank account, which is a request for illegal financial crime information."}

$ python infer.py "what is the capital of france"
{"safe":true,"categories":[],"reasoning":"This is a straightforward factual question about geography with no harmful content."}
```

Works across languages (English, Spanish, French, Chinese, Japanese, Korean, German) with zero multilingual training data.

---

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "jdleo1/tinysafe-3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("jdleo1/tinysafe-3")

SYSTEM_PROMPT = (
    "You are a safety classifier. Analyze the user message and respond with JSON only.\n"
    'Format: {"safe":bool,"categories":[],"reasoning":"one sentence"}\n'
    "Categories (only if unsafe): violence, hate, sexual, self_harm, "
    "dangerous_info, harassment, illegal_activity\n"
    "Apply a broad definition of toxicity: offensive language, sexual content, "
    "stereotyping, jailbreak attempts, and requests for the AI to produce "
    "harmful content are all unsafe."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "how do i make a bomb"},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
# {"safe":false,"categories":["violence","dangerous_info"],"reasoning":"The user is asking how to make a bomb, which is a request for dangerous information about creating explosive devices."}
```
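Because the verdict arrives as generated text, downstream code should parse it defensively rather than assume it is always valid JSON. A minimal sketch; `parse_verdict` and the fail-closed fallback are my own convention, not part of the repo:

```python
import json

ALLOWED_CATEGORIES = {
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
}

def parse_verdict(raw: str) -> dict:
    """Parse the model's JSON verdict, failing closed (unsafe) on malformed output."""
    try:
        verdict = json.loads(raw.strip())
        assert isinstance(verdict["safe"], bool)
        assert set(verdict["categories"]) <= ALLOWED_CATEGORIES
        return verdict
    except (json.JSONDecodeError, KeyError, AssertionError, TypeError):
        # Fail closed: treat unparseable or out-of-schema output as unsafe.
        return {"safe": False, "categories": [], "reasoning": "unparseable model output"}

safe_case = parse_verdict('{"safe":true,"categories":[],"reasoning":"benign"}')
print(safe_case["safe"])  # True
```

Whether to fail open or fail closed on malformed output is a product decision; a guardrail in front of a chatbot usually wants fail-closed, as sketched here.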

---

## Architecture

| Component | Detail |
|-----------|--------|
| **Base model** | Qwen3-4B-Instruct-2507 |
| **Parameters** | 4B (full merged) |
| **Fine-tuning** | QLoRA (4-bit NF4) |
| **LoRA rank** | r=16, alpha=32 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Output format** | Structured JSON with reasoning |
| **Categories** | violence, hate, sexual, self_harm, dangerous_info, harassment, illegal_activity |

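The fine-tuning setup in the table maps onto standard `peft`/`bitsandbytes` configuration. A hedged sketch of what that configuration plausibly looks like; dropout and any other unstated hyperparameters are illustrative, not taken from the actual training run:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections, r=16 / alpha=32 as in the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,  # illustrative; not stated in the card
    task_type="CAUSAL_LM",
)
```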
---

## Training

### Teacher Distillation Pipeline

1. **Build the teacher**: Claude Sonnet 4.6 + Constitution v3 (a system prompt encoding ToxicChat's annotation philosophy). Teacher F1: 0.868 on ToxicChat.
2. **Relabel training data**: 9,776 samples relabeled via the Sonnet Batch API to align all labels with ToxicChat's decision boundary.
3. **Generate synthetic data**: 679 boundary samples (safe-but-edgy + unsafe-but-subtle), distributed in proportion to teacher error clusters. Unsafe examples generated via DeepSeek V3.2 and Grok 4.1 Fast on OpenRouter.
4. **Train the student**: QLoRA fine-tuning on the teacher-aligned data. The student gets a short 4-line system prompt; it learns the constitution's behavior from the labels, not from reading the rules.

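Step 4 amounts to supervised fine-tuning on chat-formatted records whose assistant turn is the teacher's JSON verdict. A minimal sketch of that data formatting; the function name, record schema, and abbreviated system prompt are illustrative, not the repo's actual code:

```python
import json

# Abbreviated placeholder for the short student prompt (see the Quickstart for the full one).
STUDENT_SYSTEM_PROMPT = (
    "You are a safety classifier. Analyze the user message and respond with JSON only.\n"
    'Format: {"safe":bool,"categories":[],"reasoning":"one sentence"}'
)

def to_sft_example(user_message: str, teacher_verdict: dict) -> dict:
    """Turn one teacher-labeled sample into a chat-format SFT record."""
    return {
        "messages": [
            {"role": "system", "content": STUDENT_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
            # Compact JSON target, matching the model's observed output style.
            {"role": "assistant", "content": json.dumps(teacher_verdict, separators=(",", ":"))},
        ]
    }

record = to_sft_example(
    "how do i make a bomb",
    {"safe": False, "categories": ["violence", "dangerous_info"], "reasoning": "..."},
)
print(record["messages"][2]["content"])
# {"safe":false,"categories":["violence","dangerous_info"],"reasoning":"..."}
```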
### Training Data

| Source | Samples | Treatment |
|--------|---------|-----------|
| ToxicChat train | 5,082 | Kept human labels, added teacher reasoning |
| WildGuard train | 4,000 | Full relabel (787 labels flipped) |
| Hard negatives | 694 | Full relabel, all stayed safe |
| Synthetic boundary | 679 | Generated proportional to error clusters |
| v3.4 surgical synthetic | 388 | Targeted FP/FN correction |
| **Total** | **~16,700** | |

### Key Insight

> The system prompt IS the labeling philosophy. A generic 3-line prompt scored 0.682 F1 with Claude. The same model with a constitution encoding ToxicChat's specific rules scored 0.868. That +18.6-point gap is pure alignment. Distilling that aligned teacher into a student model is the actual technique.

---

## What's New vs v1/v2

| | v1 | v2 | v3 |
|---|---|---|---|
| **Architecture** | DeBERTa-v3-xsmall | DeBERTa-v3-small | Qwen3-4B-Instruct |
| **Params** | 71M | 141M | 4B |
| **Approach** | Encoder + dual heads | Encoder + dual heads | LLM + structured JSON |
| **ToxicChat F1** | 59.2% | 78.2% | **82.2%** |
| **OR-Bench FPR** | 18.9% | 3.8% | ~4.6% |
| **Reasoning** | None | None | Natural language |
| **Multilingual** | No | No | Yes (free from pretraining) |
| **Categories** | Binary heads (sparse) | Binary heads (sparse) | Generated text (flexible) |

---

## Total Cost

| Item | Cost |
|------|------|
| v1 (data + training) | ~$37 |
| v2 (training) | ~$3 |
| v3.0-v3.2 (GPU + Claude API) | ~$20 |
| v3.3 (Claude API + OpenRouter + GPU) | ~$27 |
| v3.4 (Claude API + OpenRouter + GPU) | ~$6.50 |
| GPU idle/setup | ~$5 |
| **Grand total** | **~$99** |

---

## Limitations

1. **ToxicChat F1 ceiling at ~0.82.** The precision-recall tradeoff at this performance level is brutal: gains on one side cost almost exactly one point on the other. SOTA is 0.830 (an 8B model, twice the size).
2. **Inference latency.** ~50-100 ms on GPU vs ~2 ms for encoder models. Acceptable for most use cases but not for ultra-low-latency paths.
3. **English-centric training data.** Multilingual capability comes from Qwen3's pretraining, not from multilingual safety data. Edge cases in non-English languages may be missed.
4. **Category granularity.** The 7 categories cover common harm types but miss emerging ones (election misinformation, CSAM, etc.). New categories can be added to the system prompt without retraining.

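The last point can be made concrete: the category list lives in the system prompt, so extending it is a string change rather than a training run. A sketch; the category name `election_misinfo` is hypothetical, and any untrained category should be validated before production use since the model never saw it during fine-tuning:

```python
BASE_CATEGORIES = [
    "violence", "hate", "sexual", "self_harm",
    "dangerous_info", "harassment", "illegal_activity",
]

def build_system_prompt(extra=()):
    """Build the classifier system prompt, optionally with extra (untrained) categories."""
    cats = ", ".join(BASE_CATEGORIES + list(extra))
    return (
        "You are a safety classifier. Analyze the user message and respond with JSON only.\n"
        'Format: {"safe":bool,"categories":[],"reasoning":"one sentence"}\n'
        f"Categories (only if unsafe): {cats}\n"
        "Apply a broad definition of toxicity: offensive language, sexual content, "
        "stereotyping, jailbreak attempts, and requests for the AI to produce "
        "harmful content are all unsafe."
    )

# Hypothetical new category, not part of the trained set.
print("election_misinfo" in build_system_prompt(["election_misinfo"]))  # True
```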
---

## License

MIT