lennarddaw committed on
Commit
25e3a04
·
verified ·
1 Parent(s): 48071da

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +26 -120
README.md CHANGED
@@ -64,33 +64,9 @@ Rule-based severity engines fail on paraphrasing, implicit harm, and contextual
 
 ### Architecture
 
- ```
- Input: German text (max 128 tokens)
-                     |
- gbert-large Tokenizer (31K vocab, WordPiece)
-                     |
- gbert-large Encoder (24 layers, 1024-dim)
-                     |
- [CLS] token embedding (1024-dim)
-                     |
- ┌──────────── Severity Encoder ───────────────┐
- │ Linear(1024→256) + LN + GELU + Dropout(0.1) │
- │ Linear(256→128)  + LN + GELU + Dropout(0.1) │
- └─────────────────────┬───────────────────────┘
-                       |
-     ┌─────────────────┼─────────────────┐
-     |                 |                 |
- ┌───┴───┐        ┌────┴───┐        ┌────┴─────┐
- │ Score │        │ Dims   │        │ Tier     │
- │ Head  │        │ Head   │        │ Head     │
- │128→64 │        │128→64  │        │ 128→64   │
- │→1     │        │→4      │        │ →4       │
- │sigmoid│        │sigmoid │        │ (logits) │
- └───┬───┘        └────┬───┘        └────┬─────┘
-     |                 |                 |
-  severity         dimensions        tier_logits
-  [0..1]           [0..1] x 4          4-class
- ```
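The head layout removed above (severity encoder feeding score, dimensions, and tier heads) can be checked at shape level with a minimal NumPy sketch. Weights are random and the inner head activations are omitted, so this is bookkeeping only, not the released checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, d_in, d_out):
    """Random-weight stand-in for nn.Linear; shape bookkeeping only."""
    return x @ (rng.standard_normal((d_in, d_out)) * 0.02)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# [CLS] embedding from gbert-large: (batch, 1024)
cls = rng.standard_normal((2, 1024))

# Severity encoder: 1024 -> 256 -> 128 (dropout omitted at inference)
h = gelu(layer_norm(linear(cls, 1024, 256)))
h = gelu(layer_norm(linear(h, 256, 128)))

severity = sigmoid(linear(linear(h, 128, 64), 64, 1))  # score in [0, 1]
dims     = sigmoid(linear(linear(h, 128, 64), 64, 4))  # 4 dimensions in [0, 1]
tier     = linear(linear(h, 128, 64), 64, 4)           # 4-class logits

print(severity.shape, dims.shape, tier.shape)  # (2, 1) (2, 4) (2, 4)
```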
 
 | Property | Value |
 |---|---|
@@ -190,49 +166,21 @@ Golden set performance matches test set — no overfitting to easy cases.
 
 We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.
 
- ### Overall Comparison
 
- ```
- Tier F1 (weighted)
-                         0.0     0.2     0.4     0.6     0.8     1.0
-                         |       |       |       |       |       |
- HowzerSeverity (ours)   ████████████████████████████████████████ 1.000 <-- this model
- Claude Sonnet 4.6       ███████████████████████████████████████ 0.981
- Claude Opus 4.6         ███████████████████████████████████ 0.885
- Claude Haiku 4.5        ████████████████████████████████ 0.811
- mDeBERTa XNLI (0-shot)  ██████████████████ 0.456
- nlptown Stars (mapped)  █████████████████ 0.429
- German Sent. BERT (map) ███████████ 0.287
- BART MNLI (0-shot)      ████████ 0.211
- ```
 
- ```
- Tier Accuracy
-                         0%      20%     40%     60%     80%     100%
-                         |       |       |       |       |       |
- HowzerSeverity (ours)   ████████████████████████████████████████ 100.0%
- Claude Sonnet 4.6       ███████████████████████████████████████ 97.9%
- Claude Opus 4.6         ███████████████████████████████████ 87.5%
- Claude Haiku 4.5        ████████████████████████████████ 81.2%
- mDeBERTa XNLI (0-shot)  █████████████████ 43.8%
- nlptown Stars (mapped)  ███████████████ 37.5%
- German Sent. BERT (map) ██████████ 27.1%
- BART MNLI (0-shot)      ██████ 16.7%
- ```
 
- ```
- Score MAE (lower is better)
-                         0.00        0.10        0.20        0.30
-                         |           |           |           |
- Claude Sonnet 4.6       █ 0.006 (best)
- HowzerSeverity (ours)   ████ 0.030
- Claude Haiku 4.5        █████ 0.038
- Claude Opus 4.6         █████████ 0.065
- mDeBERTa XNLI (0-shot)  ████████████████████████ 0.163
- German Sent. BERT (map) ████████████████████████████ 0.190
- nlptown Stars (mapped)  ████████████████████████████████ 0.212
- BART MNLI (0-shot)      ███████████████████████████████████ 0.234
- ```
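The weighted tier F1 and accuracy reported in these charts can be reproduced from raw predictions without any ML dependencies; a stdlib-only sketch on a tiny hypothetical sample (not the benchmark data):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one tier label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class support as weights."""
    support, n = Counter(y_true), len(y_true)
    return sum(per_class_f1(y_true, y_pred, c) * k / n for c, k in support.items())

# Hypothetical toy predictions to exercise the metric
y_true = ["low"] * 4 + ["medium"] * 3 + ["high"] + ["critical"]
y_pred = ["low"] * 4 + ["medium", "medium", "high"] + ["high"] + ["critical"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# weighted F1 ≈ 0.896, accuracy = 8/9 on this toy sample
```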
 
 ### Detailed Metrics Table
 
@@ -251,19 +199,9 @@ We compared this 336M-parameter fine-tuned model against general-purpose LLMs an
 
 ### Per-Tier F1 Breakdown
 
- ```
-                  low       medium    high      critical
-               ┌─────────┬─────────┬─────────┬─────────┐
- Howzer (ours) │ 1.000   │ 1.000   │ 1.000   │ 1.000   │
- Sonnet 4.6    │ 1.000   │ 0.973   │ 0.800   │ 1.000   │
- Opus 4.6      │ 0.960   │ 0.813   │ 0.500   │ 1.000   │
- Haiku 4.5     │ 0.840   │ 0.778   │ 0.500   │ 1.000   │
- mDeBERTa XNLI │ 0.681   │ 0.273   │ 0.182   │ 0.000   │
- nlptown Stars │ 0.615   │ 0.250   │ 0.000   │ 0.353   │
- Germ. Sent.   │ 0.564   │ 0.000   │ 0.114   │ 0.000   │
- BART MNLI     │ 0.267   │ 0.174   │ 0.000   │ 0.133   │
-               └─────────┴─────────┴─────────┴─────────┘
- ```
 
 Key observations:
  - **All models struggle most with `high` tier** (only 2 samples in test set; subtle boundary between medium and high)
@@ -272,45 +210,9 @@ Key observations:
 
 ### Confusion Matrices
 
- **HowzerSeverity (ours) — 336M, fine-tuned:**
- ```
-             Predicted
-             low  med  high  crit
- True low     24    0     0     0   Perfect
- True med      0   19     0     0   Perfect
- True high     0    0     2     0   Perfect
- True crit     0    0     0     3   Perfect
- ```
-
- **Claude Sonnet 4.6 — LLM (~70B?):**
- ```
-             Predicted
-             low  med  high  crit
- True low     24    0     0     0   Perfect
- True med      0   18     1     0   1 med→high
- True high     0    0     2     0   Perfect
- True crit     0    0     0     3   Perfect
- ```
-
- **Claude Opus 4.6 — LLM (~70B?):**
- ```
-             Predicted
-             low  med  high  crit
- True low     24    0     0     0   Perfect
- True med      2   13     4     0   6 errors
- True high     0    0     2     0   Perfect
- True crit     0    0     0     3   Perfect
- ```
-
- **Claude Haiku 4.5 — LLM (~8B?):**
- ```
-             Predicted
-             low  med  high  crit
- True low     21    3     0     0   3 low→med
- True med      4   14     1     0   5 errors
- True high     1    0     1     0   1 high→low
- True crit     0    0     0     3   Perfect
- ```
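A matrix like the ones removed above is a straight tally of true tier versus predicted tier; a stdlib-only sketch (hypothetical predictions, not the benchmark runs):

```python
TIERS = ["low", "medium", "high", "critical"]

def confusion_matrix(y_true, y_pred, labels=TIERS):
    """Rows = true tier, columns = predicted tier; a perfect model is diagonal."""
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

# Hypothetical sample with one medium->low... no: one low->medium error
y_true = ["low", "low", "medium", "high", "critical"]
y_pred = ["low", "medium", "medium", "high", "critical"]
for label, row in zip(TIERS, confusion_matrix(y_true, y_pred)):
    print(f"{label:>8}: {row}")
```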
 
 ### Why LLMs Struggle with Severity Scoring
 
@@ -325,6 +227,10 @@ Even the best LLMs make systematic errors that our fine-tuned model avoids:
 
 ### Cost & Latency Comparison
 
 | Model | Latency | Cost / 1K samples | Offline | Privacy |
 |---|---|---|---|---|
 | **HowzerSeverity (ours)** | **~306ms** | **$0** | **Yes** | **Yes** |
@@ -518,7 +424,7 @@ L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_log
 ```bibtex
 @misc{howzer-severity-transformer-2026,
 title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
- author={Lennard Dawson},
 year={2026},
 publisher={Hugging Face},
 url={https://huggingface.co/lennarddaw/howzer-severity-transformer},
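The hunk header above quotes the training objective, L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_logits). A minimal NumPy sketch of that weighted sum follows; the focal gamma and any class weights are not stated in this diff, so gamma = 2.0 is an assumption:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def focal_loss(logits, target_idx, gamma=2.0):
    """Multi-class focal loss FL = -(1 - p_t)^gamma * log(p_t); gamma assumed."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_t = p[np.arange(len(target_idx)), target_idx]
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))

def combined_loss(score_pred, score_true, dims_pred, dims_true,
                  tier_logits, tier_true):
    # Weights taken verbatim from the loss quoted in the README
    return (1.5 * mse(score_pred, score_true)
            + 0.8 * mse(dims_pred, dims_true)
            + 3.0 * focal_loss(tier_logits, tier_true))

rng = np.random.default_rng(0)
loss = combined_loss(
    rng.uniform(size=3), rng.uniform(size=3),            # severity score in [0, 1]
    rng.uniform(size=(3, 4)), rng.uniform(size=(3, 4)),  # 4 dimension scores
    rng.standard_normal((3, 4)), np.array([0, 2, 3]),    # 4-class tier logits/targets
)
assert loss >= 0.0
```

The heavy 3.0 weight on the focal term matches the tier head being the deciding output; the focal shape down-weights easy, already-confident examples.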
 
 ### Architecture
 
+ <p align="center">
+ <img src="images/architecture.png" alt="HowzerSeverityTransformer Architecture" width="700">
+ </p>
 
 | Property | Value |
 |---|---|
 
 We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.
 
+ ### Summary
 
+ <p align="center">
+ <img src="images/summary_card.png" alt="Benchmark Summary" width="800">
+ </p>
 
+ ### Overall Comparison
 
+ <p align="center">
+ <img src="images/f1_comparison.png" alt="F1 Score Comparison" width="750">
+ </p>
+
+ <p align="center">
+ <img src="images/mae_comparison.png" alt="MAE Comparison" width="750">
+ </p>
 
 ### Detailed Metrics Table
 
 ### Per-Tier F1 Breakdown
 
+ <p align="center">
+ <img src="images/tier_f1_heatmap.png" alt="Per-Tier F1 Heatmap" width="750">
+ </p>
 
 Key observations:
 - **All models struggle most with `high` tier** (only 2 samples in test set; subtle boundary between medium and high)
 
 ### Confusion Matrices
 
+ <p align="center">
+ <img src="images/confusion_matrices.png" alt="Confusion Matrices" width="800">
+ </p>
 
 ### Why LLMs Struggle with Severity Scoring
 
 
 ### Cost & Latency Comparison
 
+ <p align="center">
+ <img src="images/cost_vs_f1.png" alt="Cost vs F1 Comparison" width="750">
+ </p>
+
 | Model | Latency | Cost / 1K samples | Offline | Privacy |
 |---|---|---|---|---|
 | **HowzerSeverity (ours)** | **~306ms** | **$0** | **Yes** | **Yes** |
 ```bibtex
 @misc{howzer-severity-transformer-2026,
 title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
+ author={Lennard Gross},
 year={2026},
 publisher={Hugging Face},
 url={https://huggingface.co/lennarddaw/howzer-severity-transformer},