Upload README.md with huggingface_hub
README.md CHANGED
@@ -64,33 +64,9 @@ Rule-based severity engines fail on paraphrasing, implicit harm, and contextual

### Architecture

-```
-gbert-large Tokenizer (31K vocab, WordPiece)
-                     │
-gbert-large Encoder (24 layers, 1024-dim)
-                     │
-[CLS] token embedding (1024-dim)
-                     │
-┌─────────────── Severity Encoder ────────────────┐
-│ Linear(1024→256) + LN + GELU + Dropout(0.1)     │
-│ Linear(256→128)  + LN + GELU + Dropout(0.1)     │
-└───────────────────────┬─────────────────────────┘
-                        │
-        ┌───────────────┼───────────────┐
-        │               │               │
-   ┌────┴────┐     ┌────┴────┐     ┌────┴─────┐
-   │ Score   │     │ Dims    │     │ Tier     │
-   │ Head    │     │ Head    │     │ Head     │
-   │ 128→64  │     │ 128→64  │     │ 128→64   │
-   │ →1      │     │ →4      │     │ →4       │
-   │ sigmoid │     │ sigmoid │     │ (logits) │
-   └────┬────┘     └────┬────┘     └────┬─────┘
-        │               │               │
-   severity        dimensions      tier_logits
-   [0..1]          [0..1] x 4        4-class
-```
+<p align="center">
+  <img src="images/architecture.png" alt="HowzerSeverityTransformer Architecture" width="700">
+</p>

| Property | Value |
|---|---|
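The removed diagram translates almost line-for-line into code. Below is a minimal PyTorch sketch of the shared trunk and the three heads, assuming only the shapes shown above; the module names and the hidden activation inside each head are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

class SeverityEncoder(nn.Module):
    """Shared trunk from the diagram: 1024-d [CLS] embedding -> 128-d features."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
        )

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(cls_embedding)

class SeverityHeads(nn.Module):
    """Three parallel heads: scalar score, 4 dimensions, 4-class tier logits."""
    def __init__(self):
        super().__init__()
        # The GELU between the 64-unit layers is an assumption; the diagram only
        # specifies 128 -> 64 -> output and the output nonlinearity.
        self.score_head = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 1), nn.Sigmoid())
        self.dims_head = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 4), nn.Sigmoid())
        self.tier_head = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 4))  # raw logits

    def forward(self, features: torch.Tensor):
        return self.score_head(features), self.dims_head(features), self.tier_head(features)

# Usage: feed the gbert-large [CLS] embedding through the trunk, then the heads.
encoder, heads = SeverityEncoder(), SeverityHeads()
severity, dimensions, tier_logits = heads(encoder(torch.randn(2, 1024)))
```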
@@ -190,49 +166,21 @@ Golden set performance matches test set – no overfitting to easy cases.

We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.

-###
-
-```
-                         0.0   0.2   0.4   0.6   0.8   1.0
-                          |     |     |     |     |     |
-HowzerSeverity (ours)    ████████████████████████████████████████ 1.000  <-- this model
-Claude Sonnet 4.6        ███████████████████████████████████████  0.981
-Claude Opus 4.6          ███████████████████████████████████      0.885
-Claude Haiku 4.5         ████████████████████████████████         0.811
-mDeBERTa XNLI (0-shot)   ██████████████████                       0.456
-nlptown Stars (mapped)   █████████████████                        0.429
-German Sent. BERT (map)  ███████████                              0.287
-BART MNLI (0-shot)       ████████                                 0.211
-```
-
-```
-Tier Accuracy
-                         0%    20%   40%   60%   80%   100%
-                          |     |     |     |     |     |
-HowzerSeverity (ours)    ████████████████████████████████████████ 100.0%
-Claude Sonnet 4.6        ███████████████████████████████████████  97.9%
-Claude Opus 4.6          ███████████████████████████████████      87.5%
-Claude Haiku 4.5         ████████████████████████████████         81.2%
-mDeBERTa XNLI (0-shot)   █████████████████                        43.8%
-nlptown Stars (mapped)   ███████████████                          37.5%
-German Sent. BERT (map)  ██████████                               27.1%
-BART MNLI (0-shot)       ██████                                   16.7%
-```
-
-```
-Claude Opus 4.6          █████████                                0.065
-mDeBERTa XNLI (0-shot)   ████████████████████████                 0.163
-German Sent. BERT (map)  ██████████████████████████████           0.190
-nlptown Stars (mapped)   ████████████████████████████████         0.212
-BART MNLI (0-shot)       ███████████████████████████████████      0.234
-```
+### Summary
+
+<p align="center">
+  <img src="images/summary_card.png" alt="Benchmark Summary" width="800">
+</p>
+
+### Overall Comparison
+
+<p align="center">
+  <img src="images/f1_comparison.png" alt="F1 Score Comparison" width="750">
+</p>
+
+<p align="center">
+  <img src="images/mae_comparison.png" alt="MAE Comparison" width="750">
+</p>

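The setup paragraph above says the sentiment baselines were "adapted with heuristic mappings" but does not spell the mappings out. As an illustration only, here is one plausible shape for such an adapter: the nlptown checkpoint is real, but the star-to-score inversion and the tier thresholds below are invented for the example:

```python
from transformers import pipeline

# Hypothetical adapter: 1-5 star sentiment -> severity score and tier.
sentiment = pipeline("text-classification",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")

TIERS = ["low", "medium", "high", "critical"]

def stars_to_severity(text: str) -> tuple[float, str]:
    label = sentiment(text)[0]["label"]   # e.g. "2 stars"
    stars = int(label.split()[0])         # 1 (very negative) .. 5 (very positive)
    score = (5 - stars) / 4               # invert: 5 stars -> 0.0, 1 star -> 1.0
    tier = TIERS[min(int(score * 4), 3)]  # quarter-width tier buckets (assumed)
    return score, tier

print(stars_to_severity("Das ist völlig inakzeptabel!"))
```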
### Detailed Metrics Table
@@ -251,19 +199,9 @@ We compared this 336M-parameter fine-tuned model against general-purpose LLMs an

### Per-Tier F1 Breakdown

-```
-               │  low    │  med    │  high   │  crit   │
-───────────────┼─────────┼─────────┼─────────┼─────────┤
-Howzer (ours)  │  1.000  │  1.000  │  1.000  │  1.000  │
-Sonnet 4.6     │  1.000  │  0.973  │  0.800  │  1.000  │
-Opus 4.6       │  0.960  │  0.813  │  0.500  │  1.000  │
-Haiku 4.5      │  0.840  │  0.778  │  0.500  │  1.000  │
-mDeBERTa XNLI  │  0.681  │  0.273  │  0.182  │  0.000  │
-nlptown Stars  │  0.615  │  0.250  │  0.000  │  0.353  │
-Germ. Sent.    │  0.564  │  0.000  │  0.114  │  0.000  │
-BART MNLI      │  0.267  │  0.174  │  0.000  │  0.133  │
-───────────────┴─────────┴─────────┴─────────┴─────────┘
-```
+<p align="center">
+  <img src="images/tier_f1_heatmap.png" alt="Per-Tier F1 Heatmap" width="750">
+</p>

Key observations:
- **All models struggle most with `high` tier** (only 2 samples in test set; subtle boundary between medium and high)
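Per-tier numbers like those in the removed table are a single scikit-learn call away once predictions exist. A small sketch, assuming string tier labels; the `y_true`/`y_pred` lists are dummies standing in for the 48-sample test set:

```python
from sklearn.metrics import f1_score

TIERS = ["low", "medium", "high", "critical"]

# Dummy values matching the test-set distribution; a perfect run, like the Howzer row.
y_true = ["low"] * 24 + ["medium"] * 19 + ["high"] * 2 + ["critical"] * 3
y_pred = list(y_true)

per_tier = f1_score(y_true, y_pred, labels=TIERS, average=None)
macro = f1_score(y_true, y_pred, labels=TIERS, average="macro")
for tier, f1 in zip(TIERS, per_tier):
    print(f"{tier:>8}: {f1:.3f}")
print(f"   macro: {macro:.3f}")
```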
@@ -272,45 +210,9 @@ Key observations:

### Confusion Matrices

-**HowzerSeverity (ours):**
-```
-            Predicted
-            low   med   high  crit
-True low     24     0     0     0    Perfect
-True med      0    19     0     0    Perfect
-True high     0     0     2     0    Perfect
-True crit     0     0     0     3    Perfect
-```
-
-**Claude Sonnet 4.6 – LLM (~70B?):**
-```
-            Predicted
-            low   med   high  crit
-True low     24     0     0     0    Perfect
-True med      0    18     1     0    1 med→high
-True high     0     0     2     0    Perfect
-True crit     0     0     0     3    Perfect
-```
-
-**Claude Opus 4.6 – LLM (~70B?):**
-```
-            Predicted
-            low   med   high  crit
-True low     24     0     0     0    Perfect
-True med      2    13     4     0    6 errors
-True high     0     0     2     0    Perfect
-True crit     0     0     0     3    Perfect
-```
-
-**Claude Haiku 4.5 – LLM (~8B?):**
-```
-            Predicted
-            low   med   high  crit
-True low     21     3     0     0    3 low→med
-True med      4    14     1     0    5 errors
-True high     1     0     1     0    1 high→low
-True crit     0     0     0     3    Perfect
-```
+<p align="center">
+  <img src="images/confusion_matrices.png" alt="Confusion Matrices" width="800">
+</p>
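Each matrix above is likewise one scikit-learn call from the prediction lists. A sketch reproducing Sonnet's single med→high slip (dummy data, label order assumed to match the tables):

```python
from sklearn.metrics import confusion_matrix

TIERS = ["low", "medium", "high", "critical"]

y_true = ["low"] * 24 + ["medium"] * 19 + ["high"] * 2 + ["critical"] * 3
y_pred = ["low"] * 24 + ["medium"] * 18 + ["high"] * 3 + ["critical"] * 3  # one medium -> high

cm = confusion_matrix(y_true, y_pred, labels=TIERS)
print(cm)  # rows = true tier, columns = predicted tier
```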
### Why LLMs Struggle with Severity Scoring
@@ -325,6 +227,10 @@ Even the best LLMs make systematic errors that our fine-tuned model avoids:

### Cost & Latency Comparison

+<p align="center">
+  <img src="images/cost_vs_f1.png" alt="Cost vs F1 Comparison" width="750">
+</p>
+
| Model | Latency | Cost / 1K samples | Offline | Privacy |
|---|---|---|---|---|
| **HowzerSeverity (ours)** | **~306ms** | **$0** | **Yes** | **Yes** |
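The ~306ms figure is easy to sanity-check locally. A generic timing harness; the `predict` callable stands in for the model's text-to-severity function, and nothing here is from the repository:

```python
import time

def mean_latency_ms(predict, texts, warmup: int = 3, runs: int = 20) -> float:
    """Average wall-clock latency of `predict` over single-text calls."""
    for t in texts[:warmup]:
        predict(t)  # warm up caches / CUDA kernels before timing
    start = time.perf_counter()
    for i in range(runs):
        predict(texts[i % len(texts)])
    return (time.perf_counter() - start) / runs * 1000

# Any text -> result callable works; a trivial stand-in for demonstration:
print(f"{mean_latency_ms(lambda s: s.lower(), ['Beispieltext']):.1f} ms")
```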
@@ -518,7 +424,7 @@ L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_log

```bibtex
@misc{howzer-severity-transformer-2026,
  title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
-  author={Lennard
+  author={Lennard Gross},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/lennarddaw/howzer-severity-transformer},
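The hunk header above quotes the training objective, `L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_logits)`. A PyTorch sketch of that combined loss; only the three weights come from the README, while the focal gamma and the reductions are assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma: float = 2.0):
    """Multi-class focal loss; gamma=2.0 is a common default, assumed here."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)  # probability assigned to the true class
    return ((1 - pt) ** gamma * ce).mean()

def severity_loss(score_pred, score_true, dims_pred, dims_true, tier_logits, tier_true):
    # Weights 1.5 / 0.8 / 3.0 are quoted in the diff's hunk header.
    return (1.5 * F.mse_loss(score_pred, score_true)
            + 0.8 * F.mse_loss(dims_pred, dims_true)
            + 3.0 * focal_loss(tier_logits, tier_true))
```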