Upload README.md with huggingface_hub
README.md CHANGED
@@ -64,33 +64,9 @@ Rule-based severity engines fail on paraphrasing, implicit harm, and contextual

### Architecture

-```
-gbert-large Tokenizer (31K vocab, WordPiece)
-                     │
-gbert-large Encoder (24 layers, 1024-dim)
-                     │
-[CLS] token embedding (1024-dim)
-                     │
-┌─────────────── Severity Encoder ────────────────┐
-│ Linear(1024→256) + LN + GELU + Dropout(0.1)     │
-│ Linear(256→128)  + LN + GELU + Dropout(0.1)     │
-└───────────────────────┬─────────────────────────┘
-                        │
-        ┌───────────────┼───────────────┐
-        │               │               │
-   ┌────┴────┐     ┌────┴────┐     ┌────┴─────┐
-   │ Score   │     │ Dims    │     │ Tier     │
-   │ Head    │     │ Head    │     │ Head     │
-   │ 128→64  │     │ 128→64  │     │ 128→64   │
-   │ →1      │     │ →4      │     │ →4       │
-   │ sigmoid │     │ sigmoid │     │ (logits) │
-   └────┬────┘     └────┬────┘     └────┬─────┘
-        │               │               │
-   severity        dimensions      tier_logits
-   [0..1]          [0..1] x 4        4-class
-```
+<p align="center">
+  <img src="images/architecture.png" alt="HowzerSeverityTransformer Architecture" width="700">
+</p>

| Property | Value |
|---|---|
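The removed diagram translates almost line-for-line into code. Below is a minimal PyTorch sketch of the shared trunk and the three heads, assuming only the shapes shown above; the module names and the hidden activation inside each head are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

class SeverityEncoder(nn.Module):
    """Shared trunk from the diagram: 1024-d [CLS] embedding -> 128-d features."""
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
        )

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(cls_embedding)

class SeverityHeads(nn.Module):
    """Three parallel heads: scalar score, 4 dimensions, 4-class tier logits."""
    def __init__(self):
        super().__init__()
        # The GELU between the 64-unit layers is an assumption; the diagram only
        # specifies 128 -> 64 -> output and the output nonlinearity.
        self.score_head = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 1), nn.Sigmoid())
        self.dims_head = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 4), nn.Sigmoid())
        self.tier_head = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 4))  # raw logits

    def forward(self, features: torch.Tensor):
        return self.score_head(features), self.dims_head(features), self.tier_head(features)

# Usage: feed the gbert-large [CLS] embedding through the trunk, then the heads.
encoder, heads = SeverityEncoder(), SeverityHeads()
severity, dimensions, tier_logits = heads(encoder(torch.randn(2, 1024)))
```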
@@ -190,49 +166,21 @@ Golden set performance matches test set – no overfitting to easy cases.

We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.

-###
-
-```
-                         0.0   0.2   0.4   0.6   0.8   1.0
-                          |     |     |     |     |     |
-HowzerSeverity (ours)    ████████████████████████████████████████ 1.000  <-- this model
-Claude Sonnet 4.6        ███████████████████████████████████████  0.981
-Claude Opus 4.6          ███████████████████████████████████      0.885
-Claude Haiku 4.5         ████████████████████████████████         0.811
-mDeBERTa XNLI (0-shot)   ██████████████████                       0.456
-nlptown Stars (mapped)   █████████████████                        0.429
-German Sent. BERT (map)  ███████████                              0.287
-BART MNLI (0-shot)       ████████                                 0.211
-```
-
-```
-Tier Accuracy
-                         0%    20%   40%   60%   80%   100%
-                          |     |     |     |     |     |
-HowzerSeverity (ours)    ████████████████████████████████████████ 100.0%
-Claude Sonnet 4.6        ███████████████████████████████████████  97.9%
-Claude Opus 4.6          ███████████████████████████████████      87.5%
-Claude Haiku 4.5         ████████████████████████████████         81.2%
-mDeBERTa XNLI (0-shot)   █████████████████                        43.8%
-nlptown Stars (mapped)   ███████████████                          37.5%
-German Sent. BERT (map)  ██████████                               27.1%
-BART MNLI (0-shot)       ██████                                   16.7%
-```
-
-```
-Claude Opus 4.6          █████████                                0.065
-mDeBERTa XNLI (0-shot)   ████████████████████████                 0.163
-German Sent. BERT (map)  ██████████████████████████████           0.190
-nlptown Stars (mapped)   ████████████████████████████████         0.212
-BART MNLI (0-shot)       ███████████████████████████████████      0.234
-```
+### Summary
+
+<p align="center">
+  <img src="images/summary_card.png" alt="Benchmark Summary" width="800">
+</p>
+
+### Overall Comparison
+
+<p align="center">
+  <img src="images/f1_comparison.png" alt="F1 Score Comparison" width="750">
+</p>
+
+<p align="center">
+  <img src="images/mae_comparison.png" alt="MAE Comparison" width="750">
+</p>

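The setup paragraph above says the sentiment baselines were "adapted with heuristic mappings" but does not spell the mappings out. As an illustration only, here is one plausible shape for such an adapter: the nlptown checkpoint is real, but the star-to-score inversion and the tier thresholds below are invented for the example:

```python
from transformers import pipeline

# Hypothetical adapter: 1-5 star sentiment -> severity score and tier.
sentiment = pipeline("text-classification",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")

TIERS = ["low", "medium", "high", "critical"]

def stars_to_severity(text: str) -> tuple[float, str]:
    label = sentiment(text)[0]["label"]   # e.g. "2 stars"
    stars = int(label.split()[0])         # 1 (very negative) .. 5 (very positive)
    score = (5 - stars) / 4               # invert: 5 stars -> 0.0, 1 star -> 1.0
    tier = TIERS[min(int(score * 4), 3)]  # quarter-width tier buckets (assumed)
    return score, tier

print(stars_to_severity("Das ist völlig inakzeptabel!"))
```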
### Detailed Metrics Table
@@ -251,19 +199,9 @@ We compared this 336M-parameter fine-tuned model against general-purpose LLMs an

### Per-Tier F1 Breakdown

-```
-               │  low    │  med    │  high   │  crit   │
-───────────────┼─────────┼─────────┼─────────┼─────────┤
-Howzer (ours)  │  1.000  │  1.000  │  1.000  │  1.000  │
-Sonnet 4.6     │  1.000  │  0.973  │  0.800  │  1.000  │
-Opus 4.6       │  0.960  │  0.813  │  0.500  │  1.000  │
-Haiku 4.5      │  0.840  │  0.778  │  0.500  │  1.000  │
-mDeBERTa XNLI  │  0.681  │  0.273  │  0.182  │  0.000  │
-nlptown Stars  │  0.615  │  0.250  │  0.000  │  0.353  │
-Germ. Sent.    │  0.564  │  0.000  │  0.114  │  0.000  │
-BART MNLI      │  0.267  │  0.174  │  0.000  │  0.133  │
-───────────────┴─────────┴─────────┴─────────┴─────────┘
-```
+<p align="center">
+  <img src="images/tier_f1_heatmap.png" alt="Per-Tier F1 Heatmap" width="750">
+</p>

Key observations:
- **All models struggle most with `high` tier** (only 2 samples in test set; subtle boundary between medium and high)
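Per-tier numbers like those in the removed table are a single scikit-learn call away once predictions exist. A small sketch, assuming string tier labels; the `y_true`/`y_pred` lists are dummies standing in for the 48-sample test set:

```python
from sklearn.metrics import f1_score

TIERS = ["low", "medium", "high", "critical"]

# Dummy values matching the test-set distribution; a perfect run, like the Howzer row.
y_true = ["low"] * 24 + ["medium"] * 19 + ["high"] * 2 + ["critical"] * 3
y_pred = list(y_true)

per_tier = f1_score(y_true, y_pred, labels=TIERS, average=None)
macro = f1_score(y_true, y_pred, labels=TIERS, average="macro")
for tier, f1 in zip(TIERS, per_tier):
    print(f"{tier:>8}: {f1:.3f}")
print(f"   macro: {macro:.3f}")
```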
@@ -272,45 +210,9 @@ Key observations:

### Confusion Matrices

-**HowzerSeverity (ours):**
-```
-            Predicted
-            low   med   high  crit
-True low     24     0     0     0    Perfect
-True med      0    19     0     0    Perfect
-True high     0     0     2     0    Perfect
-True crit     0     0     0     3    Perfect
-```
-
-**Claude Sonnet 4.6 – LLM (~70B?):**
-```
-            Predicted
-            low   med   high  crit
-True low     24     0     0     0    Perfect
-True med      0    18     1     0    1 med→high
-True high     0     0     2     0    Perfect
-True crit     0     0     0     3    Perfect
-```
-
-**Claude Opus 4.6 – LLM (~70B?):**
-```
-            Predicted
-            low   med   high  crit
-True low     24     0     0     0    Perfect
-True med      2    13     4     0    6 errors
-True high     0     0     2     0    Perfect
-True crit     0     0     0     3    Perfect
-```
-
-**Claude Haiku 4.5 – LLM (~8B?):**
-```
-            Predicted
-            low   med   high  crit
-True low     21     3     0     0    3 low→med
-True med      4    14     1     0    5 errors
-True high     1     0     1     0    1 high→low
-True crit     0     0     0     3    Perfect
-```
+<p align="center">
+  <img src="images/confusion_matrices.png" alt="Confusion Matrices" width="800">
+</p>
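Each matrix above is likewise one scikit-learn call from the prediction lists. A sketch reproducing Sonnet's single med→high slip (dummy data, label order assumed to match the tables):

```python
from sklearn.metrics import confusion_matrix

TIERS = ["low", "medium", "high", "critical"]

y_true = ["low"] * 24 + ["medium"] * 19 + ["high"] * 2 + ["critical"] * 3
y_pred = ["low"] * 24 + ["medium"] * 18 + ["high"] * 3 + ["critical"] * 3  # one medium -> high

cm = confusion_matrix(y_true, y_pred, labels=TIERS)
print(cm)  # rows = true tier, columns = predicted tier
```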
### Why LLMs Struggle with Severity Scoring
@@ -325,6 +227,10 @@ Even the best LLMs make systematic errors that our fine-tuned model avoids:

### Cost & Latency Comparison

+<p align="center">
+  <img src="images/cost_vs_f1.png" alt="Cost vs F1 Comparison" width="750">
+</p>
+
| Model | Latency | Cost / 1K samples | Offline | Privacy |
|---|---|---|---|---|
| **HowzerSeverity (ours)** | **~306ms** | **$0** | **Yes** | **Yes** |
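The ~306ms figure is easy to sanity-check locally. A generic timing harness; the `predict` callable stands in for the model's text-to-severity function, and nothing here is from the repository:

```python
import time

def mean_latency_ms(predict, texts, warmup: int = 3, runs: int = 20) -> float:
    """Average wall-clock latency of `predict` over single-text calls."""
    for t in texts[:warmup]:
        predict(t)  # warm up caches / CUDA kernels before timing
    start = time.perf_counter()
    for i in range(runs):
        predict(texts[i % len(texts)])
    return (time.perf_counter() - start) / runs * 1000

# Any text -> result callable works; a trivial stand-in for demonstration:
print(f"{mean_latency_ms(lambda s: s.lower(), ['Beispieltext']):.1f} ms")
```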
@@ -518,7 +424,7 @@ L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_log

```bibtex
@misc{howzer-severity-transformer-2026,
  title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
-  author={Lennard
+  author={Lennard Gross},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/lennarddaw/howzer-severity-transformer},
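The hunk header above quotes the training objective, `L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_logits)`. A PyTorch sketch of that combined loss; only the three weights come from the README, while the focal gamma and the reductions are assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma: float = 2.0):
    """Multi-class focal loss; gamma=2.0 is a common default, assumed here."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)  # probability assigned to the true class
    return ((1 - pt) ** gamma * ce).mean()

def severity_loss(score_pred, score_true, dims_pred, dims_true, tier_logits, tier_true):
    # Weights 1.5 / 0.8 / 3.0 are quoted in the diff's hunk header.
    return (1.5 * F.mse_loss(score_pred, score_true)
            + 0.8 * F.mse_loss(dims_pred, dims_true)
            + 3.0 * focal_loss(tier_logits, tier_true))
```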