HowzerSeverityTransformer – German Text-to-Severity Assessment
A fine-tuned deepset/gbert-large model (336M parameters) that scores the severity of German customer feedback directly from raw text. No upstream NLP pipeline needed: text in, severity scores out.
Replaces a 550-line rule-based engine (34 regex patterns, static weights, formula-based scoring) with a single end-to-end transformer that captures the semantic nuance that regex fundamentally cannot.
Why This Model Exists
Rule-based severity engines fail on paraphrasing, implicit harm, and contextual severity:
| Input (German) | Rule Engine | This Model | Why |
|---|---|---|---|
| "Hat mich ein kleines Vermögen gekostet" | low (no € amount) | high | Understands "small fortune" = expensive |
| "Kind stand 2 Stunden im Regen an der Autobahn" | medium (partial pattern match) | critical | Grasps child + highway + rain = danger |
| "Ewig lang hingezogen, unfassbar" | low (no hour count) | high | Captures extreme frustration semantically |
| "Super Service, alles bestens!" | low | low | Correctly identifies positive feedback |
Model Description
Architecture
| Property | Value |
|---|---|
| Base model | deepset/gbert-large (German BERT, 335M params) |
| Total parameters | 336,057,225 |
| Trainable (Phase 2) | ~26M (last 2 BERT layers + encoder + heads) |
| Input | Raw German text (max 128 tokens) |
| Training data | 10,000 German customer feedback samples (silver labels) |
| Training framework | PyTorch (manual training loop, no HF Trainer) |
| Training hardware | AMD iGPU via DirectML |
| Training time | ~5h Phase 1 (frozen) + ~20h Phase 2 (fine-tuning) |
| Export formats | ONNX (opset 17), PyTorch checkpoint |
Outputs
| Output | Shape | Range | Description |
|---|---|---|---|
| `severity_score` | (1,) | 0.0 – 1.0 | Composite severity score |
| `dimensions` | (4,) | 0.0 – 1.0 each | 4 severity dimensions (see below) |
| `tier_logits` | (4,) | logits | Severity tier classification |
Severity Dimensions
| Dimension | Description | Example Signals |
|---|---|---|
| Emotional Intensity | How strongly negative are the emotions? | Anger, disgust, frustration, extreme language |
| Impact Severity | Financial, safety, or data loss? | Euro amounts, injuries, data deletion |
| Service Failure | SLA breach, repeated failures? | Hours waiting, multiple contacts, broken promises |
| Irreversibility | Can the damage be undone? | Total loss, missed events, permanent data loss |
Severity Tiers
| Tier | Score Range | Description | Example |
|---|---|---|---|
| low | < 0.25 | Minor inconvenience or positive feedback | "App ist manchmal langsam" |
| medium | 0.25 - 0.40 | Noticeable issue, customer unhappy | "Wartezeit am Telefon war zu lang" |
| high | 0.40 - 0.60 | Significant problem, potential churn risk | "Versicherung hat Schaden abgelehnt" |
| critical | >= 0.60 | Severe: safety, legal, data risk, extreme harm | "Kind stand nachts an der Autobahn, keine Hilfe" |
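The tier boundaries above can be applied to the model's composite score with a trivial mapping. A minimal sketch, assuming boundary values (e.g. exactly 0.40) belong to the higher tier:

```python
def score_to_tier(score: float) -> str:
    """Map a composite severity score in [0, 1] to a tier name.

    Thresholds follow the table above; boundary values are assumed
    to fall into the higher tier (e.g. 0.40 -> "high").
    """
    if score >= 0.60:
        return "critical"
    if score >= 0.40:
        return "high"
    if score >= 0.25:
        return "medium"
    return "low"

print(score_to_tier(0.08))   # low
print(score_to_tier(0.582))  # high
```

In practice the model also emits `tier_logits`, so the argmax of those logits is the primary tier prediction; this mapping is only a fallback when you want a tier from the score alone.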
Performance
Test Set (1,500 held-out samples, natural distribution)
| Metric | Value | Target | Status |
|---|---|---|---|
| Tier F1 (weighted) | 0.961 | > 0.82 | Exceeded |
| Tier F1 (macro) | 0.958 | – | – |
| Tier F1 (critical) | 1.000 | > 0.50 | Perfect |
| Score MAE | 0.031 | < 0.08 | 2.5x better |
| Score MSE | 0.002 | < 0.015 | 7.5x better |
| Dimension MAE (avg) | 0.050 | < 0.10 | 2x better |
| Critical→Low misclass | 0 | 0 | PASS |
Per-Tier F1
| Tier | Precision | Recall | F1 | Test Samples |
|---|---|---|---|---|
| Low | 0.99 | 0.96 | 0.972 | 774 |
| Medium | 0.94 | 0.97 | 0.954 | 636 |
| High | 0.88 | 0.93 | 0.905 | 87 |
| Critical | 1.00 | 1.00 | 1.000 | 3 |
Confusion Matrix
Predicted
low medium high critical
True low 742 32 0 0
True medium 10 615 11 0
True high 0 6 81 0
True critical 0 0 0 3
Golden Set (449 curated hard samples, oversampled minorities)
| Metric | Value |
|---|---|
| Tier F1 (weighted) | 0.960 |
| Tier F1 (critical) | 1.000 |
| Tier F1 (high) | 0.970 |
| Score MAE | 0.034 |
| Critical→Low | 0 |
Golden set performance matches the test set – no overfitting to easy cases.
Per-Dimension MAE
| Dimension | Test MAE | Golden MAE |
|---|---|---|
| Emotional Intensity | 0.060 | 0.073 |
| Impact Severity | 0.072 | 0.100 |
| Service Failure | 0.065 | 0.095 |
| Irreversibility | 0.005 | 0.025 |
| Average | 0.050 | 0.073 |
Benchmark: Fine-Tuned Model vs Large Language Models
We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.
Detailed Metrics Table
| Model | Type | Params | F1 (w) | F1 (macro) | Accuracy | MAE | Crit F1 | C->L |
|---|---|---|---|---|---|---|---|---|
| HowzerSeverity (ours) | fine-tuned | 336M | 1.000 | 1.000 | 100.0% | 0.030 | 1.000 | 0 |
| Claude Sonnet 4.6 | LLM (cloud) | ~70B? | 0.981 | 0.943 | 97.9% | 0.006 | 1.000 | 0 |
| Claude Opus 4.6 | LLM (cloud) | ~70B? | 0.885 | 0.818 | 87.5% | 0.065 | 1.000 | 0 |
| Claude Haiku 4.5 | LLM (cloud) | ~8B? | 0.811 | 0.779 | 81.2% | 0.038 | 1.000 | 0 |
| mDeBERTa XNLI | zero-shot NLI | 278M | 0.456 | 0.284 | 43.8% | 0.163 | 0.000 | 0 |
| nlptown Star Rating | sentiment-map | 110M | 0.429 | 0.305 | 37.5% | 0.212 | 0.353 | 0 |
| German Sentiment BERT | sentiment-map | 110M | 0.287 | 0.170 | 27.1% | 0.190 | 0.000 | 0 |
| BART MNLI | zero-shot NLI | 406M | 0.211 | 0.143 | 16.7% | 0.234 | 0.133 | 0 |
F1 (w) = weighted F1 across tiers | MAE = mean absolute error on severity score [0-1] | Crit F1 = F1 on critical tier | C->L = critical samples misclassified as low (safety metric)
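The metric definitions in the footnote above can be made concrete in a few lines of pure Python. This is an illustrative sketch of the standard one-vs-rest F1 and support-weighted average (the card's evaluation script is not shown; the toy labels below are not the actual test set):

```python
from collections import Counter

def per_tier_f1(y_true, y_pred, tier):
    """F1 for a single tier (one-vs-rest)."""
    tp = sum(t == tier and p == tier for t, p in zip(y_true, y_pred))
    fp = sum(t != tier and p == tier for t, p in zip(y_true, y_pred))
    fn = sum(t == tier and p != tier for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def weighted_f1(y_true, y_pred):
    """Per-tier F1 averaged by true-class support: the 'F1 (w)' column."""
    counts = Counter(y_true)
    return sum(n / len(y_true) * per_tier_f1(y_true, y_pred, t)
               for t, n in counts.items())

def score_mae(s_true, s_pred):
    """Mean absolute error on the [0, 1] severity scores: the 'MAE' column."""
    return sum(abs(a - b) for a, b in zip(s_true, s_pred)) / len(s_true)

# Toy example, for illustration only:
y_true = ["low", "low", "medium", "high", "critical", "medium"]
y_pred = ["low", "medium", "medium", "high", "critical", "medium"]
print(f"{weighted_f1(y_true, y_pred):.3f}")  # 0.822
```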
Per-Tier F1 Breakdown
Key observations:
- All models struggle most with the `high` tier (only 2 samples in the test set; subtle boundary between medium and high)
- All Claude models nail `critical` (F1 = 1.000), but off-the-shelf NLP models mostly fail here
- Our model is the only one with perfect F1 across all four tiers
Why LLMs Struggle with Severity Scoring
Even the best LLMs make systematic errors that our fine-tuned model avoids:
| Error Pattern | Example (German) | LLM Says | Ours Says | Why |
|---|---|---|---|---|
| Over-scoring complaints | "2,5 Stunden Wartezeit auf der A3!" | high (0.40) | medium (0.30) | Trained on calibrated tier thresholds: long wait without escalation signals = medium |
| Confusing cancellation with severity | "Wegen Klimaziele gekündigt" | high (0.45) | medium (0.32) | Policy disagreement ≠ service failure; learned churn signal weight |
| Missing positive resolution | "Panne, aber ADAC hat gut koordiniert" | medium (0.30) | low (0.08) | Understands that resolved issues have low severity |
| Misreading mixed sentiment | "40 Jahre Mitglied, nie enttäuscht" | medium (0.35) | medium (0.39) | Both get this right, but LLMs often over-index on emotion words |
Cost & Latency Comparison
| Model | Latency | Cost / 1K samples | Offline | Privacy |
|---|---|---|---|---|
| HowzerSeverity (ours) | ~306ms | $0 | Yes | Yes |
| Claude Sonnet 4.6 | ~2-5s | ~$9.00 | No | No |
| Claude Opus 4.6 | ~3-8s | ~$45.00 | No | No |
| Claude Haiku 4.5 | ~1-3s | ~$3.00 | No | No |
| mDeBERTa XNLI | ~560ms | $0 | Yes | Yes |
| German Sentiment BERT | ~55ms | $0 | Yes | Yes |
Key Takeaways
1. A fine-tuned 336M model beats all tested LLMs on domain-specific scoring. Our model achieves F1=1.000 on 48 stratified samples. The closest LLM (Claude Sonnet 4.6, ~70B+ params) reaches F1=0.981 – impressive, but at 200x the parameters, ~$9/1K samples, and mandatory cloud dependency.
2. Zero-shot approaches fundamentally fail. NLI models (mDeBERTa, BART) and sentiment models (German BERT, nlptown) score F1 < 0.46 – no amount of prompt engineering can teach tier boundaries that require domain-calibrated training data.
3. LLMs get the extremes right but blur the middle. All Claude models achieve perfect critical-tier F1, but struggle with the medium/high boundary where domain-specific calibration matters most. This is exactly where customer service triage decisions are made.
4. You don't need a 100B+ parameter general-purpose LLM for structured scoring tasks. A 336M fine-tuned model runs 10-25x faster, costs $0 at inference, works fully offline with complete data privacy, and still outperforms models 200x its size.
Usage
ONNX Runtime (recommended for production)
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load
session = ort.InferenceSession("howzer_severity.onnx")
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-large")

# Tokenize
text = "Pannenhilfe kam nach 4 Stunden immer noch nicht, Kind im Auto."
enc = tokenizer(text, max_length=128, padding="max_length",
                truncation=True, return_tensors="np")

# Inference
outputs = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})

severity_score = outputs[0].squeeze()  # 0.0 - 1.0
dimensions = outputs[1].squeeze()      # [emotional, impact, service, irreversibility]
tier_logits = outputs[2].squeeze()     # 4-class logits

tier_names = ["low", "medium", "high", "critical"]
tier = tier_names[tier_logits.argmax()]

print(f"Severity: {severity_score:.3f} ({tier})")
print(f"  Emotional:       {dimensions[0]:.3f}")
print(f"  Impact:          {dimensions[1]:.3f}")
print(f"  Service:         {dimensions[2]:.3f}")
print(f"  Irreversibility: {dimensions[3]:.3f}")
# Output: Severity: 0.582 (high)
```
Python Inference Helper
```python
from inference import SeverityScorer

scorer = SeverityScorer.from_onnx("howzer_severity.onnx")
result = scorer.predict("Der ADAC hat mich komplett im Stich gelassen!")

print(f"Score: {result.severity_score:.3f}")
print(f"Tier: {result.tier}")
print(f"Dimensions: {result.dimensions}")
```
PyTorch (for research / fine-tuning)
```python
import torch
from transformers import AutoTokenizer

# Load backbone + custom heads
# (requires src/severity/model.py from the training repo)
from src.severity.model import HowzerSeverityTransformer

model = HowzerSeverityTransformer(backbone_name="deepset/gbert-large")
ckpt = torch.load("phase2_best.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-large")
enc = tokenizer("Absolute Frechheit!", max_length=128,
                padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(enc["input_ids"], enc["attention_mask"])

print(f"Score: {out['severity_score'].item():.3f}")
```
Training Details
Two-Phase Strategy
Phase 1 β Head Training (backbone frozen)
- Freeze all 335M gbert-large parameters
- Train only: severity encoder + 3 output heads (~312K params)
- 15 epochs, LR=1e-3 with cosine warmup
- Best: epoch 10, tier accuracy 79.2%, score MAE 0.053
Phase 2 β Fine-tuning (last 2 BERT layers unfrozen)
- Unfreeze last 2 of 24 transformer layers + pooler (~26M trainable params)
- 25 epochs, LR=2e-5 with cosine warmup
- Best: epoch 14, tier accuracy 96.7%, score MAE 0.030
- Improvement over Phase 1: +17.5pp accuracy, -43% MAE
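The freeze/unfreeze logic for the two phases could look like the following sketch. It uses a small stand-in `BertConfig` so the snippet runs without downloading weights; the real backbone is `AutoModel.from_pretrained("deepset/gbert-large")` with 24 layers:

```python
import torch
from transformers import BertConfig, BertModel

# Small stand-in config for illustration; the real model uses
# deepset/gbert-large (24 layers, 335M params).
config = BertConfig(hidden_size=64, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=128)
backbone = BertModel(config)

# Phase 1: freeze the entire backbone; only the severity heads train.
for p in backbone.parameters():
    p.requires_grad = False

# Phase 2: unfreeze the last 2 encoder layers plus the pooler.
for layer in backbone.encoder.layer[-2:]:
    for p in layer.parameters():
        p.requires_grad = True
for p in backbone.pooler.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"Trainable backbone params: {trainable:,}")
```

With the real 24-layer backbone, the same loop leaves roughly the ~26M trainable parameters reported above.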
Loss Function
Multi-objective loss combining all three heads:
L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_logits)
- Focal Loss (gamma=1.5) for tier classification handles class imbalance
- Class weights: [1.0, 1.2, 9.0, 50.0] for [low, medium, high, critical]
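A PyTorch sketch of this loss, using the weights and gamma stated above (the repo's actual implementation is not shown and may differ in detail):

```python
import torch
import torch.nn.functional as F

# Class weights from the card, ordered [low, medium, high, critical].
CLASS_WEIGHTS = torch.tensor([1.0, 1.2, 9.0, 50.0])

def focal_loss(logits, targets, gamma=1.5, weights=CLASS_WEIGHTS):
    """Class-weighted focal loss: down-weights easy, confident examples."""
    ce = F.cross_entropy(logits, targets, weight=weights, reduction="none")
    # pt = predicted probability of the true class (from unweighted CE).
    pt = torch.exp(-F.cross_entropy(logits, targets, reduction="none"))
    return ((1 - pt) ** gamma * ce).mean()

def total_loss(score_pred, score_true, dim_pred, dim_true,
               tier_logits, tier_true):
    """L = 1.5*MSE(score) + 0.8*MSE(dims) + 3.0*FocalLoss(tiers)."""
    return (1.5 * F.mse_loss(score_pred, score_true)
            + 0.8 * F.mse_loss(dim_pred, dim_true)
            + 3.0 * focal_loss(tier_logits, tier_true))
```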
Data
- 10,000 German customer feedback samples (ADAC automobile club reviews)
- Silver labels generated by running the rule-based severity engine with full upstream pipeline (sentiment BERT + emotion keywords + root cause keywords + intent regex)
- Tier distribution: 51.6% low, 42.4% medium, 5.8% high, 0.2% critical
- Sqrt oversampling for minority tiers in training set (high ~3x, critical ~20x)
- Stratified splits: 70% train / 15% val / 15% test
- 449 golden labels with curated edge cases for robust evaluation
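The sqrt-oversampling factors can be reproduced from the tier distribution above. A sketch, assuming the repeat factor for each class is sqrt(n_majority / n_class) (this yields ~3x for high and ~16x for critical, in the same ballpark as the stated factors; the exact scheme in the repo may differ):

```python
import math

# Approximate tier counts from the 10,000-sample pool and the
# distribution given above (51.6% / 42.4% / 5.8% / 0.2%).
counts = {"low": 5160, "medium": 4240, "high": 580, "critical": 20}

n_max = max(counts.values())
# Sqrt oversampling: repeat each class by sqrt(n_max / n_class).
repeat = {tier: math.sqrt(n_max / n) for tier, n in counts.items()}
for tier, r in repeat.items():
    print(f"{tier:<9} x{r:.1f}")
# low x1.0, medium x1.1, high x3.0, critical x16.1
```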
Optimizer
- AdamW (betas=0.9/0.999, weight_decay=1e-2)
- Cosine annealing with linear warmup
- Gradient clipping at 1.0
- Gradient accumulation: 2 steps (Phase 1), 8 steps (Phase 2) for effective batch size 32
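Putting the optimizer settings together, the training step could be sketched as follows. The model, step counts, and micro-batch size are stand-ins (the card does not state per-epoch step counts); only the hyperparameters listed above are taken from the card:

```python
import math
import torch

model = torch.nn.Linear(8, 4)  # stand-in for the real 336M model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              betas=(0.9, 0.999), weight_decay=1e-2)

total_steps, warmup_steps = 1000, 100  # hypothetical values

def lr_lambda(step):
    # Linear warmup, then cosine annealing to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Gradient accumulation: micro-batch 4 * 8 accum steps = effective batch 32.
accum_steps = 8
x, y = torch.randn(64, 8), torch.randn(64, 4)
w0 = model.weight.detach().clone()

for step in range(16):
    xb, yb = x[step * 4:(step + 1) * 4], y[step * 4:(step + 1) * 4]
    loss = torch.nn.functional.mse_loss(model(xb), yb) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```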
Intended Use
- Primary: Scoring severity of German customer feedback in customer service / CRM pipelines
- Domain: Automotive clubs, insurance, service providers (trained on ADAC data)
- Languages: German only (the backbone and training data are German-specific)
Out of Scope
- Non-German text (model has not seen any non-German training data)
- General sentiment analysis (this scores severity of bad experience, not positive/negative)
- Non-customer-feedback text (academic writing, news, social media etc.)
Limitations
- Silver labels from rule engine: The training labels were generated by a rule-based system, not human annotators. The model has learned to match and exceed the rule engine but may inherit systematic biases from it.
- Critical tier data scarcity: Only 19 critical-tier samples in 10K records (0.2%). Critical F1=1.0 on test set but only 3 test samples β more validation needed.
- ADAC domain specificity: Trained on automobile club customer feedback. Transfer to other German-language customer service domains is plausible but not validated.
- Irreversibility dimension: Very sparse in training data (mean=0.001, most samples have 0.0). The model may underpredict this dimension for edge cases.
Ethical Considerations
- This model scores experience severity, not customer value. It should be used to prioritize support for customers with the worst experiences, not to discriminate.
- The model does not process or store personal data beyond the input text.
- Predictions should be reviewed by human agents for critical-tier cases before taking action.
Files
| File | Size | Description |
|---|---|---|
| `howzer_severity.onnx` | 1.28 GB | ONNX model (opset 17, CPU inference) |
| `config.json` | 1 KB | Model configuration and metadata |
| `inference.py` | 4 KB | Python inference helper class |
Technical Specifications
| Spec | Value |
|---|---|
| ONNX opset | 17 |
| Input: `input_ids` | int64 [batch, 128] |
| Input: `attention_mask` | int64 [batch, 128] |
| Output: `severity_score` | float32 [batch, 1] |
| Output: `dimensions` | float32 [batch, 4] |
| Output: `tier_logits` | float32 [batch, 4] |
| Inference latency (CPU) | ~320ms / sample |
| PyTorch vs ONNX max diff | 1.9e-6 |
Citation
```bibtex
@misc{howzer-severity-transformer-2026,
  title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
  author={Lennard Gross},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/lennarddaw/howzer-severity-transformer},
}
```
Acknowledgments
- deepset/gbert-large – German BERT backbone
- ADAC – Customer feedback data domain