HowzerSeverityTransformer: German Text-to-Severity Assessment

A fine-tuned deepset/gbert-large model (336M parameters) that scores the severity of German customer feedback directly from raw text. No upstream NLP pipeline needed: text in, severity scores out.

Replaces a 550-line rule-based engine (34 regex patterns, static weights, formula-based scoring) with a single end-to-end transformer that captures semantic nuance that regex fundamentally cannot.

Why This Model Exists

Rule-based severity engines fail on paraphrasing, implicit harm, and contextual severity:

| Input (German) | Rule Engine | This Model | Why |
|---|---|---|---|
| "Hat mich ein kleines Vermögen gekostet" | low (no € amount) | high | Understands "small fortune" = expensive |
| "Kind stand 2 Stunden im Regen an der Autobahn" | medium (partial pattern match) | critical | Grasps child + highway + rain = danger |
| "Ewig lang hingezogen, unfassbar" | low (no hour count) | high | Captures extreme frustration semantically |
| "Super Service, alles bestens!" | low | low | Correctly identifies positive feedback |

Model Description

Architecture


| Property | Value |
|---|---|
| Base model | deepset/gbert-large (German BERT, 335M params) |
| Total parameters | 336,057,225 |
| Trainable (Phase 2) | ~26M (last 2 BERT layers + encoder + heads) |
| Input | Raw German text (max 128 tokens) |
| Training data | 10,000 German customer feedback samples (silver labels) |
| Training framework | PyTorch (manual training loop, no HF Trainer) |
| Training hardware | AMD iGPU via DirectML |
| Training time | ~5h Phase 1 (frozen) + ~20h Phase 2 (fine-tuning) |
| Export formats | ONNX (opset 17), PyTorch checkpoint |

Outputs

| Output | Shape | Range | Description |
|---|---|---|---|
| severity_score | (1,) | 0.0 - 1.0 | Composite severity score |
| dimensions | (4,) | 0.0 - 1.0 each | 4 severity dimensions (see below) |
| tier_logits | (4,) | logits | Severity tier classification |

Severity Dimensions

| Dimension | Description | Example Signals |
|---|---|---|
| Emotional Intensity | How strongly negative are the emotions? | Anger, disgust, frustration, extreme language |
| Impact Severity | Financial, safety, or data loss? | Euro amounts, injuries, data deletion |
| Service Failure | SLA breach, repeated failures? | Hours waiting, multiple contacts, broken promises |
| Irreversibility | Can the damage be undone? | Total loss, missed events, permanent data loss |

Severity Tiers

| Tier | Score Range | Description | Example |
|---|---|---|---|
| low | < 0.25 | Minor inconvenience or positive feedback | "App ist manchmal langsam" |
| medium | 0.25 - 0.40 | Noticeable issue, customer unhappy | "Wartezeit am Telefon war zu lang" |
| high | 0.40 - 0.60 | Significant problem, potential churn risk | "Versicherung hat Schaden abgelehnt" |
| critical | >= 0.60 | Severe: safety, legal, data risk, extreme harm | "Kind stand nachts an der Autobahn, keine Hilfe" |
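The tier thresholds above can be written as a small helper. This is a sketch for reporting and downstream routing; the deployed model predicts tiers with its own classification head. Boundary handling (lower bound inclusive) is an assumption:

```python
def score_to_tier(score: float) -> str:
    """Map a composite severity score in [0, 1] to its tier label,
    using the threshold table above (lower bounds assumed inclusive)."""
    if score < 0.25:
        return "low"
    if score < 0.40:
        return "medium"
    if score < 0.60:
        return "high"
    return "critical"
```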

Performance

Test Set (1,500 held-out samples, natural distribution)

| Metric | Value | Target | Status |
|---|---|---|---|
| Tier F1 (weighted) | 0.961 | > 0.82 | Exceeded |
| Tier F1 (macro) | 0.958 | - | - |
| Tier F1 (critical) | 1.000 | > 0.50 | Perfect |
| Score MAE | 0.031 | < 0.08 | 2.5x better |
| Score MSE | 0.002 | < 0.015 | 7.5x better |
| Dimension MAE (avg) | 0.050 | < 0.10 | 2x better |
| Critical→Low misclass | 0 | 0 | PASS |

Per-Tier F1

| Tier | Precision | Recall | F1 | Test Samples |
|---|---|---|---|---|
| Low | 0.99 | 0.96 | 0.972 | 774 |
| Medium | 0.94 | 0.97 | 0.954 | 636 |
| High | 0.88 | 0.93 | 0.905 | 87 |
| Critical | 1.00 | 1.00 | 1.000 | 3 |

Confusion Matrix

             Predicted
             low    medium   high   critical
True low      742      32      0        0
True medium    10     615     11        0
True high       0       6     81        0
True critical   0       0      0        3

Golden Set (449 curated hard samples, oversampled minorities)

| Metric | Value |
|---|---|
| Tier F1 (weighted) | 0.960 |
| Tier F1 (critical) | 1.000 |
| Tier F1 (high) | 0.970 |
| Score MAE | 0.034 |
| Critical→Low | 0 |

Golden set performance matches the test set: no overfitting to easy cases.

Per-Dimension MAE

| Dimension | Test MAE | Golden MAE |
|---|---|---|
| Emotional Intensity | 0.060 | 0.073 |
| Impact Severity | 0.072 | 0.100 |
| Service Failure | 0.065 | 0.095 |
| Irreversibility | 0.005 | 0.025 |
| Average | 0.050 | 0.073 |

Benchmark: Fine-Tuned Model vs Large Language Models

We compared this 336M-parameter fine-tuned model against general-purpose LLMs and off-the-shelf NLP models on the same 48 stratified test samples (24 low, 19 medium, 2 high, 3 critical). LLMs received raw German text with a detailed system prompt explaining the scoring dimensions and tier thresholds, and were asked to return structured JSON with severity scores. Zero-shot NLI and sentiment models were adapted with heuristic mappings.

Summary

(Figures: benchmark summary, overall comparison, F1 score comparison, MAE comparison)

Detailed Metrics Table

| Model | Type | Params | F1 (w) | F1 (macro) | Accuracy | MAE | Crit F1 | C->L |
|---|---|---|---|---|---|---|---|---|
| HowzerSeverity (ours) | fine-tuned | 336M | 1.000 | 1.000 | 100.0% | 0.030 | 1.000 | 0 |
| Claude Sonnet 4.6 | LLM (cloud) | ~70B? | 0.981 | 0.943 | 97.9% | 0.006 | 1.000 | 0 |
| Claude Opus 4.6 | LLM (cloud) | ~70B? | 0.885 | 0.818 | 87.5% | 0.065 | 1.000 | 0 |
| Claude Haiku 4.5 | LLM (cloud) | ~8B? | 0.811 | 0.779 | 81.2% | 0.038 | 1.000 | 0 |
| mDeBERTa XNLI | zero-shot NLI | 278M | 0.456 | 0.284 | 43.8% | 0.163 | 0.000 | 0 |
| nlptown Star Rating | sentiment-map | 110M | 0.429 | 0.305 | 37.5% | 0.212 | 0.353 | 0 |
| German Sentiment BERT | sentiment-map | 110M | 0.287 | 0.170 | 27.1% | 0.190 | 0.000 | 0 |
| BART MNLI | zero-shot NLI | 406M | 0.211 | 0.143 | 16.7% | 0.234 | 0.133 | 0 |

F1 (w) = weighted F1 across tiers | MAE = mean absolute error on severity score [0-1] | Crit F1 = F1 on critical tier | C->L = critical samples misclassified as low (safety metric)
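The headline numbers above (per-tier F1, weighted F1, and the C->L safety count) can be recomputed from tier predictions in a few lines of pure Python. `tier_metrics` is a hypothetical helper for illustration, not part of the released inference.py:

```python
def tier_metrics(true_tiers, pred_tiers):
    """Per-tier F1, support-weighted F1, and the critical-to-low safety
    count, computed from parallel label lists (no dependencies)."""
    tiers = ["low", "medium", "high", "critical"]
    f1, support = {}, {}
    for t in tiers:
        tp = sum(a == t and b == t for a, b in zip(true_tiers, pred_tiers))
        fp = sum(a != t and b == t for a, b in zip(true_tiers, pred_tiers))
        fn = sum(a == t and b != t for a, b in zip(true_tiers, pred_tiers))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        support[t] = tp + fn
    n = len(true_tiers)
    weighted = sum(f1[t] * support[t] for t in tiers) / n
    crit_to_low = sum(a == "critical" and b == "low"
                      for a, b in zip(true_tiers, pred_tiers))
    return {"per_tier_f1": f1, "f1_weighted": weighted,
            "critical_to_low": crit_to_low}
```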

Per-Tier F1 Breakdown

(Figure: per-tier F1 heatmap)

Key observations:

  • All models struggle most with high tier (only 2 samples in test set; subtle boundary between medium and high)
  • All Claude models nail critical (F1=1.000), but off-the-shelf NLP models mostly fail here
  • Our model is the only one with perfect F1 across all four tiers

Confusion Matrices

(Figure: per-model confusion matrices)

Why LLMs Struggle with Severity Scoring

Even the best LLMs make systematic errors that our fine-tuned model avoids:

| Error Pattern | Example (German) | LLM Says | Ours Says | Why |
|---|---|---|---|---|
| Over-scoring complaints | "2,5 Stunden Wartezeit auf der A3!" | high (0.40) | medium (0.30) | Trained on calibrated tier thresholds: long wait without escalation signals = medium |
| Confusing cancellation with severity | "Wegen Klimaziele gekündigt" | high (0.45) | medium (0.32) | Policy disagreement ≠ service failure; learned churn signal weight |
| Missing positive resolution | "Panne, aber ADAC hat gut koordiniert" | medium (0.30) | low (0.08) | Understands that resolved issues have low severity |
| Misreading mixed sentiment | "40 Jahre Mitglied, nie enttäuscht" | medium (0.35) | medium (0.39) | Both get this right, but LLMs often over-index on emotion words |

Cost & Latency Comparison

(Figure: cost vs F1 comparison)

| Model | Latency | Cost / 1K samples | Offline | Privacy |
|---|---|---|---|---|
| HowzerSeverity (ours) | ~306ms | $0 | Yes | Yes |
| Claude Sonnet 4.6 | ~2-5s | ~$9.00 | No | No |
| Claude Opus 4.6 | ~3-8s | ~$45.00 | No | No |
| Claude Haiku 4.5 | ~1-3s | ~$3.00 | No | No |
| mDeBERTa XNLI | ~560ms | $0 | Yes | Yes |
| German Sentiment BERT | ~55ms | $0 | Yes | Yes |

Key Takeaways

1. A fine-tuned 336M model beats all tested LLMs on domain-specific scoring. Our model achieves F1=1.000 on 48 stratified samples. The closest LLM (Claude Sonnet 4.6, ~70B+ params) reaches F1=0.981: impressive, but at 200x the parameters, ~$9/1K samples, and a mandatory cloud dependency.

2. Zero-shot approaches fundamentally fail. NLI models (mDeBERTa, BART) and sentiment models (German BERT, nlptown) score F1 < 0.46; no amount of prompt engineering can teach tier boundaries that require domain-calibrated training data.

3. LLMs get the extremes right but blur the middle. All Claude models achieve perfect critical-tier F1, but struggle with the medium/high boundary where domain-specific calibration matters most. This is exactly where customer service triage decisions are made.

4. You don't need a 100B+ parameter general-purpose LLM for structured scoring tasks. A 336M fine-tuned model runs 10-25x faster, costs $0 at inference, works fully offline with complete data privacy, and still outperforms models 200x its size.

Usage

ONNX Runtime (recommended for production)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load
session = ort.InferenceSession("howzer_severity.onnx")
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-large")

# Tokenize
text = "Pannenhilfe kam nach 4 Stunden immer noch nicht, Kind im Auto."
enc = tokenizer(text, max_length=128, padding="max_length",
                truncation=True, return_tensors="np")

# Inference
outputs = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})

severity_score = outputs[0].squeeze()   # 0.0 - 1.0
dimensions = outputs[1].squeeze()       # [emotional, impact, service, irreversibility]
tier_logits = outputs[2].squeeze()      # 4-class logits

tier_names = ["low", "medium", "high", "critical"]
tier = tier_names[tier_logits.argmax()]

print(f"Severity: {severity_score:.3f} ({tier})")
print(f"  Emotional:      {dimensions[0]:.3f}")
print(f"  Impact:         {dimensions[1]:.3f}")
print(f"  Service:        {dimensions[2]:.3f}")
print(f"  Irreversibility: {dimensions[3]:.3f}")
# Output: Severity: 0.582 (high)

Python Inference Helper

from inference import SeverityScorer

scorer = SeverityScorer.from_onnx("howzer_severity.onnx")
result = scorer.predict("Der ADAC hat mich komplett im Stich gelassen!")

print(f"Score: {result.severity_score:.3f}")
print(f"Tier: {result.tier}")
print(f"Dimensions: {result.dimensions}")

PyTorch (for research / fine-tuning)

import torch
from transformers import AutoTokenizer, AutoModel

# Load backbone + custom heads
# (requires src/severity/model.py from the training repo)
from src.severity.model import HowzerSeverityTransformer

model = HowzerSeverityTransformer(backbone_name="deepset/gbert-large")
ckpt = torch.load("phase2_best.pt", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-large")
enc = tokenizer("Absolute Frechheit!", max_length=128,
                padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(enc["input_ids"], enc["attention_mask"])
    print(f"Score: {out['severity_score'].item():.3f}")

Training Details

Two-Phase Strategy

Phase 1: Head Training (backbone frozen)

- Freeze all 335M gbert-large parameters
- Train only: severity encoder + 3 output heads (~312K params)
- 15 epochs, LR=1e-3 with cosine warmup
- Best: epoch 10, tier accuracy 79.2%, score MAE 0.053

Phase 2: Fine-tuning (last 2 BERT layers unfrozen)

- Unfreeze last 2 of 24 transformer layers + pooler (~26M trainable params)
- 25 epochs, LR=2e-5 with cosine warmup
- Best: epoch 14, tier accuracy 96.7%, score MAE 0.030
- Improvement over Phase 1: +17.5pp accuracy, -43% MAE

Loss Function

Multi-objective loss combining all three heads:

L = 1.5 * MSE(severity_score) + 0.8 * MSE(dimensions) + 3.0 * FocalLoss(tier_logits)

- Focal Loss (gamma=1.5) for tier classification handles class imbalance
- Class weights: [1.0, 1.2, 9.0, 50.0] for [low, medium, high, critical]

Data

  • 10,000 German customer feedback samples (ADAC automobile club reviews)
  • Silver labels generated by running the rule-based severity engine with full upstream pipeline (sentiment BERT + emotion keywords + root cause keywords + intent regex)
  • Tier distribution: 51.6% low, 42.4% medium, 5.8% high, 0.2% critical
  • Sqrt oversampling for minority tiers in training set (high ~3x, critical ~20x)
  • Stratified splits: 70% train / 15% val / 15% test
  • 449 golden labels with curated edge cases for robust evaluation

Optimizer

  • AdamW (betas=0.9/0.999, weight_decay=1e-2)
  • Cosine annealing with linear warmup
  • Gradient clipping at 1.0
  • Gradient accumulation: 2 steps (Phase 1), 8 steps (Phase 2) for effective batch size 32

Intended Use

  • Primary: Scoring severity of German customer feedback in customer service / CRM pipelines
  • Domain: Automotive clubs, insurance, service providers (trained on ADAC data)
  • Languages: German only (the backbone and training data are German-specific)

Out of Scope

  • Non-German text (model has not seen any non-German training data)
  • General sentiment analysis (this scores severity of bad experience, not positive/negative)
  • Non-customer-feedback text (academic writing, news, social media etc.)

Limitations

  • Silver labels from rule engine: The training labels were generated by a rule-based system, not human annotators. The model has learned to match and exceed the rule engine but may inherit systematic biases from it.
  • Critical tier data scarcity: Only 19 critical-tier samples in 10K records (0.2%). Critical F1=1.0 on test set but only 3 test samples β€” more validation needed.
  • ADAC domain specificity: Trained on automobile club customer feedback. Transfer to other German-language customer service domains is plausible but not validated.
  • Irreversibility dimension: Very sparse in training data (mean=0.001, most samples have 0.0). The model may underpredict this dimension for edge cases.

Ethical Considerations

  • This model scores experience severity, not customer value. It should be used to prioritize support for customers with the worst experiences, not to discriminate.
  • The model does not process or store personal data beyond the input text.
  • Predictions should be reviewed by human agents for critical-tier cases before taking action.

Files

| File | Size | Description |
|---|---|---|
| howzer_severity.onnx | 1.28 GB | ONNX model (opset 17, CPU inference) |
| config.json | 1 KB | Model configuration and metadata |
| inference.py | 4 KB | Python inference helper class |

Technical Specifications

| Spec | Value |
|---|---|
| ONNX opset | 17 |
| Input: input_ids | int64 [batch, 128] |
| Input: attention_mask | int64 [batch, 128] |
| Output: severity_score | float32 [batch, 1] |
| Output: dimensions | float32 [batch, 4] |
| Output: tier_logits | float32 [batch, 4] |
| Inference latency (CPU) | ~320ms / sample |
| PyTorch vs ONNX max diff | 1.9e-6 |

Citation

@misc{howzer-severity-transformer-2026,
  title={HowzerSeverityTransformer: End-to-End German Text-to-Severity Assessment},
  author={Lennard Gross},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/lennarddaw/howzer-severity-transformer},
}
