# HowzerRiskScorer – Customer Feedback Risk Assessment
A lightweight MLP model (80K parameters) that scores customer feedback risk across multiple dimensions. Designed for German-language customer service contexts (e.g., automobile clubs, insurance).
## Model Description
HowzerRiskScorer replaces a 2,000-line rule-based risk engine with a learned model that takes pre-computed NLP pipeline features and outputs multi-dimensional risk scores.
Architecture: MLP with BatchNorm + GELU + Dropout (256 → 128 → 64) with multi-head outputs.
| Property | Value |
|---|---|
| Parameters | 79,978 |
| Input | 120-dim feature vector |
| Training data | 4,351 German customer feedback samples |
| Training framework | PyTorch (manual loop) |
| Export formats | ONNX, SafeTensors |
## Input Features (120 dimensions)
The model expects a pre-computed feature vector from an upstream NLP pipeline:
| Group | Dims | Source |
|---|---|---|
| Sentiment scores + confidence | 4 | Sentiment classifier (e.g., german-sentiment-bert) |
| Emotion scores (27 GoEmotions) | 27 | Emotion classifier (e.g., mDeBERTa) |
| Root cause (top-5 scores + one-hot) | 35 | Root cause analyzer (e.g., E5 embeddings) |
| Customer history | 10 | CRM system |
| Channel (one-hot) | 9 | Input metadata |
| Text metadata | 6 | Text statistics |
| Derived interaction features | 29 | Computed from above |
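The feature groups above must be concatenated in a fixed order to form the 120-dim input. The sketch below is a hypothetical assembly helper; the group order and the function name are assumptions for illustration, so consult the upstream pipeline (or `config.json`) for the authoritative layout.

```python
import numpy as np

def build_feature_vector(sentiment, emotions, root_cause, history,
                         channel, text_meta, interactions):
    """Concatenate the seven feature groups into a (1, 120) float32 vector.

    Group dims follow the table above: 4 + 27 + 35 + 10 + 9 + 6 + 29 = 120.
    """
    groups = [
        ("sentiment", sentiment, 4),
        ("emotions", emotions, 27),
        ("root_cause", root_cause, 35),
        ("history", history, 10),
        ("channel", channel, 9),
        ("text_meta", text_meta, 6),
        ("interactions", interactions, 29),
    ]
    parts = []
    for name, arr, dim in groups:
        arr = np.asarray(arr, dtype=np.float32)
        assert arr.shape == (dim,), f"{name}: expected ({dim},), got {arr.shape}"
        parts.append(arr)
    return np.concatenate(parts)[None, :]  # add batch dim -> (1, 120)
```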
## Outputs

| Output | Shape | Range | Description |
|---|---|---|---|
| `global_risk` | (1,) | 0.0-1.0 | Overall risk score |
| `tier_logits` | (4,) | logits | Risk tier: low/medium/high/critical |
| `escalation` | (1,) | 0.0-1.0 | Escalation likelihood |
| `churn` | (1,) | 0.0-1.0 | Customer churn risk |
| `brand` | (1,) | 0.0-1.0 | Brand/reputation risk |
| `revenue` | (1,) | 0.0-1.0 | Financial/revenue risk |
| `confidence` | (1,) | 0.0-1.0 | Calibrated prediction confidence |

Tier thresholds: critical ≥ 0.88, high ≥ 0.72, medium ≥ 0.38, low < 0.38
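Applied to the `global_risk` score, these thresholds can be sketched as a simple lookup (the function name is illustrative, not part of the shipped API):

```python
def tier_from_score(global_risk: float) -> str:
    """Map a global risk score to a tier using the model card thresholds:
    critical >= 0.88, high >= 0.72, medium >= 0.38, low < 0.38."""
    if global_risk >= 0.88:
        return "critical"
    if global_risk >= 0.72:
        return "high"
    if global_risk >= 0.38:
        return "medium"
    return "low"
```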
## Performance
Evaluated on a held-out test set (437 samples) of German customer feedback:
| Metric | MLP v0.4 | XGBoost | Target |
|---|---|---|---|
| Tier F1 (weighted) | 0.850 | 0.860 | > 0.80 ✅ |
| Tier F1 (critical) | 0.250 | 0.222 | > 0.70 ❌ |
| Score MAE | 0.078 | 0.082 | < 0.08 ✅ |
| Score MSE | 0.023 | 0.021 | < 0.02 ❌ |
| Calibration ECE | 0.020 | 0.051 | < 0.05 ✅ |
### Per-tier F1
| Tier | F1 Score | Test Samples |
|---|---|---|
| Low | 0.923 | 365 |
| Medium | 0.174 | 19 |
| High | 0.649 | 45 |
| Critical | 0.250 | 8 |
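The headline weighted F1 is the support-weighted mean of the per-tier scores above, which can be verified directly:

```python
# Per-tier F1 and test-set support, taken from the table above.
f1 = {"low": 0.923, "medium": 0.174, "high": 0.649, "critical": 0.250}
support = {"low": 365, "medium": 19, "high": 45, "critical": 8}

total = sum(support.values())  # 437 test samples
weighted_f1 = sum(f1[t] * support[t] for t in f1) / total
print(f"{weighted_f1:.3f}")  # 0.850, matching the headline metric
```

Note how heavily the low tier (365 of 437 samples) dominates this average; the weak medium and critical scores barely move it.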
## Usage
### ONNX Runtime (recommended)
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("howzer_risk_scorer.onnx")
features = np.random.randn(1, 120).astype(np.float32)  # your feature vector

outputs = session.run(None, {"features": features})
escalation, churn, brand, revenue, global_risk, tier_logits, confidence = outputs

tier_names = ["low", "medium", "high", "critical"]
predicted_tier = tier_names[tier_logits[0].argmax()]
print(f"Risk: {global_risk[0,0]:.3f}, Tier: {predicted_tier}")
```
### Python Inference Helper
```python
from inference import RiskScorer

scorer = RiskScorer.from_onnx("howzer_risk_scorer.onnx")
result = scorer.predict(features)  # returns RiskAssessment dataclass

print(f"Global risk: {result.global_risk:.3f}")
print(f"Tier: {result.tier} (confidence: {result.confidence:.3f})")
print(f"Components: esc={result.escalation:.3f}, churn={result.churn:.3f}")
```
### PyTorch
```python
import torch
from safetensors.torch import load_file  # torch.load cannot read .safetensors
from src.model.risk_scorer import HowzerRiskScorer

model = HowzerRiskScorer(input_dim=120, hidden_dims=(256, 128, 64))
model.load_state_dict(load_file("model.safetensors"))
model.eval()

with torch.no_grad():
    preds = model(torch.randn(1, 120))
print(f"Risk: {preds['global_risk'].item():.3f}")
```
## Training Details
- Data: 4,351 unique German customer feedback samples (4,000 from benchmark corpus + 351 from curated golden set)
- Silver labels: Generated by rule-based risk engine v6.5.2
- Oversampling: sqrt strategy for class imbalance (critical tier: 1.7% of data)
- Loss: Focal loss (γ=1.5) + MSE score regression + BCE confidence calibration
- Optimizer: AdamW (lr=5e-4, weight_decay=1e-2)
- Schedule: Cosine annealing with linear warmup
- Early stopping: Patience 25, best checkpoint by validation loss
- Hardware: AMD iGPU via DirectML
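The tier-classification term above is a focal loss with γ=1.5, which down-weights well-classified examples so the rare critical tier contributes more to the gradient. A minimal NumPy sketch of the standard formulation FL(p_t) = -(1 - p_t)^γ · log(p_t) (the actual training loop uses PyTorch; this is illustrative only):

```python
import numpy as np

def focal_loss(probs, targets, gamma=1.5):
    """Multi-class focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t).

    probs:   (N, C) softmax probabilities
    targets: (N,) integer class labels
    Returns the mean loss over the batch. gamma=0 recovers cross-entropy.
    """
    p_t = probs[np.arange(len(targets)), targets]  # prob of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With γ=1.5, a confidently correct sample (p_t = 0.9) contributes almost nothing, while an uncertain one (p_t = 0.25) keeps most of its cross-entropy weight, which is why it pairs well with the sqrt oversampling of the 1.7%-frequency critical tier.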
## Benchmark: MLP vs LLMs
We compared the 80K-parameter MLP against general-purpose LLMs on the same risk scoring task. LLMs received the raw German feedback text and were prompted to produce structured risk assessments.
| Model | Params | F1 (weighted) | Accuracy | Score MAE | Latency/sample |
|---|---|---|---|---|---|
| MLP v0.4 | 80K | 0.843 | 0.880 | 0.084 | ~0.001s |
| XGBoost | - | 0.852 | 0.880 | 0.082 | ~0.01s |
| qwen2.5:7b | 7B | 0.083 | 0.080 | 0.614 | 27s |
| llama3-german-instruct | 8B | 0.136 | 0.140 | 0.568 | 35s |
Key findings:
- The domain-specific MLP achieves roughly 6-11x higher accuracy than 7-8B parameter LLMs (0.880 vs 0.080-0.140)
- LLMs are roughly 27,000x slower per sample (27-35 s vs ~1 ms)
- LLMs interpret "risk" differently than the calibrated rule engine, resulting in scores far from ground truth (MAE ~0.6 vs 0.08)
- This demonstrates the value of domain-specific fine-tuning over general-purpose LLMs for structured prediction tasks
Benchmark on 50 German customer feedback samples. LLMs were prompted with text only; the MLP used pre-computed pipeline features.
## Limitations
- Critical tier performance is limited (F1 = 0.25) due to data scarcity: only 74 critical samples in the entire dataset
- Requires pre-computed NLP features (sentiment, emotion, root cause); cannot process raw text directly
- Trained on ADAC (German automobile club) customer feedback; domain transfer to other contexts is not validated
- Silver labels come from the rule engine, not human annotators
## Files

| File | Size | Description |
|---|---|---|
| `howzer_risk_scorer.onnx` | 330 KB | ONNX model (opset 17) |
| `model.safetensors` | 319 KB | PyTorch weights |
| `config.json` | 1 KB | Model configuration |
| `inference.py` | 4 KB | Python inference helper |
## Citation

```bibtex
@misc{howzer-risk-scorer-2026,
  title={HowzerRiskScorer: Learned Customer Feedback Risk Assessment},
  year={2026},
  publisher={Hugging Face},
}
```