# HowzerRiskScorer: Customer Feedback Risk Assessment

A lightweight MLP model (80K parameters) that scores customer feedback risk across multiple dimensions. Designed for German-language customer service contexts (e.g., automobile clubs, insurance).

## Model Description

HowzerRiskScorer replaces a 2,000-line rule-based risk engine with a learned model that takes pre-computed NLP pipeline features and outputs multi-dimensional risk scores.

**Architecture:** MLP with BatchNorm + GELU + Dropout (256 → 128 → 64) with multi-head outputs.

| Property | Value |
|---|---|
| Parameters | 79,978 |
| Input | 120-dim feature vector |
| Training data | 4,351 German customer feedback samples |
| Training framework | PyTorch (manual loop) |
| Export formats | ONNX, SafeTensors |

## Input Features (120 dimensions)

The model expects a pre-computed feature vector from an upstream NLP pipeline:

| Group | Dims | Source |
|---|---|---|
| Sentiment scores + confidence | 4 | Sentiment classifier (e.g., german-sentiment-bert) |
| Emotion scores (27 GoEmotions) | 27 | Emotion classifier (e.g., mDeBERTa) |
| Root cause (top-5 scores + one-hot) | 35 | Root cause analyzer (e.g., E5 embeddings) |
| Customer history | 10 | CRM system |
| Channel (one-hot) | 9 | Input metadata |
| Text metadata | 6 | Text statistics |
| Derived interaction features | 29 | Computed from above |
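The seven groups above concatenate to the 120-dim input vector. A hypothetical assembly helper is sketched below; the function name, argument names, and ordering are illustrative, not the real pipeline API — only the group sizes and their order come from the table:

```python
import numpy as np

# Hypothetical helper: concatenate pipeline outputs into the 120-dim input.
# Group order and sizes follow the feature table; names are illustrative.
def assemble_features(sentiment, emotions, root_cause, history,
                      channel, text_meta, derived):
    parts = [
        np.asarray(sentiment, dtype=np.float32),   # 4 dims
        np.asarray(emotions, dtype=np.float32),    # 27 dims
        np.asarray(root_cause, dtype=np.float32),  # 35 dims
        np.asarray(history, dtype=np.float32),     # 10 dims
        np.asarray(channel, dtype=np.float32),     # 9 dims
        np.asarray(text_meta, dtype=np.float32),   # 6 dims
        np.asarray(derived, dtype=np.float32),     # 29 dims
    ]
    vec = np.concatenate(parts)
    assert vec.shape == (120,), f"expected 120 dims, got {vec.shape}"
    return vec[None, :]  # shape (1, 120) for batched inference
```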

## Outputs

| Output | Shape | Range | Description |
|---|---|---|---|
| `global_risk` | (1,) | 0.0-1.0 | Overall risk score |
| `tier_logits` | (4,) | logits | Risk tier: low/medium/high/critical |
| `escalation` | (1,) | 0.0-1.0 | Escalation likelihood |
| `churn` | (1,) | 0.0-1.0 | Customer churn risk |
| `brand` | (1,) | 0.0-1.0 | Brand/reputation risk |
| `revenue` | (1,) | 0.0-1.0 | Financial/revenue risk |
| `confidence` | (1,) | 0.0-1.0 | Calibrated prediction confidence |

**Tier thresholds:** critical ≥ 0.88, high ≥ 0.72, medium ≥ 0.38, low < 0.38
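Assuming these thresholds are applied to the continuous `global_risk` score (`tier_logits` gives the model's own tier prediction; which one is authoritative is an assumption here), a minimal mapping looks like:

```python
# Map the continuous global_risk score to a tier label using the thresholds
# above, checked from highest to lowest.
def risk_tier(global_risk: float) -> str:
    if global_risk >= 0.88:
        return "critical"
    if global_risk >= 0.72:
        return "high"
    if global_risk >= 0.38:
        return "medium"
    return "low"
```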

## Performance

Evaluated on a held-out test set (437 samples) of German customer feedback:

| Metric | MLP v0.4 | XGBoost | Target |
|---|---|---|---|
| Tier F1 (weighted) | 0.850 | 0.860 | > 0.80 ✅ |
| Tier F1 (critical) | 0.250 | 0.222 | > 0.70 ❌ |
| Score MAE | 0.078 | 0.082 | < 0.08 ✅ |
| Score MSE | 0.023 | 0.021 | < 0.02 ❌ |
| Calibration ECE | 0.020 | 0.051 | < 0.05 ✅ |
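For reference, a minimal NumPy sketch of Expected Calibration Error (ECE) with equal-width confidence bins; the project's exact bin count and binning scheme are assumptions:

```python
import numpy as np

# Sketch of Expected Calibration Error: weighted average gap between mean
# confidence and accuracy within equal-width confidence bins.
def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```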

### Per-tier F1

| Tier | F1 Score | Test Samples |
|---|---|---|
| Low | 0.923 | 365 |
| Medium | 0.174 | 19 |
| High | 0.649 | 45 |
| Critical | 0.250 | 8 |

## Usage

### ONNX Runtime (recommended)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("howzer_risk_scorer.onnx")
features = np.random.randn(1, 120).astype(np.float32)  # your feature vector

outputs = session.run(None, {"features": features})
escalation, churn, brand, revenue, global_risk, tier_logits, confidence = outputs

tier_names = ["low", "medium", "high", "critical"]
predicted_tier = tier_names[tier_logits[0].argmax()]
print(f"Risk: {global_risk[0,0]:.3f}, Tier: {predicted_tier}")
```

### Python Inference Helper

```python
from inference import RiskScorer

scorer = RiskScorer.from_onnx("howzer_risk_scorer.onnx")
# features: (1, 120) float32 vector from the NLP pipeline
result = scorer.predict(features)  # returns RiskAssessment dataclass

print(f"Global risk: {result.global_risk:.3f}")
print(f"Tier: {result.tier} (confidence: {result.confidence:.3f})")
print(f"Components: esc={result.escalation:.3f}, churn={result.churn:.3f}")
```

### PyTorch

```python
import torch
from safetensors.torch import load_file
from src.model.risk_scorer import HowzerRiskScorer

model = HowzerRiskScorer(input_dim=120, hidden_dims=(256, 128, 64))
# SafeTensors files are not torch.load-compatible; use safetensors.torch
model.load_state_dict(load_file("model.safetensors"))
model.eval()

with torch.no_grad():
    preds = model(torch.randn(1, 120))
    print(f"Risk: {preds['global_risk'].item():.3f}")
```

## Training Details

- **Data:** 4,351 unique German customer feedback samples (4,000 from benchmark corpus + 351 from curated golden set)
- **Silver labels:** Generated by rule-based risk engine v6.5.2
- **Oversampling:** sqrt strategy for class imbalance (critical tier: 1.7% of data)
- **Loss:** Focal loss (γ=1.5) + MSE score regression + BCE confidence calibration
- **Optimizer:** AdamW (lr=5e-4, weight_decay=1e-2)
- **Schedule:** Cosine annealing with linear warmup
- **Early stopping:** Patience 25, best checkpoint by validation loss
- **Hardware:** AMD iGPU via DirectML
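The focal-loss term above down-weights easy examples so the rare critical tier contributes more to the gradient. A NumPy sketch with γ=1.5 (the actual training loop is PyTorch; class weighting and reduction details here are assumptions):

```python
import numpy as np

# Sketch of multi-class focal loss: scale cross-entropy by (1 - p_t)^gamma,
# where p_t is the predicted probability of the true class.
def focal_loss(logits, targets, gamma=1.5):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p_t = probs[np.arange(len(targets)), targets]        # prob of true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With γ=0 this reduces to ordinary cross-entropy; larger γ suppresses the loss on well-classified samples.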

## Benchmark: MLP vs LLMs

We compared the 80K-parameter MLP against general-purpose LLMs on the same risk scoring task. LLMs received the raw German feedback text and were prompted to produce structured risk assessments.

| Model | Params | F1 (weighted) | Accuracy | Score MAE | Latency/sample |
|---|---|---|---|---|---|
| MLP v0.4 | 80K | 0.843 | 0.880 | 0.084 | ~0.001s |
| XGBoost | - | 0.852 | 0.880 | 0.082 | ~0.01s |
| qwen2.5:7b | 7B | 0.083 | 0.080 | 0.614 | 27s |
| llama3-german-instruct | 8B | 0.136 | 0.140 | 0.568 | 35s |

Key findings:

- The domain-specific MLP achieves roughly 10x higher accuracy than 7-8B-parameter LLMs
- LLMs are ~27,000x slower per inference
- LLMs interpret "risk" differently than the calibrated rule engine, producing scores far from ground truth (MAE ~0.6 vs. ~0.08)
- This demonstrates the value of small, domain-specific models over general-purpose LLMs for structured prediction tasks

Benchmark run on 50 German customer feedback samples. LLMs were prompted with text only; the MLP used pre-computed pipeline features.

## Limitations

- Critical tier performance is limited (F1 = 0.25) due to data scarcity: only 74 critical samples exist in the entire dataset
- Requires pre-computed NLP features (sentiment, emotion, root cause); cannot process raw text directly
- Trained on ADAC (German automobile club) customer feedback; domain transfer to other contexts is not validated
- Silver labels come from the rule engine, not human annotators

## Files

| File | Size | Description |
|---|---|---|
| howzer_risk_scorer.onnx | 330 KB | ONNX model (opset 17) |
| model.safetensors | 319 KB | PyTorch weights |
| config.json | 1 KB | Model configuration |
| inference.py | 4 KB | Python inference helper |

## Citation

```bibtex
@misc{howzer-risk-scorer-2026,
  title={HowzerRiskScorer: Learned Customer Feedback Risk Assessment},
  year={2026},
  publisher={Hugging Face},
}
```