# HowzerRiskScorer: Customer Feedback Risk Assessment

A lightweight MLP model (80K parameters) that scores customer feedback risk across multiple dimensions. Designed for German-language customer service contexts (e.g., automobile clubs, insurance).

## Model Description

HowzerRiskScorer replaces a 2,000-line rule-based risk engine with a learned model that takes pre-computed NLP pipeline features and outputs multi-dimensional risk scores.

**Architecture:** MLP with BatchNorm + GELU + Dropout (256 → 128 → 64) with multi-head outputs.

| Property | Value |
|---|---|
| Parameters | 79,978 |
| Input | 120-dim feature vector |
| Training data | 4,351 German customer feedback samples |
| Training framework | PyTorch (manual loop) |
| Export formats | ONNX, SafeTensors |

## Input Features (120 dimensions)

The model expects a pre-computed feature vector from an upstream NLP pipeline:

| Group | Dims | Source |
|---|---|---|
| Sentiment scores + confidence | 4 | Sentiment classifier (e.g., german-sentiment-bert) |
| Emotion scores (27 GoEmotions) | 27 | Emotion classifier (e.g., mDeBERTa) |
| Root cause (top-5 scores + one-hot) | 35 | Root cause analyzer (e.g., E5 embeddings) |
| Customer history | 10 | CRM system |
| Channel (one-hot) | 9 | Input metadata |
| Text metadata | 6 | Text statistics |
| Derived interaction features | 29 | Computed from above |
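The seven groups above concatenate to the 120-dim input vector. A hypothetical assembly helper is sketched below; the function name, argument names, and ordering are illustrative, not the real pipeline API — only the group sizes and their order come from the table:

```python
import numpy as np

# Hypothetical helper: concatenate pipeline outputs into the 120-dim input.
# Group order and sizes follow the feature table; names are illustrative.
def assemble_features(sentiment, emotions, root_cause, history,
                      channel, text_meta, derived):
    parts = [
        np.asarray(sentiment, dtype=np.float32),   # 4 dims
        np.asarray(emotions, dtype=np.float32),    # 27 dims
        np.asarray(root_cause, dtype=np.float32),  # 35 dims
        np.asarray(history, dtype=np.float32),     # 10 dims
        np.asarray(channel, dtype=np.float32),     # 9 dims
        np.asarray(text_meta, dtype=np.float32),   # 6 dims
        np.asarray(derived, dtype=np.float32),     # 29 dims
    ]
    vec = np.concatenate(parts)
    assert vec.shape == (120,), f"expected 120 dims, got {vec.shape}"
    return vec[None, :]  # shape (1, 120) for batched inference
```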

## Outputs

| Output | Shape | Range | Description |
|---|---|---|---|
| `global_risk` | (1,) | 0.0-1.0 | Overall risk score |
| `tier_logits` | (4,) | logits | Risk tier: low/medium/high/critical |
| `escalation` | (1,) | 0.0-1.0 | Escalation likelihood |
| `churn` | (1,) | 0.0-1.0 | Customer churn risk |
| `brand` | (1,) | 0.0-1.0 | Brand/reputation risk |
| `revenue` | (1,) | 0.0-1.0 | Financial/revenue risk |
| `confidence` | (1,) | 0.0-1.0 | Calibrated prediction confidence |

**Tier thresholds:** critical ≥ 0.88, high ≥ 0.72, medium ≥ 0.38, low < 0.38
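Assuming these thresholds are applied to the continuous `global_risk` score (`tier_logits` gives the model's own tier prediction; which one is authoritative is an assumption here), a minimal mapping looks like:

```python
# Map the continuous global_risk score to a tier label using the thresholds
# above, checked from highest to lowest.
def risk_tier(global_risk: float) -> str:
    if global_risk >= 0.88:
        return "critical"
    if global_risk >= 0.72:
        return "high"
    if global_risk >= 0.38:
        return "medium"
    return "low"
```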

## Performance

Evaluated on a held-out test set (437 samples) of German customer feedback:

| Metric | MLP v0.4 | XGBoost | Target |
|---|---|---|---|
| Tier F1 (weighted) | 0.850 | 0.860 | > 0.80 ✅ |
| Tier F1 (critical) | 0.250 | 0.222 | > 0.70 ❌ |
| Score MAE | 0.078 | 0.082 | < 0.08 ✅ |
| Score MSE | 0.023 | 0.021 | < 0.02 ❌ |
| Calibration ECE | 0.020 | 0.051 | < 0.05 ✅ |
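For reference, a minimal NumPy sketch of Expected Calibration Error (ECE) with equal-width confidence bins; the project's exact bin count and binning scheme are assumptions:

```python
import numpy as np

# Sketch of Expected Calibration Error: weighted average gap between mean
# confidence and accuracy within equal-width confidence bins.
def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```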

### Per-tier F1

| Tier | F1 Score | Test Samples |
|---|---|---|
| Low | 0.923 | 365 |
| Medium | 0.174 | 19 |
| High | 0.649 | 45 |
| Critical | 0.250 | 8 |

## Usage

### ONNX Runtime (recommended)

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("howzer_risk_scorer.onnx")
features = np.random.randn(1, 120).astype(np.float32)  # your feature vector

outputs = session.run(None, {"features": features})
escalation, churn, brand, revenue, global_risk, tier_logits, confidence = outputs

tier_names = ["low", "medium", "high", "critical"]
predicted_tier = tier_names[tier_logits[0].argmax()]
print(f"Risk: {global_risk[0,0]:.3f}, Tier: {predicted_tier}")
```

### Python Inference Helper

```python
from inference import RiskScorer

scorer = RiskScorer.from_onnx("howzer_risk_scorer.onnx")
# features: (1, 120) float32 vector from the NLP pipeline
result = scorer.predict(features)  # returns RiskAssessment dataclass

print(f"Global risk: {result.global_risk:.3f}")
print(f"Tier: {result.tier} (confidence: {result.confidence:.3f})")
print(f"Components: esc={result.escalation:.3f}, churn={result.churn:.3f}")
```

### PyTorch

```python
import torch
from safetensors.torch import load_file
from src.model.risk_scorer import HowzerRiskScorer

model = HowzerRiskScorer(input_dim=120, hidden_dims=(256, 128, 64))
# SafeTensors files are not torch.load-compatible; use safetensors.torch
model.load_state_dict(load_file("model.safetensors"))
model.eval()

with torch.no_grad():
    preds = model(torch.randn(1, 120))
    print(f"Risk: {preds['global_risk'].item():.3f}")
```

## Training Details

- **Data:** 4,351 unique German customer feedback samples (4,000 from benchmark corpus + 351 from curated golden set)
- **Silver labels:** Generated by rule-based risk engine v6.5.2
- **Oversampling:** sqrt strategy for class imbalance (critical tier: 1.7% of data)
- **Loss:** Focal loss (γ=1.5) + MSE score regression + BCE confidence calibration
- **Optimizer:** AdamW (lr=5e-4, weight_decay=1e-2)
- **Schedule:** Cosine annealing with linear warmup
- **Early stopping:** Patience 25, best checkpoint by validation loss
- **Hardware:** AMD iGPU via DirectML
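The focal-loss term above down-weights easy examples so the rare critical tier contributes more to the gradient. A NumPy sketch with γ=1.5 (the actual training loop is PyTorch; class weighting and reduction details here are assumptions):

```python
import numpy as np

# Sketch of multi-class focal loss: scale cross-entropy by (1 - p_t)^gamma,
# where p_t is the predicted probability of the true class.
def focal_loss(logits, targets, gamma=1.5):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p_t = probs[np.arange(len(targets)), targets]        # prob of true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With γ=0 this reduces to ordinary cross-entropy; larger γ suppresses the loss on well-classified samples.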

## Benchmark: MLP vs LLMs

We compared the 80K-parameter MLP against general-purpose LLMs on the same risk scoring task. LLMs received the raw German feedback text and were prompted to produce structured risk assessments.

| Model | Params | F1 (weighted) | Accuracy | Score MAE | Latency/sample |
|---|---|---|---|---|---|
| MLP v0.4 | 80K | 0.843 | 0.880 | 0.084 | ~0.001s |
| XGBoost | - | 0.852 | 0.880 | 0.082 | ~0.01s |
| qwen2.5:7b | 7B | 0.083 | 0.080 | 0.614 | 27s |
| llama3-german-instruct | 8B | 0.136 | 0.140 | 0.568 | 35s |

Key findings:

- The domain-specific MLP achieves roughly 10x higher accuracy than 7-8B-parameter LLMs
- LLMs are ~27,000x slower per inference
- LLMs interpret "risk" differently than the calibrated rule engine, producing scores far from ground truth (MAE ~0.6 vs. ~0.08)
- This demonstrates the value of small, domain-specific models over general-purpose LLMs for structured prediction tasks

Benchmark run on 50 German customer feedback samples. LLMs were prompted with text only; the MLP used pre-computed pipeline features.

## Limitations

- Critical tier performance is limited (F1 = 0.25) due to data scarcity: only 74 critical samples exist in the entire dataset
- Requires pre-computed NLP features (sentiment, emotion, root cause); cannot process raw text directly
- Trained on ADAC (German automobile club) customer feedback; domain transfer to other contexts is not validated
- Silver labels come from the rule engine, not human annotators

## Files

| File | Size | Description |
|---|---|---|
| howzer_risk_scorer.onnx | 330 KB | ONNX model (opset 17) |
| model.safetensors | 319 KB | PyTorch weights |
| config.json | 1 KB | Model configuration |
| inference.py | 4 KB | Python inference helper |

## Citation

```bibtex
@misc{howzer-risk-scorer-2026,
  title={HowzerRiskScorer: Learned Customer Feedback Risk Assessment},
  year={2026},
  publisher={Hugging Face},
}
```