# Email Classifier (mmBERT-small ONNX, v8)
A dual-head mmBERT-small classifier for multilingual email category + action prediction, optimized for on-device inference using ONNX Runtime.
## Model Description
Classifies emails into 6 categories and predicts whether action is required:
| Category | Description |
|---|---|
| PERSONAL | 1:1 human communication, social messages, direct correspondence |
| NEWSLETTER | Subscribed editorial/digest content (curated articles, weekly roundups) |
| PROMOTIONAL | Marketing pushes, sales, discount offers, product launches |
| TRANSACTION | Orders, receipts, payments, shipping confirmations |
| ALERT | Security notices, account warnings, important notifications |
| SOCIAL | Social network notifications, community updates, reactions |
The PROMOTIONAL / NEWSLETTER split is new in v8; v6 (MiniLM) lumped both into NEWSLETTER, which made downstream filtering noisy.
## Output Format
Single forward pass producing two tensors:
- `category_probs`: `Float32[6]`, softmax probabilities per category (argmax = predicted category)
- `action_prob`: `Float32[1]`, sigmoid probability that action is required (threshold 0.5)
No text generation, no decoder, no beam search.
Example:
Input: "Subject: Your order has shipped\n\nBody: Your order #12345 is on its way..."
Output: category_probs β TRANSACTION (0.96), action_prob β 0.08 (NO_ACTION)
## Intended Use
- Primary: On-device email triage in multilingual mobile apps (iOS/Android)
- Runtime: ONNX Runtime React Native
- Use case: Prioritizing the inbox, filtering noise, and surfacing actionable emails for English- and French-speaking users
## Model Details
| Attribute | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-small (ModernBERT family, multilingual) |
| Parameters | ~140M |
| Architecture | mmBERT encoder (RoPE + GeGLU + alternating local/global attention) + dual classification heads |
| Pooling | Mean pooling over last_hidden_state (masked) |
| ONNX Size | 135.3 MB (INT8 dynamic per-channel quantized) |
| Max Sequence | 384 tokens |
| Tokenizer | Gemma 2 BPE (256K vocab) |
| Opset | 14 |
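The masked mean pooling noted above averages token embeddings while ignoring padding positions. A minimal PyTorch sketch of that pooling step, for reference (the function name is illustrative; the actual head code lives in `ml/scripts/train_classifier.py`):

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings, ignoring padded positions.

    last_hidden_state: [batch, seq, hidden]
    attention_mask:    [batch, seq], 1 for real tokens, 0 for padding
    """
    # Broadcast the mask over the hidden dimension so padding zeroes out.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, S, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [B, H]
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # [B, 1]
    return summed / counts
```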
## Performance
Evaluated on a held-out 321-sample multilingual test set (md5 37bbd4c08ae9338890ad5cc2656b5e6f):
| Metric | Score |
|---|---|
| Category Accuracy (overall) | 93.15% |
| Category Accuracy (English) | 93.49% |
| Category Accuracy (French) | 92.45% |
| Action Accuracy (overall) | 92.83% |
| Argmax-match vs PyTorch FP32 | 96.57% |
| Quantization | INT8 dynamic, per-channel (4× compression vs FP32) |
### Per-class Recall
| Class | n | Recall |
|---|---|---|
| ALERT | 60 | 85.00% |
| NEWSLETTER | 50 | 94.00% |
| PERSONAL | 50 | 98.00% |
| PROMOTIONAL | 60 | 95.00% |
| SOCIAL | 41 | 87.80% |
| TRANSACTION | 60 | 98.33% |
Multilingual stability: FR cat_acc holds within 1.04pp of EN, indicating the cross-lingual encoder generalizes evenly across the trained languages.
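For reference, the argmax-match figure above compares argmax agreement between the INT8 ONNX model and the FP32 PyTorch model on the same tokenized inputs. A minimal sketch of such a check, assuming a loaded `fp32_model` module (hypothetical name) and pre-tokenized int64 arrays; this is not the actual evaluation script:

```python
import onnxruntime as ort
import torch

session = ort.InferenceSession("model.onnx")

def argmax_match_rate(fp32_model, examples) -> float:
    """Fraction of examples where INT8 ONNX and FP32 PyTorch agree on argmax.

    `examples` yields (input_ids, attention_mask) as int64 numpy arrays [1, S].
    """
    matches, total = 0, 0
    for input_ids, attention_mask in examples:
        onnx_probs = session.run(
            ["category_probs"],
            {"input_ids": input_ids, "attention_mask": attention_mask},
        )[0]
        with torch.no_grad():
            torch_probs, _ = fp32_model(
                torch.from_numpy(input_ids), torch.from_numpy(attention_mask)
            )
        matches += int(onnx_probs.argmax() == torch_probs.argmax().item())
        total += 1
    return matches / total
```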
## Training Data
- Source: Personal Gmail inboxes (anonymized)
- Languages: English, French (joint-stratified balance by category × language)
- Labeling: Human-annotated with category + action flag
- Class weights: gentle (max 1.317, min 0.891); joint-stratified weighting prevents class collapse under quantization
- Input format: `Subject: ...\n\nBody: ...` (no instruction prefix); see the sketch after this list
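A minimal sketch of building that input string and tokenizing to the 384-token limit (the `format_email` helper is illustrative, not part of the released code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")

def format_email(subject: str, body: str) -> str:
    # Same "Subject: ...\n\nBody: ..." layout used at training time; no prefix.
    return f"Subject: {subject}\n\nBody: {body}"

text = format_email("Your order has shipped", "Your order #12345 is on its way...")
inputs = tokenizer(text, return_tensors="np", max_length=384, truncation=True)
print(inputs["input_ids"].shape)  # (1, S) with S <= 384
```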
## How to Use

### ONNX Runtime (React Native)
```javascript
import { InferenceSession, Tensor } from 'onnxruntime-react-native';

const session = await InferenceSession.create('model.onnx');

// Two inputs only; no token_type_ids (mmBERT does not use segment embeddings)
const outputs = await session.run({
  input_ids: inputIdsTensor,            // int64[1, S], S <= 384
  attention_mask: attentionMaskTensor,  // int64[1, S]
});

const categoryProbs = outputs.category_probs.data; // Float32[6]
const actionProb = outputs.action_prob.data[0];    // Float32

const CATEGORIES = ['ALERT', 'NEWSLETTER', 'PERSONAL', 'PROMOTIONAL', 'SOCIAL', 'TRANSACTION'];
const category = CATEGORIES[categoryProbs.indexOf(Math.max(...categoryProbs))];
const actionRequired = actionProb > 0.5;
```
### Python (PyTorch)
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")
# Load DualHeadClassifier (mean pooling + dual heads) from checkpoint
# (see ml/scripts/train_classifier.py for the head architecture)

text = "Subject: Meeting tomorrow\n\nBody: Can we reschedule to 3pm?"
inputs = tokenizer(text, return_tensors="pt", max_length=384, truncation=True)

with torch.no_grad():
    cat_probs, act_prob = model(inputs["input_ids"], inputs["attention_mask"])

categories = ["ALERT", "NEWSLETTER", "PERSONAL", "PROMOTIONAL", "SOCIAL", "TRANSACTION"]
category = categories[cat_probs.argmax()]
action = act_prob.item() > 0.5
```
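The exported ONNX model can also be exercised from Python with onnxruntime, using the same two input names and output names as the React Native example. A sketch (the int64 cast matters because the graph expects int64 IDs):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")
session = ort.InferenceSession("model.onnx")

text = "Subject: Meeting tomorrow\n\nBody: Can we reschedule to 3pm?"
enc = tokenizer(text, return_tensors="np", max_length=384, truncation=True)

outputs = session.run(
    ["category_probs", "action_prob"],
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)
category_probs = outputs[0].ravel()          # Float32[6]
action_prob = float(outputs[1].ravel()[0])   # Float32

categories = ["ALERT", "NEWSLETTER", "PERSONAL", "PROMOTIONAL", "SOCIAL", "TRANSACTION"]
print(categories[int(category_probs.argmax())], action_prob > 0.5)
```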
## Special Tokens (Gemma 2 BPE)

mmBERT uses Gemma 2's tokenizer; IDs differ from XLM-R/MiniLM:
| Token | ID |
|---|---|
| `<pad>` | 0 |
| `<eos>` | 1 |
| `<bos>` | 2 |
| `<unk>` | 3 |
Sequence wrap: `[<bos>, ...content..., <eos>]`. There is no `[CLS]` / `[SEP]`; that's XLM-R territory.
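If you need to confirm these IDs at runtime (for example, when porting the tokenizer), they can be read straight from the tokenizer. A quick sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")

for token in ["<pad>", "<eos>", "<bos>", "<unk>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
# Expected: <pad> 0, <eos> 1, <bos> 2, <unk> 3

# Per the sequence wrap above, encoded IDs should start with 2 (<bos>)
# and end with 1 (<eos>).
print(tokenizer("hi")["input_ids"])
```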
## Files

| File | Size | Description |
|---|---|---|
| `model.onnx` | 135.3 MB | INT8 quantized ONNX model |
| `tokenizer.json` | 32.8 MB | Gemma 2 BPE tokenizer (256K vocab) |
| `tokenizer_config.json` | 45 KB | Tokenizer configuration |
| `special_tokens_map.json` | 1 KB | Special token IDs |
| `export_metadata.json` | 1 KB | Provenance + canonical metrics |
## Architecture

```
Input → mmBERT Encoder (22 layers, 384 hidden, RoPE + GeGLU)
              ↓
Mean-pool over last_hidden_state (masked by attention_mask)
              ↓
       ┌──────┴──────┐
       ↓             ↓
 Category Head   Action Head
 Linear(384→6)   Linear(384→1)
       ↓             ↓
    softmax        sigmoid
```
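A hedged PyTorch sketch of a dual-head module matching this diagram (the real `DualHeadClassifier` lives in `ml/scripts/train_classifier.py`; the class and variable names here are illustrative):

```python
import torch.nn as nn
from transformers import AutoModel

class DualHeadSketch(nn.Module):
    """Mean-pooled mmBERT encoder with category + action heads."""

    def __init__(self, base: str = "jhu-clsp/mmBERT-small"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size  # 384 for mmBERT-small
        self.category_head = nn.Linear(hidden, 6)
        self.action_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Masked mean pooling, as in the diagram above.
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        category_probs = self.category_head(pooled).softmax(dim=-1)   # [B, 6]
        action_prob = self.action_head(pooled).sigmoid().squeeze(-1)  # [B]
        return category_probs, action_prob
```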
## Compared to Previous Model (MiniLM v6)

| | MiniLM v6 | mmBERT-small v8 (this) |
|---|---|---|
| Base | XLM-R MiniLM-L12 | mmBERT-small (ModernBERT family) |
| Schema | 5 classes | 6 classes (PROMOTIONAL added) |
| Languages tracked | English-dominant | English + French (balanced) |
| Vocab | 250K SentencePiece Unigram | 256K Gemma 2 BPE |
| Max sequence | 256 | 384 |
| Inputs | input_ids, attention_mask, token_type_ids | input_ids, attention_mask |
| Bundle size | 113 MB | 135 MB |
| Cat acc | ~92.0% (5-class) | 93.15% (6-class, harder schema) |
| FR cat acc | not tracked | 92.45% |
## Limitations
- Trained on English and French only; may not generalize to other languages despite the multilingual base
- Personal/consumer email patterns; may not generalize to enterprise/corporate email
- The PROMOTIONAL / NEWSLETTER decision boundary is genuinely fuzzy; expect some legitimate disagreements with human raters at this boundary
- Action accuracy lost ~2.2pp under INT8 quantization vs FP32 (95.02% → 92.83%); the action head is a single Linear(384→1) and is more quantization-sensitive than the 6-class softmax
- 256K vocab tokenizer is oversized for an EN+FR-only deployment but is required to use the pretrained mmBERT weights without retraining
## Notes on Quantization

This model uses INT8 dynamic per-channel quantization via `onnxruntime.quantization.quantize_dynamic(weight_type=QInt8, per_channel=True)` on a clean FP32 ONNX export. Two export-time fixes were required to preserve accuracy through quantization on this architecture:

- The export wrapper drops the unused `token_type_ids` path (mmBERT has no segment embeddings; an unused embedding lookup contaminates the shared 256K-vocab embedding's scale calibration).
- `model.encoder.config.reference_compile = False` is set before `torch.onnx.export(..., dynamo=False)` so the legacy tracer can trace through the `tok_embeddings` lookup directly instead of the `compiled_embeddings` `torch.compile` shim.
With both fixes, INT8 dynamic per-channel quantization preserves the FP32 cat_acc almost exactly. Static INT8 calibration (percentile/entropy) was attempted but proved empirically infeasible on ModernBERT-style graphs in ORT 1.24: peak RAM blows up while calibrating the wide-FFN and full-attention activation tensors.
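Put together, a hedged sketch of the export-plus-quantization pipeline described above (file names and the `model` variable are illustrative; the actual export script is not reproduced here):

```python
import torch
from onnxruntime.quantization import QuantType, quantize_dynamic

# model: the trained dual-head classifier, wrapped to take only the two used inputs
model.encoder.config.reference_compile = False  # trace tok_embeddings directly
model.eval()

dummy_ids = torch.zeros(1, 384, dtype=torch.int64)
dummy_mask = torch.ones(1, 384, dtype=torch.int64)

torch.onnx.export(
    model,
    (dummy_ids, dummy_mask),  # no token_type_ids input
    "model_fp32.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["category_probs", "action_prob"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
    dynamo=False,  # legacy tracer, per the fix above
)

quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```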
## License
Apache 2.0