# Email Classifier (mmBERT-small ONNX, v8)
A dual-head mmBERT-small classifier for multilingual email category + action prediction, optimized for on-device inference using ONNX Runtime.
## Model Description
Classifies emails into 6 categories and predicts whether action is required:
| Category | Description |
|---|---|
| PERSONAL | 1:1 human communication, social messages, direct correspondence |
| NEWSLETTER | Subscribed editorial/digest content (curated articles, weekly roundups) |
| PROMOTIONAL | Marketing pushes, sales, discount offers, product launches |
| TRANSACTION | Orders, receipts, payments, shipping confirmations |
| ALERT | Security notices, account warnings, important notifications |
| SOCIAL | Social network notifications, community updates, reactions |
The PROMOTIONAL / NEWSLETTER split is new in v8; v6 (MiniLM) lumped both into NEWSLETTER, which made downstream filtering noisy.
## Output Format
Single forward pass producing two tensors:
- `category_probs`: `Float32[6]`, softmax probabilities per category (argmax = predicted category)
- `action_prob`: `Float32[1]`, sigmoid probability that action is required (threshold 0.5)
No text generation, no decoder, no beam search.
Example:
Input: "Subject: Your order has shipped\n\nBody: Your order #12345 is on its way..."
Output: category_probs β TRANSACTION (0.96), action_prob β 0.08 (NO_ACTION)
## Intended Use
- Primary: On-device email triage in multilingual mobile apps (iOS/Android)
- Runtime: ONNX Runtime React Native
- Use case: Prioritizing the inbox, filtering noise, and surfacing actionable emails for English- and French-speaking users
## Model Details
| Attribute | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-small (ModernBERT family, multilingual) |
| Parameters | ~140M |
| Architecture | mmBERT encoder (RoPE + GeGLU + alternating local/global attention) + dual classification heads |
| Pooling | Mean pooling over last_hidden_state (masked) |
| ONNX Size | 135.3 MB (INT8 dynamic per-channel quantized) |
| Max Sequence | 384 tokens |
| Tokenizer | Gemma 2 BPE (256K vocab) |
| Opset | 14 |
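The masked mean pooling noted above averages token embeddings while ignoring padding positions. A minimal PyTorch sketch of that pooling step, for reference (the function name is illustrative; the actual head code lives in `ml/scripts/train_classifier.py`):

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings, ignoring padded positions.

    last_hidden_state: [batch, seq, hidden]
    attention_mask:    [batch, seq], 1 for real tokens, 0 for padding
    """
    # Broadcast the mask over the hidden dimension so padding zeroes out.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, S, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [B, H]
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # [B, 1]
    return summed / counts
```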
## Performance
Evaluated on a held-out 321-sample multilingual test set (md5 37bbd4c08ae9338890ad5cc2656b5e6f):
| Metric | Score |
|---|---|
| Category Accuracy (overall) | 93.15% |
| Category Accuracy (English) | 93.49% |
| Category Accuracy (French) | 92.45% |
| Action Accuracy (overall) | 92.83% |
| Argmax-match vs PyTorch FP32 | 96.57% |
| Quantization | INT8 dynamic, per-channel (4× compression vs FP32) |
### Per-class Recall
| Class | n | Recall |
|---|---|---|
| ALERT | 60 | 85.00% |
| NEWSLETTER | 50 | 94.00% |
| PERSONAL | 50 | 98.00% |
| PROMOTIONAL | 60 | 95.00% |
| SOCIAL | 41 | 87.80% |
| TRANSACTION | 60 | 98.33% |
Multilingual stability: FR cat_acc holds within 1.04pp of EN, indicating the cross-lingual encoder generalizes evenly across the trained languages.
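For reference, the argmax-match figure above compares argmax agreement between the INT8 ONNX model and the FP32 PyTorch model on the same tokenized inputs. A minimal sketch of such a check, assuming a loaded `fp32_model` module (hypothetical name) and pre-tokenized int64 arrays; this is not the actual evaluation script:

```python
import onnxruntime as ort
import torch

session = ort.InferenceSession("model.onnx")

def argmax_match_rate(fp32_model, examples) -> float:
    """Fraction of examples where INT8 ONNX and FP32 PyTorch agree on argmax.

    `examples` yields (input_ids, attention_mask) as int64 numpy arrays [1, S].
    """
    matches, total = 0, 0
    for input_ids, attention_mask in examples:
        onnx_probs = session.run(
            ["category_probs"],
            {"input_ids": input_ids, "attention_mask": attention_mask},
        )[0]
        with torch.no_grad():
            torch_probs, _ = fp32_model(
                torch.from_numpy(input_ids), torch.from_numpy(attention_mask)
            )
        matches += int(onnx_probs.argmax() == torch_probs.argmax().item())
        total += 1
    return matches / total
```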
## Training Data
- Source: Personal Gmail inboxes (anonymized)
- Languages: English, French (joint-stratified balance by category × language)
- Labeling: Human-annotated with category + action flag
- Class weights: gentle (max 1.317, min 0.891); joint-stratified weighting prevents class collapse under quantization
- Input format: `Subject: ...\n\nBody: ...` (no instruction prefix); see the sketch after this list
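A minimal sketch of building that input string and tokenizing to the 384-token limit (the `format_email` helper is illustrative, not part of the released code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")

def format_email(subject: str, body: str) -> str:
    # Same "Subject: ...\n\nBody: ..." layout used at training time; no prefix.
    return f"Subject: {subject}\n\nBody: {body}"

text = format_email("Your order has shipped", "Your order #12345 is on its way...")
inputs = tokenizer(text, return_tensors="np", max_length=384, truncation=True)
print(inputs["input_ids"].shape)  # (1, S) with S <= 384
```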
## How to Use

### ONNX Runtime (React Native)
```javascript
import { InferenceSession, Tensor } from 'onnxruntime-react-native';

const session = await InferenceSession.create('model.onnx');

// Two inputs only; no token_type_ids (mmBERT does not use segment embeddings)
const outputs = await session.run({
  input_ids: inputIdsTensor,            // int64[1, S], S <= 384
  attention_mask: attentionMaskTensor,  // int64[1, S]
});

const categoryProbs = outputs.category_probs.data; // Float32[6]
const actionProb = outputs.action_prob.data[0];    // Float32

const CATEGORIES = ['ALERT', 'NEWSLETTER', 'PERSONAL', 'PROMOTIONAL', 'SOCIAL', 'TRANSACTION'];
const category = CATEGORIES[categoryProbs.indexOf(Math.max(...categoryProbs))];
const actionRequired = actionProb > 0.5;
```
### Python (PyTorch)
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")
# Load DualHeadClassifier (mean pooling + dual heads) from checkpoint
# (see ml/scripts/train_classifier.py for the head architecture)

text = "Subject: Meeting tomorrow\n\nBody: Can we reschedule to 3pm?"
inputs = tokenizer(text, return_tensors="pt", max_length=384, truncation=True)

with torch.no_grad():
    cat_probs, act_prob = model(inputs["input_ids"], inputs["attention_mask"])

categories = ["ALERT", "NEWSLETTER", "PERSONAL", "PROMOTIONAL", "SOCIAL", "TRANSACTION"]
category = categories[cat_probs.argmax()]
action = act_prob.item() > 0.5
```
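The exported ONNX model can also be exercised from Python with onnxruntime, using the same two input names and output names as the React Native example. A sketch (the int64 cast matters because the graph expects int64 IDs):

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")
session = ort.InferenceSession("model.onnx")

text = "Subject: Meeting tomorrow\n\nBody: Can we reschedule to 3pm?"
enc = tokenizer(text, return_tensors="np", max_length=384, truncation=True)

outputs = session.run(
    ["category_probs", "action_prob"],
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)
category_probs = outputs[0].ravel()          # Float32[6]
action_prob = float(outputs[1].ravel()[0])   # Float32

categories = ["ALERT", "NEWSLETTER", "PERSONAL", "PROMOTIONAL", "SOCIAL", "TRANSACTION"]
print(categories[int(category_probs.argmax())], action_prob > 0.5)
```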
## Special Tokens (Gemma 2 BPE)

mmBERT uses Gemma 2's tokenizer; IDs differ from XLM-R/MiniLM:
| Token | ID |
|---|---|
| `<pad>` | 0 |
| `<eos>` | 1 |
| `<bos>` | 2 |
| `<unk>` | 3 |
Sequence wrap: `[<bos>, ...content..., <eos>]`. There is no `[CLS]` / `[SEP]`; that's XLM-R territory.
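If you need to confirm these IDs at runtime (for example, when porting the tokenizer), they can be read straight from the tokenizer. A quick sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ippoboi/mmbert-s-email-classifier")

for token in ["<pad>", "<eos>", "<bos>", "<unk>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
# Expected: <pad> 0, <eos> 1, <bos> 2, <unk> 3

# Per the sequence wrap above, encoded IDs should start with 2 (<bos>)
# and end with 1 (<eos>).
print(tokenizer("hi")["input_ids"])
```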
## Files

| File | Size | Description |
|---|---|---|
| `model.onnx` | 135.3 MB | INT8 quantized ONNX model |
| `tokenizer.json` | 32.8 MB | Gemma 2 BPE tokenizer (256K vocab) |
| `tokenizer_config.json` | 45 KB | Tokenizer configuration |
| `special_tokens_map.json` | 1 KB | Special token IDs |
| `export_metadata.json` | 1 KB | Provenance + canonical metrics |
## Architecture

```
Input → mmBERT Encoder (22 layers, 384 hidden, RoPE + GeGLU)
              ↓
Mean-pool over last_hidden_state (masked by attention_mask)
              ↓
       ┌──────┴──────┐
       ↓             ↓
 Category Head   Action Head
 Linear(384→6)   Linear(384→1)
       ↓             ↓
    softmax        sigmoid
```
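A hedged PyTorch sketch of a dual-head module matching this diagram (the real `DualHeadClassifier` lives in `ml/scripts/train_classifier.py`; the class and variable names here are illustrative):

```python
import torch.nn as nn
from transformers import AutoModel

class DualHeadSketch(nn.Module):
    """Mean-pooled mmBERT encoder with category + action heads."""

    def __init__(self, base: str = "jhu-clsp/mmBERT-small"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size  # 384 for mmBERT-small
        self.category_head = nn.Linear(hidden, 6)
        self.action_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Masked mean pooling, as in the diagram above.
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        category_probs = self.category_head(pooled).softmax(dim=-1)   # [B, 6]
        action_prob = self.action_head(pooled).sigmoid().squeeze(-1)  # [B]
        return category_probs, action_prob
```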
## Compared to Previous Model (MiniLM v6)

| | MiniLM v6 | mmBERT-small v8 (this) |
|---|---|---|
| Base | XLM-R MiniLM-L12 | mmBERT-small (ModernBERT family) |
| Schema | 5 classes | 6 classes (PROMOTIONAL added) |
| Languages tracked | English-dominant | English + French (balanced) |
| Vocab | 250K SentencePiece Unigram | 256K Gemma 2 BPE |
| Max sequence | 256 | 384 |
| Inputs | input_ids, attention_mask, token_type_ids | input_ids, attention_mask |
| Bundle size | 113 MB | 135 MB |
| Cat acc | ~92.0% (5-class) | 93.15% (6-class, harder schema) |
| FR cat acc | not tracked | 92.45% |
## Limitations
- Trained on English and French only; may not generalize to other languages despite the multilingual base
- Personal/consumer email patterns; may not generalize to enterprise/corporate email
- The PROMOTIONAL / NEWSLETTER decision boundary is genuinely fuzzy; expect some legitimate disagreements with human raters at this boundary
- Action accuracy lost ~2.2pp under INT8 quantization vs FP32 (95.02% → 92.83%); the action head is a single Linear(384→1) and is more quantization-sensitive than the 6-class softmax
- 256K vocab tokenizer is oversized for an EN+FR-only deployment but is required to use the pretrained mmBERT weights without retraining
## Notes on Quantization

This model uses INT8 dynamic per-channel quantization via `onnxruntime.quantization.quantize_dynamic(weight_type=QInt8, per_channel=True)` on a clean FP32 ONNX export. Two export-time fixes were required to preserve accuracy through quantization on this architecture:

- The export wrapper drops the unused `token_type_ids` path (mmBERT has no segment embeddings; an unused embedding lookup contaminates the shared 256K-vocab embedding's scale calibration).
- `model.encoder.config.reference_compile = False` is set before `torch.onnx.export(..., dynamo=False)` so the legacy tracer can trace through the `tok_embeddings` lookup directly instead of the `compiled_embeddings` `torch.compile` shim.
With both fixes, INT8 dynamic per-channel quantization preserves the FP32 cat_acc almost exactly. Static INT8 calibration (percentile/entropy) was attempted but proved empirically infeasible on ModernBERT-style graphs in ORT 1.24: peak RAM blows up while calibrating the wide-FFN and full-attention activation tensors.
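Put together, a hedged sketch of the export-plus-quantization pipeline described above (file names and the `model` variable are illustrative; the actual export script is not reproduced here):

```python
import torch
from onnxruntime.quantization import QuantType, quantize_dynamic

# model: the trained dual-head classifier, wrapped to take only the two used inputs
model.encoder.config.reference_compile = False  # trace tok_embeddings directly
model.eval()

dummy_ids = torch.zeros(1, 384, dtype=torch.int64)
dummy_mask = torch.ones(1, 384, dtype=torch.int64)

torch.onnx.export(
    model,
    (dummy_ids, dummy_mask),  # no token_type_ids input
    "model_fp32.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["category_probs", "action_prob"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
    dynamo=False,  # legacy tracer, per the fix above
)

quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```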
## License
Apache 2.0