---
base_model:
- cservan/malbert-base-cased-128k
license: apache-2.0
language:
- en
- fr
pipeline_tag: text-classification
inference: false
tags:
- classification
- emails
- multilingual
- albert
- onnx
- mobile
- int8
widget:
- text: "Subject: Your order has shipped\n\nBody: Your order #12345 is on its way and will arrive by Monday."
example_title: Transaction (EN)
- text: "Subject: Réunion demain\n\nBody: Salut, peut-on reporter notre réunion de 14h à 15h ? Dis-moi."
example_title: Personal (FR)
- text: "Subject: Weekly Newsletter\n\nBody: Check out our latest deals! 50% off everything this weekend."
example_title: Newsletter (EN)
- text: "Subject: Alerte de sécurité\n\nBody: Une nouvelle connexion à votre compte depuis Paris, France. Vérifiez que c'est bien vous."
example_title: Alert (FR)
---
# Email Classifier (mALBERT ONNX)
A dual-head **mALBERT** classifier that jointly predicts an email's category and whether it requires action, optimized for on-device inference with ONNX Runtime. Bilingual (English + French), ~24M parameters, 50.7 MB after INT8 quantization.
## Model Description
Classifies emails into 5 categories and predicts whether the recipient should take action:
| Category | Description |
|----------|-------------|
| **PERSONAL** | Direct 1:1 human communication, calendar invites from real people, direct messages. Excludes platform notifications. |
| **NEWSLETTER** | Marketing, promotions, subscribed content. Includes weekly digests, year-in-review recaps, marketing-flavored surveys with rewards. |
| **TRANSACTION** | Money or order events: receipts, charges, refunds, shipping confirmations with order/booking IDs, payslips, money-transfer notifications. |
| **ALERT** | Account, security, or infrastructure messages: password resets, login alerts, CI failures, booking-bound expiry, satisfaction surveys without rewards, named-product update notifications. |
| **SOCIAL** | Platform activity *between people*: post mentions, comment notifications, PR review requests from real users. Excludes automated platform mail (those are ALERT). |
The action flag is `true` only when the email requires a concrete response tied to something the user owns or initiated: paying to keep an existing booking, verifying a code the user requested, accepting or declining a calendar invite, replying to a 1:1 message, verifying a security event, or following up on a support ticket.
### Output Format
Single forward pass producing two tensors:
- `category_probs`: Float32[5] — softmax probabilities per category (argmax = predicted category)
- `action_prob`: Float32[1] — sigmoid probability of action required (threshold 0.5)
No text generation, no decoder, no beam search.
**Example:**
```
Input: "Subject: Your order has shipped\n\nBody: Your order #12345 is on its way..."
Output: category_probs → TRANSACTION (0.94), action_prob → 0.08 (NO_ACTION)
```
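Turning the two output tensors into a final decision is a small post-processing step. A minimal sketch in plain NumPy (the `decode` helper name is mine; the label order matches the Python examples in this card):

```python
import numpy as np

CATEGORIES = ["ALERT", "NEWSLETTER", "PERSONAL", "SOCIAL", "TRANSACTION"]

def decode(category_probs: np.ndarray, action_prob: float, threshold: float = 0.5):
    """Map the model's two output tensors to (category label, needs_action)."""
    category = CATEGORIES[int(np.argmax(category_probs))]
    return category, action_prob > threshold

# Probabilities from the shipping-confirmation example above
probs = np.array([0.01, 0.02, 0.01, 0.02, 0.94], dtype=np.float32)
print(decode(probs, 0.08))  # → ('TRANSACTION', False)
```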
## Intended Use
- **Primary:** On-device email triage in mobile apps (iOS/Android)
- **Runtime:** ONNX Runtime React Native
- **Use case:** Prioritizing inbox, filtering noise, surfacing actionable emails
## Model Details
| Attribute | Value |
|-----------|-------|
| Base Model | `cservan/malbert-base-cased-128k` |
| Parameters | ~24M |
| Architecture | ALBERT encoder (parameter-shared, 1 physical block × 12 virtual layers) + dual classification heads |
| Pooling | `pooler_output` (SOP-pretrained linear + tanh) |
| ONNX Size | 50.7 MB (INT8 quantized, 1.8× compression from FP32) |
| Max Sequence | 384 tokens |
| Tokenizer | SentencePiece Unigram (128K vocab, French-aware) |
| Hidden Size | 768 |
| Special Tokens | `[CLS]=2`, `[SEP]=3`, `<pad>=0`, `<unk>=1` |
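To illustrate how the special-token IDs from the table above are laid out in a model input, here is a hand-rolled sketch (in practice the shipped tokenizer handles this; the `build_inputs` helper and the example piece IDs are hypothetical):

```python
CLS, SEP, PAD = 2, 3, 0  # IDs from the Special Tokens row above

def build_inputs(piece_ids: list[int], max_len: int = 384):
    """Wrap SentencePiece piece IDs with [CLS]/[SEP], pad to max_len."""
    ids = [CLS] + piece_ids[: max_len - 2] + [SEP]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [PAD] * pad, mask + [0] * pad

# Short max_len just to make the padding visible:
ids, mask = build_inputs([17, 254, 9981], max_len=8)
print(ids)   # [2, 17, 254, 9981, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```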
## Performance
Test set metrics (250 emails, balanced across categories, EN+FR):
| Metric | Score |
|--------|-------|
| **Category Accuracy** | **86.0%** (single seed) / **88.4%** (2-seed soft-vote ensemble) |
| **Action Accuracy** | **84.8%** |
| Quantization | INT8 dynamic; PyTorch↔ONNX argmax parity on 20/20 checked samples |
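The ensemble figure comes from soft voting: averaging the per-category probabilities of two differently seeded models before taking the argmax. A minimal sketch (the probability vectors below are made up to show a case where the seeds disagree):

```python
import numpy as np

def soft_vote(prob_sets: list[np.ndarray]) -> int:
    """Average per-category probabilities across seeds, then argmax."""
    return int(np.mean(np.stack(prob_sets), axis=0).argmax())

# Hypothetical outputs from two seeds that disagree on the argmax:
seed_a = np.array([0.40, 0.45, 0.05, 0.05, 0.05])  # alone: picks index 1
seed_b = np.array([0.48, 0.22, 0.10, 0.10, 0.10])  # alone: picks index 0
print(soft_vote([seed_a, seed_b]))  # averaged [0.44, 0.335, ...] → index 0
```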
### Per-language breakdown (single seed)
| | English | French |
|---|---|---|
| Category accuracy | 85.4% | **87.0%** |
| Action accuracy | 89.2% | 77.2% |
Notably, French slightly outperforms English on category accuracy, so the multilingual signal is roughly symmetric. Action accuracy retains an English advantage (~12 points), reflecting heavier representation of English action patterns in the training data.
### Per-class F1 (single seed)
| Class | Precision | Recall | F1 |
|---|---|---|---|
| ALERT | 0.885 | 0.900 | 0.893 |
| NEWSLETTER | 0.771 | 0.900 | 0.831 |
| PERSONAL | 0.917 | 0.892 | 0.904 |
| SOCIAL | 0.862 | 0.758 | 0.807 |
| TRANSACTION | 0.907 | 0.817 | 0.860 |
## Training Data
- **Source:** Personal Gmail inboxes (anonymized)
- **Languages:** English, French
- **Size:** 2,005 train / 251 val / 250 test (balanced)
- **Labeling:** Human-annotated with category + action flag, prompt-assisted with v7 labeling rules (precise tie-breakers for booking-bound deadlines, marketing recaps with reward language, CI/security automation, curated personalized outreach, satisfaction surveys with/without incentives)
- **Input format:** `Subject: ...\n\nBody: ...` (no instruction prefix)
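Inference inputs should match this training format exactly. A trivial helper (the `format_email` name is mine):

```python
def format_email(subject: str, body: str) -> str:
    """Serialize an email into the training input format: Subject, blank line, Body."""
    return f"Subject: {subject}\n\nBody: {body}"

print(format_email("Your order has shipped", "Your order #12345 is on its way."))
```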
## How to Use
### ONNX Runtime (React Native)
```typescript
import { InferenceSession, Tensor } from 'onnxruntime-react-native';

const session = await InferenceSession.create('model.onnx');

// Tokenize on the JS side (SentencePiece, 128K vocab), then wrap the IDs
// as int64 Tensors before running the session:
const outputs = await session.run({
  input_ids: inputIdsTensor,           // int64[1, seq_len]
  attention_mask: attentionMaskTensor, // int64[1, seq_len]
  token_type_ids: tokenTypeIdsTensor,  // int64[1, seq_len], all zeros
});

const categoryProbs = outputs.category_probs.data; // Float32[5]
const actionProb = outputs.action_prob.data[0];    // Float32
```
### Python (PyTorch reference)
```python
from transformers import AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("Ippoboi/malbert-email-classifier")
# Load DualHeadClassifier weights from a training checkpoint and call
# model.eval() before inference (see ml/scripts/train_classifier.py)
text = "Subject: Réunion demain\n\nBody: Peut-on reporter à 15h ?"
inputs = tokenizer(text, return_tensors="pt", max_length=384, truncation=True)
with torch.no_grad():
cat_logits, act_logits = model(inputs["input_ids"], inputs["attention_mask"])
category = ["ALERT", "NEWSLETTER", "PERSONAL", "SOCIAL", "TRANSACTION"][cat_logits.argmax()]
action = torch.sigmoid(act_logits).item() > 0.5
```
### ONNX Runtime (Python)
```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
session = ort.InferenceSession("model.onnx")
tokenizer = AutoTokenizer.from_pretrained("Ippoboi/malbert-email-classifier")
inputs = tokenizer(
"Subject: Your order has shipped\n\nBody: ...",
return_tensors="np",
max_length=384,
truncation=True,
padding="max_length",
)
cat_probs, act_prob = session.run(
["category_probs", "action_prob"],
{
"input_ids": inputs["input_ids"].astype(np.int64),
"attention_mask": inputs["attention_mask"].astype(np.int64),
"token_type_ids": np.zeros_like(inputs["input_ids"], dtype=np.int64),
},
)
categories = ["ALERT", "NEWSLETTER", "PERSONAL", "SOCIAL", "TRANSACTION"]
print(categories[cat_probs[0].argmax()], "action:", act_prob[0] > 0.5)
```
## Files
| File | Size | Description |
|------|------|-------------|
| `model.onnx` | 50.7 MB | INT8 quantized ONNX model |
| `tokenizer.json` | 8.2 MB | Fast tokenizer (SentencePiece Unigram, 128K vocab) |
| `spiece.model` | 2.3 MB | Raw SentencePiece vocab (optional, for Python reload) |
| `tokenizer_config.json` | 1.4 KB | Tokenizer config |
| `special_tokens_map.json` | 970 B | Special token names → IDs |
## Architecture
```
Input → ALBERT Encoder (12 virtual layers × 1 shared block, hidden=768)
              → pooler_output (Linear + tanh on [CLS])
                ┌────────┴────────┐
                ↓                 ↓
         Category Head       Action Head
         Linear(768→5)       Linear(768→1)
                ↓                 ↓
             softmax           sigmoid
                ↓                 ↓
        category_probs       action_prob
```
ALBERT shares one physical transformer block across all 12 virtual layers. This gives ~24M total parameters (vs ~110M for an equivalent BERT-base) at the cost of per-layer representational capacity.
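The `DualHeadClassifier` referenced in `ml/scripts/train_classifier.py` is not published as a `transformers` class, so the following is a plausible reconstruction of the dual-head design, using a tiny random-weight `AlbertConfig` purely to demonstrate shapes (not the real checkpoint):

```python
import torch
from torch import nn
from transformers import AlbertConfig, AlbertModel

class DualHeadClassifier(nn.Module):
    """Shared ALBERT encoder with a 5-way category head and a binary action head."""

    def __init__(self, encoder: AlbertModel, num_categories: int = 5):
        super().__init__()
        self.encoder = encoder
        hidden = encoder.config.hidden_size
        self.category_head = nn.Linear(hidden, num_categories)
        self.action_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        # pooler_output = Linear + tanh over the [CLS] position
        pooled = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        return self.category_head(pooled), self.action_head(pooled)

# Tiny random-weight config just to check wiring and output shapes:
cfg = AlbertConfig(vocab_size=100, embedding_size=16, hidden_size=32,
                   num_hidden_layers=2, num_attention_heads=4,
                   intermediate_size=64)
model = DualHeadClassifier(AlbertModel(cfg)).eval()
ids = torch.randint(0, 100, (1, 8))
mask = torch.ones(1, 8, dtype=torch.long)
with torch.no_grad():
    cat_logits, act_logits = model(ids, mask)
print(cat_logits.shape, act_logits.shape)  # torch.Size([1, 5]) torch.Size([1, 1])
```

Both heads read the same pooled representation, so one forward pass serves both predictions; softmax and sigmoid are applied downstream (they are baked into the exported ONNX graph, which emits probabilities directly).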
## Compared to Previous Model (MiniLM v1)
| | MiniLM v1 | mALBERT v3 (this) |
|---|---|---|
| Base architecture | XLM-R encoder, independent layers | ALBERT, parameter-shared |
| Parameters | ~117M | ~24M |
| ONNX size | 113 MB | **50.7 MB** |
| Max sequence | 256 | **384** |
| Vocab size | 250K | 128K |
| Category accuracy | 92.0% | 86.0% / 88.4% (ensemble) |
| Action accuracy | 82.8% | **84.8%** |
| FR cat parity | EN-favored | **EN/FR symmetric** |
mALBERT v3 trades raw category accuracy for **less than half the on-device footprint**, **wider context** (384 vs 256 tokens), and **balanced multilingual performance**. Action accuracy is higher; category accuracy is lower in absolute terms but the language gap closes.
## Limitations
- Trained on personal email patterns; may not generalize to enterprise/corporate email styles
- Classification accuracy depends on text quality (plain text preferred over heavy HTML)
- French action accuracy lags English by ~12 points; the v7 labeling prompt is EN-leaning in its action examples
- SOCIAL is the weakest category (F1 0.81): it is the smallest training class (268 examples) and shares features with NEWSLETTER on platform mass emails
- 384-token cap may truncate long emails; ~17% of training emails exceeded this limit
- ALBERT parameter sharing limits representational depth; for harder boundaries, a non-shared encoder (mDeBERTa-v3-base, MiniLM-L12) would have more capacity at higher inference cost
## License
Apache 2.0