---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam
- ham
- email
- tinybert
- enron
- text-classification
model-index:
- name: prancyFox/tiny-bert-enron-spam
  results:
  - task:
      type: text-classification
      name: Spam / Ham Classification
    dataset:
      name: Enron (processed CSV)
      type: enron_email
      split: test
    metrics:
    - name: F1 (macro)
      type: f1
      value: 0.7666
    - name: ROC-AUC
      type: roc_auc
      value: 0.9977
    - name: Precision (spam)
      type: precision
      value: 0.9954
    - name: Recall (spam)
      type: recall
      value: 0.5632
    - name: Precision (ham)
      type: precision
      value: 0.6875
    - name: Recall (ham)
      type: recall
      value: 0.9973
base_model: huawei-noah/TinyBERT_General_4L_312D
---

# TinyBERT Spam Classifier (Enron)

A compact **TinyBERT (4-layer, 312 hidden)** model fine-tuned to classify **email text** as **spam** or **ham**.
Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).
Optimized for **low false positives** by default; adjust the decision threshold if you want higher spam recall.

> Labels: `ham` (0) and `spam` (1)

---

## Quick Start

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True,  # recommended for long emails
)

clf("Congratulations! You won a FREE iPhone. Click here now!")
# Example output (the exact score may vary): [{'label': 'spam', 'score': 0.98}]
```

**Batch inference**

```python
# Reuses `clf` from the snippet above.
texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)  # one {'label', 'score'} dict per input text
```

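
By default the pipeline reports only the top label, which corresponds to a fixed decision rule. To apply a custom threshold you need the spam probability itself. The sketch below assumes you request scores for every class with `clf(text, top_k=None)`; the helper name `label_with_threshold` and the example scores are hypothetical, and the snippet runs on a stand-in list so it does not need the model.

```python
def label_with_threshold(label_scores, threshold=0.5, spam_label="spam"):
    """Pick a label from per-class scores using a custom spam threshold.

    `label_scores` is a list of {'label': str, 'score': float} dicts, one per
    class, which is the shape `clf(text, top_k=None)` returns for one text.
    """
    by_label = {d["label"]: d["score"] for d in label_scores}
    p_spam = by_label[spam_label]
    return ("spam" if p_spam >= threshold else "ham"), p_spam


# Stand-in for `clf("some email text", top_k=None)`; these scores are made up.
example = [{"label": "ham", "score": 0.6}, {"label": "spam", "score": 0.4}]

print(label_with_threshold(example))                  # ('ham', 0.4)
print(label_with_threshold(example, threshold=0.35))  # ('spam', 0.4)
```
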
---

## Intended Use & Limitations

**Intended use**

* Classifying **email bodies (and optionally subject+body)** as spam vs ham.
* Low-latency scenarios where a small model is preferred.

**Out of scope / Limitations**

* Non-English email content may reduce accuracy.
* Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
* Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).

---

## How We Preprocessed the Data

Light normalization aimed at keeping semantic content:

* Remove long base64-like blobs.
* Drop quoted lines starting with `>` or `|`.
* Optional: concatenate `Subject + "\n" + Message` when available.
* Collapse repeated whitespace.

(You can replicate similar cleaning in your serving pipeline for alignment.)

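
A minimal sketch of how these cleaning steps could be reproduced at serving time. It is not the original preprocessing code: the function name `clean_email`, the 60-character cutoff for a "base64-like blob", and collapsing newlines together with other whitespace are all assumptions made for illustration.

```python
import re

# Assumption: a "base64-like blob" is an unbroken run of 60+ base64 characters.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{60,}")
# Quoted reply lines start with ">" or "|" (optionally indented).
QUOTED_LINE = re.compile(r"^\s*[>|]")


def clean_email(message, subject=None):
    """Approximate the cleaning steps listed above (illustrative only)."""
    # Optionally prepend the subject line to the body.
    text = f"{subject}\n{message}" if subject else message
    # Drop quoted lines.
    kept = [line for line in text.splitlines() if not QUOTED_LINE.match(line)]
    text = "\n".join(kept)
    # Replace long base64-like blobs with a space.
    text = BASE64_BLOB.sub(" ", text)
    # Collapse every run of whitespace (including newlines) to a single space.
    return re.sub(r"\s+", " ", text).strip()


raw = "> old reply\nHello   world\n| quoted\n" + "A" * 100 + " bye"
print(clean_email(raw, subject="Hi"))  # Hi Hello world bye
```
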
---

## Training Details

* **Base model:** `huawei-noah/TinyBERT_General_4L_312D`
* **Task:** Binary text classification (`ham`=0, `spam`=1)
* **Tokenizer:** fast BERT tokenizer (uncased)
* **Max length:** 256 tokens
* **Optimizer / LR:** AdamW, learning rate `2e-5 – 5e-5` (final run `3e-5`)
* **Batch size:** 32
* **Epochs:** 4 (early stopping enabled)
* **Warmup:** 10%
* **Weight decay:** 0.01
* **Loss:** Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
* **Early stopping metric:** `eval_f1`
* **Best checkpoint:** Selected by evaluation on the validation set.

> Trainer script: `train/train_tinybert.py` (TinyBERT-compatible, with legacy HF support shims).

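
The card does not spell out how the class weights are derived from the label distribution. One common choice, shown here purely as an assumption, is the "balanced" heuristic `n_samples / (n_classes * count_c)`. The function name `balanced_class_weights` is hypothetical; the resulting list could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes=2):
    """Return one weight per class: n_samples / (num_classes * count_c).

    Rare classes get weights above 1, frequent classes below 1.
    Assumes every class id in range(num_classes) occurs at least once.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    return [n_samples / (num_classes * counts[c]) for c in range(num_classes)]


# Toy label list (0 = ham, 1 = spam): three ham emails, one spam email.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights[1])  # 2.0, so the rarer spam class is up-weighted
```
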
---

## Evaluation (Chunked Benchmark Summary)

Metrics below reflect a **chunked evaluation** pass (used for long emails), where the model sees up to 512 tokens per chunk with overlap. Threshold tuned to minimize false positives:

### Classification Report

| Class         |  Precision |     Recall |         F1 |
| ------------: | ---------: | ---------: | ---------: |
| ham           | 0.6875     | 0.9973     | 0.8139     |
| spam          | 0.9954     | 0.5632     | 0.7194     |
| **macro avg** | **0.8414** | **0.7802** | **0.7666** |

* **ROC-AUC:** 0.9977

**Confusion matrix** (rows: true class, columns: predicted class; order `ham`, `spam`)

```
[[16500    45]
 [ 7500  9671]]
```

**Interpretation:** The model is conservative (very few false positives on ham). If you need to catch more spam, **lower the decision threshold** (e.g., from 0.5 → 0.35) or re-train with a spam-skewed class weight / focal loss.

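
The per-class figures in the classification report follow directly from the confusion matrix. The short sketch below recomputes them, assuming rows are true classes and columns are predicted classes in the order `ham`, `spam`; the function name `report` is illustrative.

```python
def report(cm, names=("ham", "spam")):
    """Per-class precision/recall/F1 and macro F1 from a confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    out = {}
    for i, name in enumerate(names):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column total
        actual = sum(cm[i])                    # row total
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        out[name] = {"precision": precision, "recall": recall, "f1": f1}
    out["macro_f1"] = sum(out[n]["f1"] for n in names) / len(names)
    return out


cm = [[16500, 45], [7500, 9671]]
r = report(cm)
print(round(r["spam"]["recall"], 4))  # 0.5632
print(round(r["macro_f1"], 4))        # 0.7666
```

Of the 17,171 spam emails, 7,500 are labelled ham, which is where the 0.5632 spam recall comes from; only 45 of the 16,545 ham emails are flagged as spam.
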
---

## Threshold & Long-Email Guidance

* **Threshold:** Default is 0.5. For higher spam recall, try **0.35–0.45** and evaluate impact on false positives.
* **Long emails:** For multi-paragraph threads, consider **chunking** and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.

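
A rough sketch of the chunk-and-aggregate idea, not the reference app's implementation. It assumes the email is already tokenized into a list of ids and that `score_chunk` is a hypothetical callable returning the spam probability for one chunk. The overlap of 64 tokens is an arbitrary illustrative value, and in practice `size` should leave room for the `[CLS]` and `[SEP]` special tokens.

```python
def sliding_windows(token_ids, size=512, overlap=64):
    """Split a list of token ids into windows of `size` sharing `overlap` ids."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, max(len(token_ids), 1), step):
        windows.append(token_ids[start:start + size])
        if start + size >= len(token_ids):
            break  # this window already reaches the end of the email
    return windows


def classify_long(token_ids, score_chunk, threshold=0.5, agg="max",
                  size=512, overlap=64):
    """Score every window, aggregate with max or mean, then apply the threshold."""
    scores = [score_chunk(w) for w in sliding_windows(token_ids, size, overlap)]
    spam_score = max(scores) if agg == "max" else sum(scores) / len(scores)
    return ("spam" if spam_score >= threshold else "ham"), spam_score


# Toy scorer standing in for the model: a chunk is "spammy" if it contains id 7.
def toy_scorer(chunk):
    return 0.9 if 7 in chunk else 0.1


ids = list(range(10))
print(len(sliding_windows(ids, size=4, overlap=2)))             # 4
print(classify_long(ids, toy_scorer, size=4, overlap=2))        # ('spam', 0.9)
print(classify_long(ids, toy_scorer, threshold=0.6, agg="mean",
                    size=4, overlap=2)[0])                      # ham
```

Max aggregation flags an email as soon as any single chunk looks like spam, which favours recall; averaging is more conservative and fits the low-false-positive default better.
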
---

## Reproducibility

**Environment**

* Python 3.10/3.11
* `transformers >= 4.40`
* `datasets >= 2.20`
* `evaluate >= 0.4.2`
* `torch >= 2.1`

**Training command (example)**

```bash
python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16
```

**Serving (FastAPI example)**

```bash
python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache
```

---

## Files

This repo should include:

* `config.json`
* `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json` and `tokenizer_config.json` (or `vocab.txt` etc.)
* `README.md` (this file)
* (Optional) `label_mapping.json` with `{"ham": 0, "spam": 1}`

---

## License

* **Model weights & code**: MIT
* **Dataset**: Check the original Enron dataset/license terms before redistribution.

---

## Ethical Considerations & Risks

* False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
* Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
* Non-English or code-mixed content may degrade performance.

---

## Citation

If you use this model, please cite:

```
@software{tinybert_enron_spam_2025,
  title  = {TinyBERT Spam Classifier (Enron)},
  author = {Ing. Daniel Eder},
  year   = {2025},
  url    = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}
```

And the TinyBERT paper:

```
@article{jiao2020tinybert,
  title={TinyBERT: Distilling BERT for Natural Language Understanding},
  author={Jiao, Xiaoqi and Yin, Yichun and others},
  journal={Findings of EMNLP},
  year={2020}
}
```

---

## Maintainers

* **Daniel Eder** ([daniel@deder.at](mailto:daniel@deder.at?subject=tiny-bert-enron-spam))

---

### Notes

* For a higher-recall variant, fine-tune with `--use_focal_loss` or increase the spam class weight, then re-evaluate thresholds.
* If you want a **PyTorch Lightning** or **Accelerate** training variant, it's easy to adapt the provided trainer.