---
language: en
license: mit
pipeline_tag: text-classification
library_name: transformers
tags:
- spam
- ham
- email
- tinybert
- enron
- text-classification
model-index:
- name: prancyFox/tiny-bert-enron-spam
  results:
  - task:
      type: text-classification
      name: Spam / Ham Classification
    dataset:
      name: Enron (processed CSV)
      type: enron_email
      split: test
    metrics:
    - name: F1 (macro)
      type: f1
      value: 0.7666
    - name: ROC-AUC
      type: roc_auc
      value: 0.9977
    - name: Precision (spam)
      type: precision
      value: 0.9954
    - name: Recall (spam)
      type: recall
      value: 0.5632
    - name: Precision (ham)
      type: precision
      value: 0.6875
    - name: Recall (ham)
      type: recall
      value: 0.9973
base_model: huawei-noah/TinyBERT_General_4L_312D
---
# TinyBERT Spam Classifier (Enron)
A compact **TinyBERT (4-layer, 312 hidden)** model fine-tuned to classify **email text** as **spam** or **ham**.
Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).
Optimized for **low false positives** by default; adjust the decision threshold if you want higher spam recall.
> Labels: `ham` (0) and `spam` (1)
---
## ✨ Quick Start
```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="prancyFox/tiny-bert-enron-spam",
    truncation=True,  # recommended for long emails
)
clf("Congratulations! You won a FREE iPhone. Click here now!")
# e.g. [{'label': 'spam', 'score': 0.98}]
```
**Batch inference**
```python
texts = [
    "Meeting moved to 3pm, see agenda attached.",
    "FREE gift card!!! Act now!",
]
preds = clf(texts, truncation=True)
```
---
## 🔎 Intended Use & Limitations
**Intended use**
* Classifying **email bodies (and optionally subject+body)** as spam vs ham.
* Low-latency scenarios where a small model is preferred.
**Out of scope / Limitations**
* Non-English email content may reduce accuracy.
* Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
* Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).
---
## 🧰 How We Preprocessed the Data
Light normalization aimed at keeping semantic content:
* Remove long base64-like blobs.
* Drop quoted lines starting with `>` or `|`.
* Optional: concatenate `Subject + "\n" + Message` when available.
* Collapse repeated whitespace.
(You can replicate similar cleaning in your serving pipeline for alignment.)
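
The steps above can be sketched as a single function. This is a minimal illustration, not the original preprocessing code: the 80-character cutoff for a "base64-like blob", the choice to collapse newlines together with other whitespace, and the names `clean_email`, `BASE64_BLOB`, `QUOTED_LINE`, and `WHITESPACE` are all assumptions made for the example. Applying the same function at training and serving time keeps the two input distributions aligned.

```python
import re
from typing import Optional

# Assumption: a "base64-like blob" is an unbroken run of 80+ base64 characters.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{80,}")
# Quoted lines start with '>' or '|' (optionally after leading whitespace).
QUOTED_LINE = re.compile(r"^\s*[>|]")
WHITESPACE = re.compile(r"\s+")


def clean_email(message: str, subject: Optional[str] = None) -> str:
    """Apply the light normalization steps listed above (hypothetical helper)."""
    # Optionally prepend the subject, separated by a newline.
    text = f"{subject}\n{message}" if subject else message
    # Drop quoted reply lines.
    lines = [ln for ln in text.splitlines() if not QUOTED_LINE.match(ln)]
    text = "\n".join(lines)
    # Remove long base64-like blobs.
    text = BASE64_BLOB.sub(" ", text)
    # Collapse repeated whitespace (including newlines) into single spaces.
    return WHITESPACE.sub(" ", text).strip()


raw = "Hello   team,\n> old reply\n| older reply\nSee you  at 3pm."
print(clean_email(raw, subject="Agenda"))
```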
---
## 🏋️ Training Details
* **Base model:** `huawei-noah/TinyBERT_General_4L_312D`
* **Task:** Binary text classification (`ham`=0, `spam`=1)
* **Tokenizer:** fast BERT tokenizer (uncased)
* **Max length:** 256 tokens
* **Optimizer / LR:** AdamW, learning rate `2e-5 – 5e-5` (final run `3e-5`)
* **Batch size:** 32
* **Epochs:** 4 (early stopping enabled)
* **Warmup:** 10%
* **Weight decay:** 0.01
* **Loss:** Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
* **Early stopping metric:** `eval_f1`
* **Best checkpoint:** Saved using evaluation on validation set.
> Trainer script: `train/train_tinybert.py` (TinyBERT-compatible, with legacy HF support shims).
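
To illustrate the class weighting mentioned under **Loss**, here is a small sketch of the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. Whether the trainer script uses exactly this formula is not stated here, so treat it as an assumption; `balanced_class_weights` is a hypothetical helper name. In PyTorch, weights like these are typically passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes=2):
    """Return one weight per class: n_samples / (num_classes * class_count).

    Assumes every class index in range(num_classes) occurs at least once.
    """
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]


# Toy label distribution: three ham (0) examples for every spam (1) example.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)  # the majority class gets the smaller weight
```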
---
## 📊 Evaluation (Chunked Benchmark Summary)
Metrics below reflect a **chunked evaluation** pass (used for long emails), where the model sees up to 512 tokens per chunk with overlap. Threshold tuned to minimize false positives:
### Classification Report
| Class | Precision | Recall | F1 |
| ------------: | ---------: | ---------: | ---------: |
| ham | 0.6875 | 0.9973 | 0.8139 |
| spam | 0.9954 | 0.5632 | 0.7194 |
| **macro avg** | **0.8414** | **0.7802** | **0.7666** |
* **ROC-AUC:** 0.9977
**Confusion matrix**
```
[[16500 45]
[ 7500 9671]]
```
**Interpretation:** The model is conservative (very few false positives on ham). If you need to catch more spam, **lower the decision threshold** (e.g., from 0.5 → 0.35) or re-train with a spam-skewed class weight / focal loss.
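
As a sanity check, the figures in the classification report can be recomputed from the confusion matrix. The sketch below reads the matrix with rows as true labels and columns as predictions, in the order `[ham, spam]`; that orientation is inferred from the reported metrics, and `per_class_metrics` is a hypothetical helper name.

```python
# Confusion matrix from the report: rows are true labels, columns are
# predictions, in the order [ham, spam].
cm = [[16500, 45],
      [7500, 9671]]


def per_class_metrics(cm, k):
    """Return (precision, recall, F1) for class index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


ham = per_class_metrics(cm, 0)
spam = per_class_metrics(cm, 1)
macro_f1 = (ham[2] + spam[2]) / 2

print([round(v, 4) for v in ham])   # [0.6875, 0.9973, 0.8139]
print([round(v, 4) for v in spam])  # [0.9954, 0.5632, 0.7194]
print(round(macro_f1, 4))           # 0.7666
```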
---
## 🎛️ Threshold & Long-Email Guidance
* **Threshold:** Default is 0.5. For higher spam recall, try **0.35–0.45** and evaluate impact on false positives.
* **Long emails:** For multi-paragraph threads, consider **chunking** and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.
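
The two points above can be combined in a short sketch: split a long token sequence into overlapping windows, score each window, aggregate the per-chunk spam probabilities, then apply a custom threshold. This is an illustration rather than the reference app's code. The overlap of 64 tokens and the names `chunk_token_ids`, `aggregate_spam_score`, and `classify` are assumptions, and the per-chunk scores are made-up numbers. In practice the window size should leave room for the tokenizer's special tokens, and per-label probabilities can usually be obtained from the `transformers` pipeline by passing `top_k=None`.

```python
def chunk_token_ids(token_ids, chunk_size=512, overlap=64):
    """Split a token-id list into windows of chunk_size with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if not token_ids:
        return []
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks


def aggregate_spam_score(chunk_scores, how="max"):
    """Combine per-chunk spam probabilities into one email-level score."""
    if not chunk_scores:
        raise ValueError("need at least one chunk score")
    if how == "max":
        return max(chunk_scores)
    if how == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError(f"unknown aggregation: {how}")


def classify(chunk_scores, threshold=0.5, how="max"):
    """Label an email as spam when the aggregated score reaches the threshold."""
    score = aggregate_spam_score(chunk_scores, how)
    return ("spam" if score >= threshold else "ham"), score


# Toy example: 10 "tokens", windows of 4 with an overlap of 1.
print(chunk_token_ids(list(range(10)), chunk_size=4, overlap=1))

# Made-up per-chunk spam probabilities for one long email.
scores = [0.10, 0.42, 0.20]
print(classify(scores, threshold=0.5))   # stays ham at the default threshold
print(classify(scores, threshold=0.35))  # flips to spam at a lower threshold
```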
---
## 🧪 Reproducibility
**Environment**
* Python 3.10/3.11
* `transformers >= 4.40`
* `datasets >= 2.20`
* `evaluate >= 0.4.2`
* `torch >= 2.1`
**Training command (example)**
```bash
python train/train_tinybert.py \
  --train data/enron.csv \
  --text_col Message --label_col "Spam/Ham" \
  --output_dir outputs/tiny-bert-enron-spam \
  --epochs 4 --batch_size 32 --lr 3e-5 \
  --max_length 256 --fp16
```
**Serving (FastAPI example)**
```bash
python spam_bert.py --serve \
  --model prancyFox/tiny-bert-enron-spam \
  --model-cache-dir ./models_cache
```
---
## 📁 Files
This repo should include:
* `config.json`
* `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json` and `tokenizer_config.json` (or `vocab.txt` etc.)
* `README.md` (this file)
* (Optional) `label_mapping.json` with `{"ham": 0, "spam": 1}`
---
## ⚖️ License
* **Model weights & code**: MIT
* **Dataset**: Check the original Enron dataset/license terms before redistribution.
---
## 🔬 Ethical Considerations & Risks
* False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
* Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
* Non-English or code-mixed content may degrade performance.
---
## 🧩 Citation
If you use this model, please cite:
```
@software{tinybert_enron_spam_2025,
  title  = {TinyBERT Spam Classifier (Enron)},
  author = {Ing. Daniel Eder},
  year   = {2025},
  url    = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
}
```
And the TinyBERT paper:
```
@article{jiao2020tinybert,
  title   = {TinyBERT: Distilling BERT for Natural Language Understanding},
  author  = {Jiao, Xiaoqi and Yin, Yichun and others},
  journal = {Findings of EMNLP},
  year    = {2020}
}
```
---
## 🛠 Maintainers
* **Daniel Eder** ([daniel@deder.at](mailto:daniel@deder.at?subject=tiny-bert-enron-spam))
---
### Notes
* For a higher-recall variant, fine-tune with `--use_focal_loss` or increase the spam class weight, then re-evaluate thresholds.
* If you want a **PyTorch Lightning** or **Accelerate** training variant, the provided trainer is easy to adapt.