	---
	language: en
	license: mit
	pipeline_tag: text-classification
	library_name: transformers
	tags:
	- spam
	- ham
	- email
	- tinybert
	- enron
	- text-classification
	model-index:
	- name: prancyFox/tiny-bert-enron-spam
	  results:
	  - task:
	      type: text-classification
	      name: Spam / Ham Classification
	    dataset:
	      name: Enron (processed CSV)
	      type: enron_email
	      split: test
	    metrics:
	    - name: F1 (macro)
	      type: f1
	      value: 0.7666
	    - name: ROC-AUC
	      type: roc_auc
	      value: 0.9977
	    - name: Precision (spam)
	      type: precision
	      value: 0.9954
	    - name: Recall (spam)
	      type: recall
	      value: 0.5632
	    - name: Precision (ham)
	      type: precision
	      value: 0.6875
	    - name: Recall (ham)
	      type: recall
	      value: 0.9973
	base_model: huawei-noah/TinyBERT_General_4L_312D
	---

	# TinyBERT Spam Classifier (Enron)

	A compact TinyBERT (4-layer, 312 hidden) model fine-tuned to classify email text as spam or ham.
	Trained on an Enron-derived CSV with light email-specific cleaning (e.g., removing quoted lines and base64-like blobs).
	Optimized for low false positives by default; adjust the decision threshold if you want higher spam recall.

	> Labels: `ham` (0) and `spam` (1)

	---

	## ✨ Quick Start

	```python
	from transformers import pipeline

	clf = pipeline(
	    "text-classification",
	    model="prancyFox/tiny-bert-enron-spam",
	    truncation=True,  # recommended for long emails
	)

	clf("Congratulations! You won a FREE iPhone. Click here now!")
	# [{'label': 'spam', 'score': 0.98}]
	```

	Batch inference

	```python
	texts = [
	    "Meeting moved to 3pm, see agenda attached.",
	    "FREE gift card!!! Act now!",
	]
	preds = clf(texts, truncation=True)
	```

	---

	## 🔎 Intended Use & Limitations

	Intended use

	* Classifying email bodies (and optionally subject+body) as spam vs ham.
	* Low-latency scenarios where a small model is preferred.

	Out of scope / Limitations

	* Accuracy may drop on non-English email content.
	* Long threads with heavy quoting/footers can dilute signal (use truncation + cleaning).
	* Trained on Enron-style corporate emails; consumer emails may differ (consider further fine-tuning).

	---

	## 🧰 How We Preprocessed the Data

	Light normalization aimed at keeping semantic content:

	* Remove long base64-like blobs.
	* Drop quoted lines starting with `>` or `|`.
	* Optional: concatenate `Subject + "\n" + Message` when available.
	* Collapse repeated whitespace.

	(You can replicate similar cleaning in your serving pipeline for alignment.)
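
For serving-side alignment, the steps above could be approximated with a small helper like the one below. This is a sketch, not the cleaning code used for training: the 80-character cutoff for "base64-like" runs, the function name `clean_email`, and collapsing newlines together with other whitespace are all assumptions.

```python
import re

# Assumption: a "base64-like blob" is an unbroken run of 80+ base64 characters.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{80,}")
# Quoted lines start with ">" or "|", optionally after leading whitespace.
QUOTED_LINE = re.compile(r"^\s*[>|]")


def clean_email(message, subject=None):
    """Apply light normalization to one email and return a single-line string."""
    # Optional step: prepend the subject when one is available.
    text = f"{subject}\n{message}" if subject else message
    # Drop quoted reply lines first, because the check is per line.
    kept = [line for line in text.splitlines() if not QUOTED_LINE.match(line)]
    text = "\n".join(kept)
    # Replace long base64-like runs with a space.
    text = BASE64_BLOB.sub(" ", text)
    # Collapse every run of whitespace (newlines included) into one space.
    return re.sub(r"\s+", " ", text).strip()


raw = "Hi team,\n> old reply\n| older reply\nSee   you   at 3pm."
print(clean_email(raw, subject="Agenda"))  # Agenda Hi team, See you at 3pm.
```

Note that the blob pattern would also strip any ordinary token longer than 80 characters, so the cutoff is worth checking against your own mail before relying on it.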

	---

	## 🏋️ Training Details

	* Base model: `huawei-noah/TinyBERT_General_4L_312D`
	* Task: Binary text classification (`ham`=0, `spam`=1)
	* Tokenizer: fast BERT tokenizer (uncased)
	* Max length: 256 tokens
	* Optimizer / LR: AdamW, learning rate `2e-5 – 5e-5` (final run `3e-5`)
	* Batch size: 32
	* Epochs: 4 (early stopping enabled)
	* Warmup: 10%
	* Weight decay: 0.01
	* Loss: Cross-entropy with class weighting (ham/spam balanced from label distribution). Focal loss available in the trainer.
	* Early stopping metric: `eval_f1`
	* Best checkpoint: Selected by evaluation on the validation set.

	> Trainer script: `train/train_tinybert.py` (TinyBERT-compatible, with legacy HF support shims).
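
One common way to derive "balanced" class weights from a label distribution is `n_samples / (n_classes * class_count)`. The sketch below assumes that formula; the trainer script may compute its weights differently, and the function name and toy labels are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_id: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy label column with 6 ham (0) and 2 spam (1); not the real Enron distribution.
weights = balanced_class_weights([0, 0, 0, 0, 0, 0, 1, 1])
print(weights[1])  # 2.0
```

The rarer class receives the larger weight. In PyTorch, weights like these can be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`, ordered by class id.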

	---

	## 📊 Evaluation (Chunked Benchmark Summary)

	The metrics below come from a chunked evaluation pass (used for long emails), in which the model sees up to 512 tokens per chunk, with overlap between chunks. The decision threshold was tuned to minimize false positives.

	### Classification Report

	| Class | Precision | Recall | F1 |
	| ------------: | ---------: | ---------: | ---------: |
	| ham | 0.6875 | 0.9973 | 0.8139 |
	| spam | 0.9954 | 0.5632 | 0.7194 |
	| macro avg | 0.8414 | 0.7802 | 0.7666 |

	* ROC-AUC: 0.9977

	Confusion matrix

	```
	[[16500    45]
	 [ 7500  9671]]
	```
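
The classification report can be re-derived from the confusion matrix, which is a useful sanity check. The sketch below assumes rows are true labels and columns are predictions, both ordered `[ham, spam]`; that layout is not stated in the card, but it is the one that reproduces the reported numbers.

```python
# Assumed layout: rows = true label, columns = predicted label, order [ham, spam].
cm = [[16500, 45],
      [7500, 9671]]


def per_class_metrics(cm, idx):
    """Return (precision, recall, f1) for the class at position idx."""
    tp = cm[idx][idx]
    fp = sum(row[idx] for row in cm) - tp   # predicted as idx, but another class
    fn = sum(cm[idx]) - tp                  # truly idx, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


ham = per_class_metrics(cm, 0)
spam = per_class_metrics(cm, 1)
macro_f1 = (ham[2] + spam[2]) / 2

print([round(v, 4) for v in ham])   # [0.6875, 0.9973, 0.8139]
print([round(v, 4) for v in spam])  # [0.9954, 0.5632, 0.7194]
print(round(macro_f1, 4))           # 0.7666
```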

	Interpretation: The model is conservative (very few false positives on ham). If you need to catch more spam, lower the decision threshold (e.g., from 0.5 → 0.35) or re-train with a spam-skewed class weight / focal loss.
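
Applying a custom threshold means comparing the spam score with your own cutoff instead of taking the top label. The helper below is a sketch that assumes you have scores for both labels as a list of `{"label", "score"}` dicts (for example by requesting all scores from the pipeline with `top_k=None`); the function name and the example scores are hypothetical.

```python
def classify_with_threshold(label_scores, threshold=0.5):
    """Return 'spam' when the spam score reaches the threshold, else 'ham'.

    label_scores: list of {"label": str, "score": float} dicts covering both labels.
    """
    spam_score = next(item["score"] for item in label_scores if item["label"] == "spam")
    return "spam" if spam_score >= threshold else "ham"


# Hypothetical scores for one borderline email (not real model output).
scores = [{"label": "ham", "score": 0.60}, {"label": "spam", "score": 0.40}]
print(classify_with_threshold(scores))                  # ham
print(classify_with_threshold(scores, threshold=0.35))  # spam
```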

	---

	## 🎛️ Threshold & Long-Email Guidance

	* Threshold: Default is 0.5. For higher spam recall, try 0.35–0.45 and evaluate impact on false positives.
	* Long emails: For multi-paragraph threads, consider chunking and aggregating chunk-level spam scores (e.g., max or average). Our reference app uses 512-token chunks with overlap.
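
The chunk-and-aggregate approach for long emails could look roughly like the sketch below. It is not the reference app's code: the 64-token overlap, the function names, and working on a plain list of token ids are assumptions made for illustration.

```python
def chunk_token_ids(token_ids, chunk_size=512, overlap=64):
    """Split token ids into windows of at most chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the email
        start += step
    return chunks


def aggregate_spam_scores(chunk_scores, how="max"):
    """Combine per-chunk spam scores into one email-level score."""
    if how == "max":
        return max(chunk_scores)
    if how == "mean":
        return sum(chunk_scores) / len(chunk_scores)
    raise ValueError("how must be 'max' or 'mean'")


# Stand-in for a tokenized 1000-token email.
chunks = chunk_token_ids(list(range(1000)))
print([len(c) for c in chunks])                       # [512, 512, 104]
print(aggregate_spam_scores([0.25, 0.75, 0.5]))       # 0.75
```

Max aggregation flags an email if any single chunk looks like spam, while the mean is more forgiving of one suspicious passage in an otherwise ordinary thread. If chunks are re-encoded with a BERT tokenizer, leave room for the `[CLS]` and `[SEP]` tokens within the 512-token limit.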

	---

	## 🧪 Reproducibility

	Environment

	* Python 3.10/3.11
	* `transformers >= 4.40`
	* `datasets >= 2.20`
	* `evaluate >= 0.4.2`
	* `torch >= 2.1`

	Training command (example)

	```bash
	python train/train_tinybert.py \
	  --train data/enron.csv \
	  --text_col Message --label_col "Spam/Ham" \
	  --output_dir outputs/tiny-bert-enron-spam \
	  --epochs 4 --batch_size 32 --lr 3e-5 \
	  --max_length 256 --fp16
	```

	Serving (FastAPI example)

	```bash
	python spam_bert.py --serve \
	  --model prancyFox/tiny-bert-enron-spam \
	  --model-cache-dir ./models_cache
	```

	---

	## 📁 Files

	This repo should include:

	* `config.json`
	* `pytorch_model.bin` or `model.safetensors`
	* `tokenizer.json` and `tokenizer_config.json` (or `vocab.txt` etc.)
	* `README.md` (this file)
	* (Optional) `label_mapping.json` with `{"ham": 0, "spam": 1}`
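
If you ship the optional `label_mapping.json`, client code can invert it to turn class ids back into names. A minimal sketch, with the file contents inlined as a string so it runs standalone:

```python
import json

# Contents of the optional label_mapping.json, inlined for illustration.
label2id = json.loads('{"ham": 0, "spam": 1}')

# Invert the mapping so a predicted class id can be turned back into a label name.
id2label = {idx: name for name, idx in label2id.items()}
print(id2label[1])  # spam
```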

	---

	## ⚖️ License

	* Model weights & code: MIT
	* Dataset: Check the original Enron dataset/license terms before redistribution.

	---

	## 🔬 Ethical Considerations & Risks

	* False positives can have operational cost (missed legitimate emails). This model is tuned to minimize them; if you change the threshold, validate carefully.
	* Spam evolves. Periodically re-train with fresh samples to maintain accuracy.
	* Non-English or code-mixed content may degrade performance.

	---

	## 🧩 Citation

	If you use this model, please cite:

	```
	@software{tinybert_enron_spam_2025,
	  title = {TinyBERT Spam Classifier (Enron)},
	  author = {Ing. Daniel Eder},
	  year = {2025},
	  url = {https://huggingface.co/prancyFox/tiny-bert-enron-spam}
	}
	```

	And the TinyBERT paper:

	```
	@article{jiao2020tinybert,
	  title={TinyBERT: Distilling BERT for Natural Language Understanding},
	  author={Jiao, Xiaoqi and Yin, Yichun and others},
	  journal={Findings of EMNLP},
	  year={2020}
	}
	```

	---

	## 🛠 Maintainers

	* Daniel Eder ([daniel@deder.at](mailto:daniel@deder.at?subject=tiny-bert-enron-spam))

	---

	### Notes

	* For a higher-recall variant, fine-tune with `--use_focal_loss` or increase the spam class weight, then re-evaluate thresholds.
	* If you want a PyTorch Lightning or Accelerate training variant, it’s easy to adapt the provided trainer.