| --- |
| license: apache-2.0 |
| datasets: |
| - ealvaradob/phishing-dataset |
| - ucberkeley-dlab/measuring-hate-speech |
| - cardiffnlp/tweet_eval |
| - lmsys/toxic-chat |
| - tasksource/jigsaw_toxicity |
| language: |
| - en |
| base_model: |
| - answerdotai/ModernBERT-large |
| pipeline_tag: text-classification |
| tags: |
| - moderation |
| - safety |
| --- |
| |
| # Horizon 1 |
|
|
A larger and more modern variant of [Constellation-One](https://huggingface.co/DominicTWHV/Constellation-One-Text) built for [Cockatoo](https://cockatoo.dev/), fine-tuned from `answerdotai/ModernBERT-large`.
|
|
This model is licensed under the `Apache-2.0` license.
|
|
| **Note:** |
|
|
`lmsys/toxic-chat` is licensed under `CC-BY-NC-4.0`, so this model cannot legally be used for commercial purposes.
|
|
| ## Hardware: |
|
|
This model was fine-tuned on two NVIDIA A40s with a per-device batch size of 32 and gradient accumulation of 2, for an effective batch size of `32 * 2 * 2 = 128`.
|
|
Fine-tuned on a dataset of roughly 232k entries aggregated from:
|
|
- `ealvaradob/phishing-dataset`
- `ucberkeley-dlab/measuring-hate-speech`
- `cardiffnlp/tweet_eval`
- `lmsys/toxic-chat`
- `tasksource/jigsaw_toxicity`
|
|
| ## Software |
|
|
Training was executed on the [Cockatoo_ML_Training](https://github.com/DominicTWHV/Cockatoo_ML_Training) server. Metrics are publicly visible at [Cockatoo.dev](https://cockatoo.dev/ml-training.html).
|
|
Techniques: `or` label merging, with `merge_labels` on conflicts. There has been **no** manual intervention in data sanitization before or after merging.
|
|
Asymmetric loss parameters:
|
|
| ```csv |
| γ- = 3.5 |
| γ+ = 0.5 |
| clipping = 0.05 |
| ``` |
|
|
| Optimizer: |
|
|
| ```csv |
| adamw |
| |
| betas = (0.9, 0.999) |
| eps = 1e-8 |
| momentum = 0.9 |
| ``` |
|
|
LLRD (layer-wise learning rate decay):
|
|
| ```csv |
| decay_factor = 0.98 |
| ``` |
|
|
| Hyperparameters: |
|
|
| ```csv |
| epoch = 3 |
| |
| batch_size = 32 |
| gradient_accumulation = 2 |
| |
| learning_rate = 5e-5 |
| weight_decay = 0.1 |
| warmup_ratio = 0.1 |
| |
| fp16 = false |
| bf16 = true |
| tf32 = true |
| |
gradient_checkpointing = false |
| gradient_clipping = true |
| gradient_clipping_val = 1.0 |
| |
| attention_implementation = "flash_attention_2" |
| ``` |
|
|
| ## Available Labels: |
|
|
```json
"id2label": {
  "0": "scam",
  "1": "violence",
  "2": "harassment",
  "3": "hate_speech",
  "4": "toxicity",
  "5": "obscenity",
  "6": "genocide"
}
```

`genocide` is a new label compared to Constellation.
|
|
| ## Performance |
|
|
*All evaluation metrics use macro averaging and may deviate slightly from other reported figures due to differences between evaluation runs. Metrics come from a held-out zero-shot evaluation split (not present in the training data).* |
|
|
Horizon 1 achieves very high recall out of the box (0.94 raw) with precision comparable to Constellation (0.566 raw vs. 0.605).
|
|
However, this model really shines once per-label trigger thresholds are tuned:
|
|
| **Default:** |
|
|
| | Category | Threshold | F1-Score | |
| | :--- | :--- | :--- | |
| | scam | 0.5 | 0.8758 | |
| | violence | 0.5 | 0.6891 | |
| | harassment | 0.5 | 0.8279 | |
| | hate_speech | 0.5 | 0.6581 | |
| | toxicity | 0.5 | 0.6430 | |
| | obscenity | 0.5 | 0.6428 | |
| | genocide | 0.5 | 0.5630 | |
| | **Average** | - | **0.7000** | |
| |
|  |
|  |
|  |
| |
| **Tuned:** |
| |
| | Category | Threshold | F1-Score | Delta (vs. default) | |
| | :--- | :--- | :--- | :--- | |
| | scam | 0.7129 | 0.9131 | +0.0373 | |
| | violence | 0.6238 | 0.7252 | +0.0361 | |
| | harassment | 0.6535 | 0.8712 | +0.0433 | |
| | hate_speech | 0.6040 | 0.7082 | +0.0501 | |
| | toxicity | 0.6238 | 0.7371 | +0.0941 | |
| | obscenity | 0.6238 | 0.7309 | +0.0881 | |
| | genocide | 0.6337 | 0.5929 | +0.0299 | |
| | **Average** | - | **0.7541** | **+0.0541** |
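
Per-label thresholds like those in the tuned table can be found with a simple grid sweep that maximizes F1 on a validation split; a hedged sketch (the actual tuning procedure is not documented here):

```python
def best_threshold(probs, labels, grid=None):
    """Sweep a threshold grid and return the (threshold, F1) pair that
    maximizes F1 for one label on (probability, 0/1 target) pairs."""
    grid = grid or [i / 100 for i in range(5, 96)]
    best = (0.5, -1.0)
    for t in grid:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best
```

Run once per label over held-out probabilities, then freeze the resulting thresholds for inference.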
|
|
|  |
|  |
|  |
|
|
| ## Comparison with Constellation One (tuned): |
|
|
| | Metric | Constellation One | Horizon 1 | Delta (H1 - C1) | |
| | --- | --- | --- | --- | |
| | **Loss** | 0.1603 | 0.0245 | **-0.1358** | |
| | **Overall Precision** | 0.6940 | 0.6809 | -0.0131 | |
| | **Overall Recall** | 0.8151 | 0.8554 | **+0.0403** | |
| | **Overall F1** | 0.7475 | 0.7448 | -0.0027 | |
| | Scam Precision | 0.9255 | 0.9330 | **+0.0075** | |
| | Scam Recall | 0.9467 | 0.9009 | -0.0459 | |
| | Scam F1 | 0.9360 | 0.9167 | -0.0194 | |
| | Violence Precision | 0.5141 | 0.6293 | **+0.1152** | |
| | Violence Recall | 0.7191 | 0.8828 | **+0.1637** | |
| | Violence F1 | 0.5995 | 0.7348 | **+0.1353** | |
| | Harassment Precision | 0.8238 | 0.8329 | **+0.0091** | |
| | Harassment Recall | 0.8830 | 0.9240 | **+0.0410** | |
| | Harassment F1 | 0.8524 | 0.8761 | **+0.0237** | |
| | Hate Speech Precision | 0.5607 | 0.5965 | **+0.0358** | |
| | Hate Speech Recall | 0.6960 | 0.8652 | **+0.1692** | |
| | Hate Speech F1 | 0.6211 | 0.7061 | **+0.0850** | |
| | Toxicity Precision | 0.6891 | 0.6946 | **+0.0056** | |
| | Toxicity Recall | 0.8025 | 0.7481 | -0.0544 | |
| | Toxicity F1 | 0.7415 | 0.7204 | -0.0211 | |
| | Obscenity Precision | 0.6507 | 0.6828 | **+0.0321** | |
| | Obscenity Recall | 0.8431 | 0.7160 | -0.1271 | |
| | Obscenity F1 | 0.7345 | 0.6990 | -0.0355 | |
| | Genocide Precision | N/A | 0.3972 | N/A | |
| | Genocide Recall | N/A | 0.9511 | N/A | |
| | Genocide F1 | N/A | 0.5604 | N/A | |
|
|
| > [!NOTE] |
> This model is more "trigger-happy" than Constellation One, although this can be mitigated in production by raising thresholds (current values are optimized for macro F1).
|
|
| A newer version is planned to mitigate this behavior. |
|
|
| ## Resources: |
|
|
| Training/Inferencing server: https://github.com/DominicTWHV/Cockatoo_ML_Training/ |
|
|
| Training Metrics: https://cockatoo.dev/ml-training.html |
|
|
## Datasets Used & Citations
|
|
| | Dataset | License | Link | |
| | --- | --- | --- | |
| | **Phishing Dataset** | MIT | [Hugging Face](https://huggingface.co/datasets/ealvaradob/phishing-dataset) | |
| | **Measuring Hate Speech** | CC-BY-4.0 | [Hugging Face](https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech) | |
| **Tweet Eval (SemEval-2019)** | See citation below | [Hugging Face](https://huggingface.co/datasets/cardiffnlp/tweet_eval) | |
| | **Toxic Chat** | CC-BY-NC-4.0 | [Hugging Face](https://huggingface.co/datasets/lmsys/toxic-chat) | |
| | **Jigsaw Toxicity** | Apache-2.0 | [Hugging Face](https://huggingface.co/datasets/tasksource/jigsaw_toxicity) | |
|
|
| --- |
|
|
| ### Citation: ucberkeley-dlab/measuring-hate-speech |
|
|
| ```bibtex |
| @article{kennedy2020constructing, |
| title={Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application}, |
| author={Kennedy, Chris J and Bacon, Geoff and Sahn, Alexander and von Vacano, Claudia}, |
| journal={arXiv preprint arXiv:2009.10277}, |
| year={2020} |
| } |
| ``` |
|
|
| ### Citation: cardiffnlp/tweet_eval |
| |
| ```bibtex |
| @inproceedings{basile-etal-2019-semeval, |
| title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter", |
| author = "Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela", |
| booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation", |
| year = "2019", |
| address = "Minneapolis, Minnesota, USA", |
| publisher = "Association for Computational Linguistics", |
| url = "https://www.aclweb.org/anthology/S19-2007", |
| doi = "10.18653/v1/S19-2007", |
| pages = "54--63" |
| } |
| |
| ``` |
| |
| ### Citation: lmsys/toxic-chat |
| |
| ```bibtex |
| @misc{lin2023toxicchat, |
| title={ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation}, |
| author={Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang}, |
| year={2023}, |
| eprint={2310.17389}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL} |
| } |
| ``` |