| --- |
| license: apache-2.0 |
| datasets: |
| - ealvaradob/phishing-dataset |
| - ucberkeley-dlab/measuring-hate-speech |
| - cardiffnlp/tweet_eval |
| - lmsys/toxic-chat |
| - tasksource/jigsaw_toxicity |
| language: |
| - en |
| base_model: |
| - answerdotai/ModernBERT-large |
| pipeline_tag: text-classification |
| tags: |
| - moderation |
| - safety |
| --- |
| |
| # Horizon 1 |
|
|
A larger and more modern variant of [Constellation-One](https://huggingface.co/DominicTWHV/Constellation-One-Text) built for [Cockatoo](https://cockatoo.dev/), fine-tuned from `answerdotai/ModernBERT-large`.
|
|
This model is licensed under the `Apache-2.0` license.
|
|
| **Note:** |
|
|
`lmsys/toxic-chat` is licensed under `CC-BY-NC-4.0`, so this model cannot legally be used for commercial purposes.
|
|
| ## Hardware: |
|
|
This model was fine-tuned on two NVIDIA A40s with a per-device batch size of 32 and gradient accumulation of 2, for an effective batch size of `32 * 2 * 2 = 128`.
|
|
Fine-tuned on a dataset of roughly 232k entries aggregated from:
|
|
- `ealvaradob/phishing-dataset`
- `ucberkeley-dlab/measuring-hate-speech`
- `cardiffnlp/tweet_eval`
- `lmsys/toxic-chat`
- `tasksource/jigsaw_toxicity`
|
|
| ## Software |
|
|
Training was executed on the [Cockatoo_ML_Training](https://github.com/DominicTWHV/Cockatoo_ML_Training) server. Metrics are publicly visible at [Cockatoo.dev](https://cockatoo.dev/ml-training.html).
|
|
Techniques: `or` label merging, with `merge_labels` on conflicts. There has been **no** manual intervention in data sanitization before or after merging.
|
|
Asymmetric loss parameters:
|
|
| ```csv |
| γ- = 3.5 |
| γ+ = 0.5 |
| clipping = 0.05 |
| ``` |
|
|
| Optimizer: |
|
|
| ```csv |
| adamw |
| |
| betas = (0.9, 0.999) |
| eps = 1e-8 |
| momentum = 0.9 |
| ``` |
|
|
LLRD (layer-wise learning rate decay):
|
|
| ```csv |
| decay_factor = 0.98 |
| ``` |
|
|
| Hyperparameters: |
|
|
| ```csv |
| epoch = 3 |
| |
| batch_size = 32 |
| gradient_accumulation = 2 |
| |
| learning_rate = 5e-5 |
| weight_decay = 0.1 |
| warmup_ratio = 0.1 |
| |
| fp16 = false |
| bf16 = true |
| tf32 = true |
| |
gradient_checkpointing = false |
| gradient_clipping = true |
| gradient_clipping_val = 1.0 |
| |
| attention_implementation = "flash_attention_2" |
| ``` |
|
|
| ## Available Labels: |
|
|
```json
"id2label": {
  "0": "scam",
  "1": "violence",
  "2": "harassment",
  "3": "hate_speech",
  "4": "toxicity",
  "5": "obscenity",
  "6": "genocide"
}
```

`genocide` is a new label compared to Constellation.
|
|
| ## Performance |
|
|
*All evaluation metrics use macro averaging and may deviate slightly from other reported figures due to differences between evaluation runs. Metrics come from a held-out zero-shot evaluation split (not present in the training data).* |
|
|
Horizon 1 achieves very high recall out of the box (0.94 raw) with precision comparable to Constellation (0.566 raw vs. 0.605).
|
|
However, this model really shines once per-label trigger thresholds are tuned:
|
|
| **Default:** |
|
|
| | Category | Threshold | F1-Score | |
| | :--- | :--- | :--- | |
| | scam | 0.5 | 0.8758 | |
| | violence | 0.5 | 0.6891 | |
| | harassment | 0.5 | 0.8279 | |
| | hate_speech | 0.5 | 0.6581 | |
| | toxicity | 0.5 | 0.6430 | |
| | obscenity | 0.5 | 0.6428 | |
| | genocide | 0.5 | 0.5630 | |
| | **Average** | - | **0.7000** | |
| |
|  |
|  |
|  |
| |
| **Tuned:** |
| |
| | Category | Threshold | F1-Score | Delta (vs. default) | |
| | :--- | :--- | :--- | :--- | |
| | scam | 0.7129 | 0.9131 | +0.0373 | |
| | violence | 0.6238 | 0.7252 | +0.0361 | |
| | harassment | 0.6535 | 0.8712 | +0.0433 | |
| | hate_speech | 0.6040 | 0.7082 | +0.0501 | |
| | toxicity | 0.6238 | 0.7371 | +0.0941 | |
| | obscenity | 0.6238 | 0.7309 | +0.0881 | |
| | genocide | 0.6337 | 0.5929 | +0.0299 | |
| | **Average** | - | **0.7541** | **+0.0541** |
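
Per-label thresholds like those in the tuned table can be found with a simple grid sweep that maximizes F1 on a validation split; a hedged sketch (the actual tuning procedure is not documented here):

```python
def best_threshold(probs, labels, grid=None):
    """Sweep a threshold grid and return the (threshold, F1) pair that
    maximizes F1 for one label on (probability, 0/1 target) pairs."""
    grid = grid or [i / 100 for i in range(5, 96)]
    best = (0.5, -1.0)
    for t in grid:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best
```

Run once per label over held-out probabilities, then freeze the resulting thresholds for inference.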
|
|
|  |
|  |
|  |
|
|
| ## Comparison with Constellation One (tuned): |
|
|
| | Metric | Constellation One | Horizon 1 | Delta (H1 - C1) | |
| | --- | --- | --- | --- | |
| | **Loss** | 0.1603 | 0.0245 | **-0.1358** | |
| | **Overall Precision** | 0.6940 | 0.6809 | -0.0131 | |
| | **Overall Recall** | 0.8151 | 0.8554 | **+0.0403** | |
| | **Overall F1** | 0.7475 | 0.7448 | -0.0027 | |
| | Scam Precision | 0.9255 | 0.9330 | **+0.0075** | |
| | Scam Recall | 0.9467 | 0.9009 | -0.0459 | |
| | Scam F1 | 0.9360 | 0.9167 | -0.0194 | |
| | Violence Precision | 0.5141 | 0.6293 | **+0.1152** | |
| | Violence Recall | 0.7191 | 0.8828 | **+0.1637** | |
| | Violence F1 | 0.5995 | 0.7348 | **+0.1353** | |
| | Harassment Precision | 0.8238 | 0.8329 | **+0.0091** | |
| | Harassment Recall | 0.8830 | 0.9240 | **+0.0410** | |
| | Harassment F1 | 0.8524 | 0.8761 | **+0.0237** | |
| | Hate Speech Precision | 0.5607 | 0.5965 | **+0.0358** | |
| | Hate Speech Recall | 0.6960 | 0.8652 | **+0.1692** | |
| | Hate Speech F1 | 0.6211 | 0.7061 | **+0.0850** | |
| | Toxicity Precision | 0.6891 | 0.6946 | **+0.0056** | |
| | Toxicity Recall | 0.8025 | 0.7481 | -0.0544 | |
| | Toxicity F1 | 0.7415 | 0.7204 | -0.0211 | |
| | Obscenity Precision | 0.6507 | 0.6828 | **+0.0321** | |
| | Obscenity Recall | 0.8431 | 0.7160 | -0.1271 | |
| | Obscenity F1 | 0.7345 | 0.6990 | -0.0355 | |
| | Genocide Precision | N/A | 0.3972 | N/A | |
| | Genocide Recall | N/A | 0.9511 | N/A | |
| | Genocide F1 | N/A | 0.5604 | N/A | |
|
|
| > [!NOTE] |
> This model is more "trigger-happy" than Constellation One, although this can be mitigated in production by raising thresholds (current values are optimized for macro F1).
|
|
| A newer version is planned to mitigate this behavior. |
|
|
| ## Resources: |
|
|
| Training/Inferencing server: https://github.com/DominicTWHV/Cockatoo_ML_Training/ |
|
|
| Training Metrics: https://cockatoo.dev/ml-training.html |
|
|
## Datasets Used & Citations
|
|
| | Dataset | License | Link | |
| | --- | --- | --- | |
| | **Phishing Dataset** | MIT | [Hugging Face](https://huggingface.co/datasets/ealvaradob/phishing-dataset) | |
| | **Measuring Hate Speech** | CC-BY-4.0 | [Hugging Face](https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech) | |
| **Tweet Eval (SemEval-2019)** | See citation below | [Hugging Face](https://huggingface.co/datasets/cardiffnlp/tweet_eval) | |
| | **Toxic Chat** | CC-BY-NC-4.0 | [Hugging Face](https://huggingface.co/datasets/lmsys/toxic-chat) | |
| | **Jigsaw Toxicity** | Apache-2.0 | [Hugging Face](https://huggingface.co/datasets/tasksource/jigsaw_toxicity) | |
|
|
| --- |
|
|
| ### Citation: ucberkeley-dlab/measuring-hate-speech |
|
|
| ```bibtex |
| @article{kennedy2020constructing, |
| title={Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application}, |
| author={Kennedy, Chris J and Bacon, Geoff and Sahn, Alexander and von Vacano, Claudia}, |
| journal={arXiv preprint arXiv:2009.10277}, |
| year={2020} |
| } |
| ``` |
|
|
| ### Citation: cardiffnlp/tweet_eval |
| |
| ```bibtex |
| @inproceedings{basile-etal-2019-semeval, |
| title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter", |
| author = "Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela", |
| booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation", |
| year = "2019", |
| address = "Minneapolis, Minnesota, USA", |
| publisher = "Association for Computational Linguistics", |
| url = "https://www.aclweb.org/anthology/S19-2007", |
| doi = "10.18653/v1/S19-2007", |
| pages = "54--63" |
| } |
| |
| ``` |
| |
| ### Citation: lmsys/toxic-chat |
| |
| ```bibtex |
| @misc{lin2023toxicchat, |
| title={ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation}, |
| author={Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang}, |
| year={2023}, |
| eprint={2310.17389}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL} |
| } |
| ``` |