Update README.md
README.md CHANGED

@@ -12,7 +12,7 @@ license: apache-2.0
 
 ## Overview
 
-A robust …
+A robust, performant classifier that excels at **detecting refusals, moralizations, disclaimers, and unsolicited advice** in LLM responses.
 
 ### Model Details
 
@@ -38,7 +38,7 @@ Majority vote from multiple refusal classifiers and LLM-as-a-judge were employed
 <div align="left">
 <img src="figures/plot.png" width="100%" alt="Plot"/>
 </div>
-Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several …
+Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput benchmarked with sequence length 512 and batch size 16 on 1x NVIDIA RTX Pro 6000.
 
 `alpha_model` is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.
 
@@ -46,7 +46,7 @@ The training and test sets have similar distributions, but several factors sugge
 the dataset is relatively large and exactly balanced, training was limited to a single epoch, and [Minos-v1](https://huggingface.co/NousResearch/Minos-v1) — one of the strongest refusal classifiers available — achieves similarly strong, balanced performance on the same test set.
 A more detailed breakdown is as follows:
 
-
+| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
 | ----------------------------------------- | ---- | ---- | --- | ---- | -------- | --------- | ------ | ------ |
 | [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) | 2782 | 118 | 103 | 2797 | 0.9619 | 0.9643 | 0.9593 | 0.9618 |
 | [natong19/moralization_classifier](https://huggingface.co/natong19/moralization_classifier) | 1888 | 1012 | 146 | 2754 | 0.8003 | 0.9282 | 0.651 | 0.7653 |