Update README.md
README.md CHANGED

@@ -12,7 +12,7 @@ license: apache-2.0
 
 ## Overview
 
-A robust …
+A robust, performant classifier that excels at **detecting refusals, moralizations, disclaimers, and unsolicited advice** in LLM responses.
 
 ### Model Details
 
@@ -38,7 +38,7 @@ Majority vote from multiple refusal classifiers and LLM-as-a-judge were employed
 <div align="left">
 <img src="figures/plot.png" width="100%" alt="Plot"/>
 </div>
-Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several …
+Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput benchmarked with sequence length 512 and batch size 16 on 1x NVIDIA RTX Pro 6000.
 
 `alpha_model` is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.
 
@@ -46,7 +46,7 @@ The training and test sets have similar distributions, but several factors sugge
 the dataset is relatively large and exactly balanced, training was limited to a single epoch, and [Minos-v1](https://huggingface.co/NousResearch/Minos-v1) — one of the strongest refusal classifiers available — achieves similarly strong, balanced performance on the same test set.
 A more detailed breakdown is as follows:
 
-
+| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
 | ----------------------------------------- | ---- | ---- | --- | ---- | -------- | --------- | ------ | ------ |
 | [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) | 2782 | 118 | 103 | 2797 | 0.9619 | 0.9643 | 0.9593 | 0.9618 |
 | [natong19/moralization_classifier](https://huggingface.co/natong19/moralization_classifier) | 1888 | 1012 | 146 | 2754 | 0.8003 | 0.9282 | 0.651 | 0.7653 |