natong19 committed
Commit 15c57ee · verified · 1 Parent(s): 6669ff0

Update README.md

Files changed (1): README.md (+3 −3)
README.md CHANGED
@@ -12,7 +12,7 @@ license: apache-2.0
 
 ## Overview
 
-A robust and performant classifier that excels at **detecting refusals, moralizations, disclaimers, unsolicited advice** and the like.
+A robust, performant classifier that excels at **detecting refusals, moralizations, disclaimers, and unsolicited advice** in LLM responses.
 
 ### Model Details
 
@@ -38,7 +38,7 @@ Majority vote from multiple refusal classifiers and LLM-as-a-judge were employed
 <div align="left">
 <img src="figures/plot.png" width="100%" alt="Plot"/>
 </div>
-Inference throughput vs F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several refusal open-source classifiers. Throughput benchmarked with sequence length 512, batch size 16 on 1x NVIDIA RTX Pro 6000.
+Inference throughput vs F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput benchmarked with sequence length 512, batch size 16 on 1x NVIDIA RTX Pro 6000.
 
 `alpha_model` is a earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.
 
@@ -46,7 +46,7 @@ The training and test sets have similar distributions, but several factors sugge
 the dataset is relatively large and exactly balanced, training was limited to a single epoch, and [Minos-v1](https://huggingface.co/NousResearch/Minos-v1) — one of the strongest refusal classifiers available — achieves similarly strong, balanced performance on the same test set.
 A more detailed breakdown is as follows:
 
-| | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
+| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
 | ----------------------------------------- | ---- | ---- | --- | ---- | -------- | --------- | ------ | ------ |
 | [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) | 2782 | 118 | 103 | 2797 | 0.9619 | 0.9643 | 0.9593 | 0.9618 |
 | [natong19/moralization_classifier](https://huggingface.co/natong19/moralization_classifier) | 1888 | 1012 | 146 | 2754 | 0.8003 | 0.9282 | 0.651 | 0.7653 |
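The accuracy, precision, recall, and F1 columns in the table follow directly from the raw confusion-matrix counts (TP, FN, FP, TN). A quick sanity check in Python (a standalone sketch, not part of the model repo; the counts are copied from the table above):

```python
def metrics(tp, fn, fp, tn):
    """Derive standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts taken from the table above: (TP, FN, FP, TN)
results = {
    "NousResearch/Minos-v1": (2782, 118, 103, 2797),
    "natong19/moralization_classifier": (1888, 1012, 146, 2754),
}

for name, counts in results.items():
    acc, p, r, f1 = metrics(*counts)
    print(f"{name}: acc={acc:.4f} P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Rounded to four decimal places, the output reproduces every figure in the table, which also confirms the test set size of 5,800 (2,900 refusals + 2,900 non-refusals) per row.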