### Training Details

Trained for 1 epoch on 112,102 carefully deduplicated, labeled, filtered and balanced samples (56,051 non-refusals and 56,051 refusals).

Most of the samples were sourced from:

- [natong19/lmsys-chat-1m-filtered](https://huggingface.co/datasets/natong19/lmsys-chat-1m-filtered)

A majority vote from multiple refusal classifiers and LLM-as-a-judge was employed.
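The deduplicate-then-balance preparation described above can be sketched as follows. This is a minimal illustration, not the actual pipeline; the function name, the `"text"`/`"label"` fields, and the normalization rule are all hypothetical:

```python
from collections import defaultdict

def dedup_and_balance(samples):
    """Drop near-duplicates (by whitespace/case-normalized text), then
    truncate each class to the size of the smallest one so the labels
    are exactly balanced.

    `samples` is a list of dicts with hypothetical keys "text" and
    "label" ("refusal" / "non-refusal").
    """
    seen = set()
    by_label = defaultdict(list)
    for s in samples:
        # Normalize whitespace and case before checking for duplicates
        key = " ".join(s["text"].split()).lower()
        if key in seen:
            continue
        seen.add(key)
        by_label[s["label"]].append(s)
    # Exact balance: keep the same count for every label
    n = min(len(v) for v in by_label.values())
    return [s for label in sorted(by_label) for s in by_label[label][:n]]
```

Real curation would use stronger near-duplicate detection (e.g. embedding or MinHash similarity) rather than exact normalized matches, but the dedup-then-downsample order shown here is what yields an exactly balanced corpus.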
<div align="left">
<img src="figures/plot.png" width="100%" alt="Plot"/>
</div>

Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput was benchmarked with sequence length 512 and batch size 16 on 1x NVIDIA RTX Pro 6000.
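As a rough illustration of how a samples/second figure like the one plotted above is obtained, here is a generic timing loop. `classify_batch` is a hypothetical stand-in for a real model forward pass, and the warmup/iteration counts are arbitrary; on GPU you would also call `torch.cuda.synchronize()` before reading the clock so queued kernels are included in the measurement:

```python
import time

def measure_throughput(classify_batch, batch, n_iters=50, warmup=5):
    """Return samples/second for a batched classifier call.

    Warmup iterations are run first and excluded from timing so that
    one-time costs (compilation, cache population) don't skew the result.
    """
    for _ in range(warmup):
        classify_batch(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        classify_batch(batch)
    elapsed = time.perf_counter() - start
    return len(batch) * n_iters / elapsed
```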
`alpha_model` is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.

The training and test sets have similar distributions, but several factors suggest against overfitting:

- the dataset is relatively large and exactly balanced
- training was run for only a single epoch
- train and validation losses are similar
- [Minos-v1](https://huggingface.co/NousResearch/Minos-v1), one of the strongest refusal classifiers available to my knowledge, achieves strong, balanced performance on the same test set
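The derived columns in the table below (Accuracy, Precision, Recall, F1) follow directly from the confusion counts, treating refusals as the positive class. A minimal sketch of the computation:

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # of predicted refusals, how many were real
    recall = tp / (tp + fn)      # of real refusals, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because the test set is exactly balanced (2,900 per class), accuracy is not inflated by class imbalance and can be compared directly with F1.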
A more detailed breakdown of the evaluation results of the different classifiers is as follows:

| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
| ----------------------------------------- | ---- | ---- | --- | ---- | -------- | --------- | ------ | ------ |