### Training Details

Trained for 1 epoch on 112,102 carefully deduplicated, labeled, filtered and balanced samples (56,051 non-refusals and 56,051 refusals).

Most of the samples were sourced from:

- [natong19/lmsys-chat-1m-filtered](https://huggingface.co/datasets/natong19/lmsys-chat-1m-filtered)

A majority vote from multiple refusal classifiers and LLM-as-a-judge was employed.
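The deduplicate-then-balance preparation described above can be sketched as follows. This is a minimal illustration, not the actual pipeline; the function name, the `"text"`/`"label"` fields, and the normalization rule are all hypothetical:

```python
from collections import defaultdict

def dedup_and_balance(samples):
    """Drop near-duplicates (by whitespace/case-normalized text), then
    truncate each class to the size of the smallest one so the labels
    are exactly balanced.

    `samples` is a list of dicts with hypothetical keys "text" and
    "label" ("refusal" / "non-refusal").
    """
    seen = set()
    by_label = defaultdict(list)
    for s in samples:
        # Normalize whitespace and case before checking for duplicates
        key = " ".join(s["text"].split()).lower()
        if key in seen:
            continue
        seen.add(key)
        by_label[s["label"]].append(s)
    # Exact balance: keep the same count for every label
    n = min(len(v) for v in by_label.values())
    return [s for label in sorted(by_label) for s in by_label[label][:n]]
```

Real curation would use stronger near-duplicate detection (e.g. embedding or MinHash similarity) rather than exact normalized matches, but the dedup-then-downsample order shown here is what yields an exactly balanced corpus.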
<div align="left">
<img src="figures/plot.png" width="100%" alt="Plot"/>
</div>

Inference throughput vs. F1 score on the test set (2,900 non-refusals and 2,900 refusals) for several open-source refusal classifiers. Throughput was benchmarked with sequence length 512 and batch size 16 on 1x NVIDIA RTX Pro 6000.
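As a rough illustration of how a samples/second figure like the one plotted above is obtained, here is a generic timing loop. `classify_batch` is a hypothetical stand-in for a real model forward pass, and the warmup/iteration counts are arbitrary; on GPU you would also call `torch.cuda.synchronize()` before reading the clock so queued kernels are included in the measurement:

```python
import time

def measure_throughput(classify_batch, batch, n_iters=50, warmup=5):
    """Return samples/second for a batched classifier call.

    Warmup iterations are run first and excluded from timing so that
    one-time costs (compilation, cache population) don't skew the result.
    """
    for _ in range(warmup):
        classify_batch(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        classify_batch(batch)
    elapsed = time.perf_counter() - start
    return len(batch) * n_iters / elapsed
```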
`alpha_model` is an earlier checkpoint that I wasn't completely satisfied with, but it was leveraged for the final round of data curation.

The training and test sets have similar distributions, but several factors suggest against overfitting:

- the dataset is relatively large and exactly balanced
- training was run for only a single epoch
- train and validation losses are similar
- [Minos-v1](https://huggingface.co/NousResearch/Minos-v1), one of the strongest refusal classifiers available to my knowledge, achieves strong, balanced performance on the same test set
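The derived columns in the table below (Accuracy, Precision, Recall, F1) follow directly from the confusion counts, treating refusals as the positive class. A minimal sketch of the computation:

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # of predicted refusals, how many were real
    recall = tp / (tp + fn)      # of real refusals, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because the test set is exactly balanced (2,900 per class), accuracy is not inflated by class imbalance and can be compared directly with F1.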
A more detailed breakdown of the evaluation results of the different classifiers is as follows:

| Model | TP | FN | FP | TN | Accuracy | Precision | Recall | F1 |
| ----------------------------------------- | ---- | ---- | --- | ---- | -------- | --------- | ------ | ------ |