---
base_model: Qwen/Qwen3-1.7B
library_name: transformers
model_name: constitutional-safety-classifier
tags:
- generated_from_trainer
- trl
- sft
- peft
- lora
- safety-classifier
- constitutional-ai
- trackio:https://huggingface.co/spaces/imadreamerboy/trackio
- hf_jobs
- trackio
license: other
---
# Constitutional Safety Classifier
This model is a LoRA fine-tune of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) trained with TRL SFT as a **next-token safety classifier**. Given a constitution and content to classify, it predicts one of two labels:
- `safe`
- `unsafe`
The model is intended for research and evaluation of constitutional safety classification, not as a standalone production guardrail.
## Paper-aligned evaluation
I evaluated this model against the protocol style of Anthropic's **Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming** ([arXiv:2501.18837](https://arxiv.org/abs/2501.18837)).
The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies matching the paper's key axes:
1. held-out classifier accuracy,
2. harmful recall / missed-unsafe rate as an ASR proxy,
3. over-refusal / false-positive rate on benign but safety-adjacent prompts.
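The three axes above reduce to ratios over a confusion matrix. The following is a minimal sketch, not the code from the evaluation script: the function name `binary_metrics`, the label convention (`1` = unsafe, `0` = safe), and the `score >= threshold` decision rule are assumptions made for illustration.

```python
def binary_metrics(labels, scores, threshold=0.5):
    """Compute proxy metrics from gold labels and P(unsafe) scores.

    Assumed convention: label 1 = unsafe, label 0 = safe, and a prompt is
    flagged unsafe when score >= threshold.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    def ratio(num, den):
        # Undefined ratios (e.g. recall on a benign-only set) become None.
        return num / den if den else None

    return {
        "tpr": ratio(tp, tp + fn),            # unsafe recall
        "missed_unsafe": ratio(fn, tp + fn),  # ASR proxy, equals 1 - TPR
        "fpr": ratio(fp, fp + tn),            # over-refusal rate
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


if __name__ == "__main__":
    toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.3, 0.4]
    print(binary_metrics(toy_labels, toy_scores))
```

On a dataset that contains only benign prompts, recall and the ASR proxy are undefined, so this sketch returns `None` for them, analogous to the `n/a` cells in the results table.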
Full results are in [`constitutional_eval_results.json`](./constitutional_eval_results.json). The reproducible script is [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py).
### Results at threshold 0.5
| Dataset / proxy | N | Unsafe recall / TPR | Missed unsafe / ASR proxy | Over-refusal / FPR | Unsafe F1 | AUROC |
|---|---:|---:|---:|---:|---:|---:|
| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |
### Interpretation
The model has strong harmful-content recall across the public proxy datasets: unsafe recall ranges from **93.6%** on the held-out classifier data to **99.6%** on OR-Bench toxic, and is at least 95% on every external dataset. This suggests the fine-tuning taught the model to recognize many unsafe and jailbreak-like prompts.
The main weakness is **over-refusal**. At threshold `0.5`, the model flags many benign but safety-adjacent prompts as unsafe: FPR is 42-46% on Aegis 2.0 and reaches **95%** on OR-Bench hard benign prompts. This is far above the paper-style target of roughly **≤5% FPR (or FPR increase)** on over-refusal datasets. On clearly benign MMLU chemistry questions, by contrast, FPR is only 0.33%.
The scores still separate the two classes reasonably well: AUROC is **0.930** on the held-out classifier set and **0.973** on ToxicChat. Even so, deployment would require threshold calibration and likely more benign hard-negative training data.
Held-out threshold sweep:
| Constraint | Threshold | TPR | FPR |
|---|---:|---:|---:|
| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |
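A sweep like the one in the table can be sketched as follows. This is an illustrative assumption about the procedure, not the script's actual implementation: the function name `threshold_for_max_fpr` is hypothetical, labels use `1` = unsafe and `0` = safe, and ties in TPR are broken in favor of the higher threshold (lower FPR).

```python
def threshold_for_max_fpr(labels, scores, max_fpr):
    """Return (threshold, tpr, fpr) with the highest TPR subject to fpr <= max_fpr.

    Candidate thresholds are the observed scores; a prompt is flagged unsafe
    when score >= threshold. Returns None if no candidate meets the constraint.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one unsafe and one safe example")

    best = None
    # Walk thresholds from strict to lenient; FPR can only grow along the way.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        tpr, fpr = tp / positives, fp / negatives
        if fpr > max_fpr:
            break
        if best is None or tpr > best[1]:
            best = (t, tpr, fpr)
    return best


if __name__ == "__main__":
    toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.3, 0.4]
    for budget in (0.0, 0.25, 1.0):
        print(budget, threshold_for_max_fpr(toy_labels, toy_scores, budget))
```

Because a tighter FPR budget can only push the threshold up, two budgets may select the same operating point, as the first two rows of the table do.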
## Reproduce evaluation
```bash
pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece
python evaluate_constitutional_classifier.py \
--max-per-dataset 500 \
--batch-size 8 \
--max-length 2048 \
--threshold 0.5 \
--output constitutional_eval_results.json
```
The evaluator loads the base model, applies this LoRA adapter, formats prompts with [`constitution.json`](./constitution.json), and scores the next-token probability mass assigned to safe/unsafe label tokens.
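The scoring step can be illustrated without loading the model. The sketch below assumes the next-token logits are already available as a plain list and that the probability mass on the label tokens is renormalized over `safe` and `unsafe` only; the function name `unsafe_probability` and the token ids are hypothetical, and the actual script may differ in how it maps labels to tokens.

```python
import math


def unsafe_probability(logits, safe_ids, unsafe_ids):
    """Turn next-token logits into P(unsafe) restricted to the label tokens.

    logits: list of floats, one per vocabulary entry.
    safe_ids / unsafe_ids: token ids whose mass counts toward each label
    (hypothetical values; real ids come from the tokenizer).
    """
    # Numerically stable softmax over the full vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)

    safe_mass = sum(exps[i] for i in safe_ids) / total
    unsafe_mass = sum(exps[i] for i in unsafe_ids) / total

    # Renormalize so the two label masses sum to 1 (an assumption of this sketch).
    return unsafe_mass / (safe_mass + unsafe_mass)


if __name__ == "__main__":
    toy_logits = [1.0, 3.0, 0.0]  # toy 3-token vocabulary
    print(unsafe_probability(toy_logits, safe_ids=[0], unsafe_ids=[1]))
```

Renormalizing over the label tokens keeps the score in [0, 1] even when some probability mass falls on unrelated tokens, which makes a fixed threshold such as `0.5` easier to interpret.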
## Usage
This repository contains a PEFT LoRA adapter. For direct scoring, use the evaluation script above. The snippet below only loads the base model and attaches the adapter:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen3-1.7B"
adapter = "imadreamerboy/constitutional-safety-classifier"

# Load the tokenizer and base weights, then attach the LoRA adapter on top.
tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model, dtype="auto", device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # inference mode: disables dropout
```
For robust classification, prefer next-token scoring of `safe` vs `unsafe` as implemented in [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py), rather than free-form generation parsing.
## Training procedure
This model was trained with supervised fine-tuning (SFT) using TRL, with a LoRA adapter applied through PEFT.
### Framework versions
- TRL: 1.2.0
- Transformers: 5.5.4
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2