	---
	base_model: Qwen/Qwen3-1.7B
	library_name: transformers
	model_name: constitutional-safety-classifier
	tags:
	- generated_from_trainer
	- trl
	- sft
	- peft
	- lora
	- safety-classifier
	- constitutional-ai
	- trackio:https://huggingface.co/spaces/imadreamerboy/trackio
	- hf_jobs
	- trackio
	license: other
	---

	# Constitutional Safety Classifier

	This model is a LoRA fine-tune of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) trained with TRL SFT as a next-token safety classifier. Given a constitution and content to classify, it predicts one of two labels:

	- `safe`
	- `unsafe`

	The model is intended for research and evaluation of constitutional safety classification, not as a complete production guardrail by itself.

	## Paper-aligned evaluation

	I evaluated this model following the protocol style of Anthropic's *Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming* ([arXiv:2501.18837](https://arxiv.org/abs/2501.18837)).

	The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies matching the paper's key axes:

	1. held-out classifier accuracy,
	2. harmful recall / missed-unsafe rate as an ASR proxy,
	3. over-refusal / false-positive rate on benign but safety-adjacent prompts.
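
The three axes above reduce to confusion-matrix ratios. The following is a minimal sketch, not the code in `evaluate_constitutional_classifier.py`: the function name `proxy_metrics`, the label convention (1 = unsafe), and the `>=` comparison against the threshold are assumptions made for illustration.

```python
import numpy as np

def proxy_metrics(y_true, p_unsafe, threshold=0.5):
    """Compute proxy metrics from binary labels (1 = unsafe) and unsafe scores."""
    y_true = np.asarray(y_true, dtype=bool)
    # Assumption: a sample is flagged unsafe when its score reaches the threshold.
    y_pred = np.asarray(p_unsafe, dtype=float) >= threshold

    tp = int(np.sum(y_true & y_pred))    # unsafe, flagged
    fn = int(np.sum(y_true & ~y_pred))   # unsafe, missed
    fp = int(np.sum(~y_true & y_pred))   # benign, flagged (over-refusal)
    tn = int(np.sum(~y_true & ~y_pred))  # benign, passed

    def ratio(num, den):
        # Undefined ratios (e.g. FPR on an all-unsafe set) come back as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "tpr": ratio(tp, tp + fn),        # unsafe recall
        "asr_proxy": ratio(fn, tp + fn),  # missed-unsafe rate = 1 - TPR
        "fpr": ratio(fp, fp + tn),        # over-refusal rate
    }

metrics = proxy_metrics([1, 1, 1, 0, 0], [0.9, 0.7, 0.2, 0.6, 0.1])
```

On a dataset with only one class, the ratio whose denominator is empty is undefined, which is why this sketch returns NaN there instead of a number.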

	Full results are in [`constitutional_eval_results.json`](./constitutional_eval_results.json). The reproducible script is [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py).

	### Results at threshold 0.5

	| Dataset / proxy | N | Unsafe recall / TPR | Missed unsafe / ASR proxy | Over-refusal / FPR | Unsafe F1 | AUROC |
	|---|---:|---:|---:|---:|---:|---:|
	| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
	| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
	| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
	| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
	| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
	| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
	| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
	| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
	| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |

	### Interpretation

	The model has strong harmful-content recall across the public proxy datasets: unsafe recall ranges from 93.6% (held-out classifier data) to 99.6% (OR-Bench toxic). This suggests the fine-tuning taught the model to recognize many unsafe and jailbreak-like prompts.

	The main weakness is over-refusal. At threshold `0.5`, the model flags many benign but safety-adjacent prompts as unsafe: FPR is 95% on OR-Bench hard benign prompts and 32-46% on the held-out, Aegis 2.0, and BeaverTails sets. This is far above the paper-style target of roughly ≤5% FPR (or FPR increase) on over-refusal datasets.

	The scores still separate the two classes reasonably well: AUROC is 0.930 on the held-out classifier set and 0.973 on ToxicChat. However, deployment would require threshold calibration and likely more benign hard-negative training data. Calibration alone is costly: in the held-out sweep below, keeping FPR at or under 5% lowers TPR to 0.728.

	Held-out threshold sweep:

	| Constraint | Threshold | TPR | FPR |
	|---|---:|---:|---:|
	| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
	| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
	| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |
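
A sweep like the one above can be sketched as follows. This is an illustrative implementation under stated assumptions, not necessarily how the evaluation script does it: the helper name `threshold_for_max_fpr` is hypothetical, predictions use `score >= threshold`, and both classes must be present in the data.

```python
import numpy as np

def threshold_for_max_fpr(y_true, p_unsafe, max_fpr):
    """Return (threshold, tpr, fpr) for the lowest threshold with FPR <= max_fpr.

    Returns None if even the highest candidate threshold exceeds max_fpr.
    Assumes y_true (1 = unsafe) contains at least one sample of each class.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(p_unsafe, dtype=float)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()

    best = None
    # Candidate thresholds: every distinct score, highest first.
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        fpr = np.sum(pred & ~y_true) / n_neg
        tpr = np.sum(pred & y_true) / n_pos
        if fpr > max_fpr:
            break  # FPR can only grow as the threshold drops further
        best = (float(t), float(tpr), float(fpr))
    return best

labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.99, 0.95, 0.80, 0.40, 0.90, 0.30, 0.20, 0.10]
strict = threshold_for_max_fpr(labels, scores, max_fpr=0.0)
loose = threshold_for_max_fpr(labels, scores, max_fpr=0.25)
```

Because the number of benign samples is finite, achievable FPR values move in discrete steps, so two nearby constraints can select the same threshold.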

	## Reproduce evaluation

	```bash
	pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece

	python evaluate_constitutional_classifier.py \
	  --max-per-dataset 500 \
	  --batch-size 8 \
	  --max-length 2048 \
	  --threshold 0.5 \
	  --output constitutional_eval_results.json
	```

	The evaluator loads the base model, applies this LoRA adapter, formats prompts with [`constitution.json`](./constitution.json), and scores the next-token probability mass assigned to safe/unsafe label tokens.
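
The scoring step can be illustrated with a small sketch that turns one vector of next-token logits into an unsafe score. Several things here are assumptions rather than facts about the evaluation script: the function name `unsafe_probability`, the renormalization over the two label groups, and the token ids, which in practice would come from the tokenizer and may include several variants per label (for example with and without a leading space).

```python
import numpy as np

def unsafe_probability(next_token_logits, safe_ids, unsafe_ids):
    """P(unsafe), renormalized over the safe/unsafe label tokens only."""
    logits = np.asarray(next_token_logits, dtype=np.float64)
    # Softmax over the full vocabulary, shifted by the max for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Sum probability mass over every token id that counts as each label.
    safe_mass = probs[list(safe_ids)].sum()
    unsafe_mass = probs[list(unsafe_ids)].sum()
    return float(unsafe_mass / (safe_mass + unsafe_mass))

# Toy 6-token vocabulary; ids 2 and 4 are hypothetical stand-ins for the labels.
toy_logits = [0.1, -1.0, 2.0, 0.3, 3.0, -0.5]
score = unsafe_probability(toy_logits, safe_ids=[2], unsafe_ids=[4])
```

With a real model, `next_token_logits` would be the logits at the last prompt position, and the resulting score would be compared against the chosen threshold.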

	## Usage

	This repository contains a PEFT LoRA adapter. For direct scoring, use the evaluation script above. Minimal example that loads the base model and applies the adapter:

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	base_model = "Qwen/Qwen3-1.7B"
	adapter = "imadreamerboy/constitutional-safety-classifier"

	# Load the base tokenizer and model, then attach the LoRA adapter on top.
	tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(base_model, dtype="auto", device_map="auto", trust_remote_code=True)
	model = PeftModel.from_pretrained(model, adapter)
	model.eval()  # inference mode: disables dropout
	```

	For robust classification, prefer next-token scoring of `safe` vs `unsafe` as implemented in [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py), rather than free-form generation parsing.

	## Training procedure

	This model was trained with supervised fine-tuning (SFT) using TRL, as a LoRA adapter (via PEFT) on top of Qwen/Qwen3-1.7B.

	### Framework versions

	- TRL: 1.2.0
	- Transformers: 5.5.4
	- PyTorch: 2.11.0
	- Datasets: 4.8.4
	- Tokenizers: 0.22.2