---
base_model: Qwen/Qwen3-1.7B
library_name: transformers
model_name: constitutional-safety-classifier
tags:
- generated_from_trainer
- trl
- sft
- peft
- lora
- safety-classifier
- constitutional-ai
- trackio:https://huggingface.co/spaces/imadreamerboy/trackio
- hf_jobs
- trackio
license: other
---

# Constitutional Safety Classifier

This model is a LoRA fine-tune of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) trained with TRL SFT as a **next-token safety classifier**. Given a constitution and content to classify, it predicts one of two labels:

- `safe`
- `unsafe`

The model is intended for research and evaluation of constitutional safety classification, not as a complete production guardrail by itself.

## Paper-aligned evaluation

I evaluated this model against the protocol style of Anthropic's **Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming** ([arXiv:2501.18837](https://arxiv.org/abs/2501.18837)). The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies matching the paper's key axes:

1. held-out classifier accuracy,
2. harmful recall / missed-unsafe rate as an ASR proxy,
3. over-refusal / false-positive rate on benign but safety-adjacent prompts.

Full results are in [`constitutional_eval_results.json`](./constitutional_eval_results.json). The reproducible script is [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py).
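The three axes above reduce to standard confusion-matrix quantities. The following is a minimal, dependency-free sketch of how they can be computed from binary labels and `unsafe` scores; it is not taken from `evaluate_constitutional_classifier.py`. The function name `binary_metrics`, the label encoding (1 = unsafe, 0 = safe), and the `>=` comparison against the threshold are assumptions made for illustration.

```python
def binary_metrics(labels, scores, threshold=0.5):
    """Compute recall, missed-unsafe rate, FPR, and F1 for the unsafe class.

    labels: iterable of 0/1 (assumed encoding: 1 = unsafe, 0 = safe)
    scores: iterable of floats, the predicted probability of `unsafe`
    """
    # Assumption: a score equal to the threshold counts as unsafe.
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(labels, preds))

    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    # Metrics are undefined (None) when the relevant class is absent,
    # which corresponds to the "n/a" cells for benign-only datasets.
    tpr = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    missed = (1 - tpr) if tpr is not None else None
    f1 = (2 * tp / (2 * tp + fp + fn)) if (tp + fn) else None

    return {"tpr": tpr, "missed_unsafe": missed, "fpr": fpr, "unsafe_f1": f1}


if __name__ == "__main__":
    # Toy data, not from any of the evaluation datasets.
    print(binary_metrics([1, 1, 1, 0, 0], [0.9, 0.8, 0.2, 0.6, 0.1]))
```

The missed-unsafe rate is simply `1 - TPR`, which is why the two columns in the results table always sum to one.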

### Results at threshold 0.5

| Dataset / proxy | N | Unsafe recall / TPR | Missed unsafe / ASR proxy | Over-refusal / FPR | Unsafe F1 | AUROC |
|---|---:|---:|---:|---:|---:|---:|
| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |

### Interpretation

The model has strong harmful-content recall across public proxy datasets: unsafe recall ranges from **93.6% to 99.6%** across the seven datasets that contain unsafe examples. This suggests the fine-tuning successfully taught the model to recognize many unsafe and jailbreak-like prompts.

The main weakness is **over-refusal**. At threshold `0.5`, the model flags many benign but safety-adjacent prompts as unsafe, especially on OR-Bench hard benign prompts, where FPR is **95%**. This is much higher than the paper-style target of roughly **≤5% FPR / increased FPR** on over-refusal datasets.

The held-out score distribution is still separable: AUROC is **0.930** on the held-out classifier set and **0.973** on ToxicChat. However, deployment would require threshold calibration and likely more benign hard-negative training data.
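
Threshold calibration of the kind mentioned above can be sketched as follows: scan candidate thresholds from high to low and keep the lowest one whose false-positive rate on calibration data stays within a cap. This is an illustrative sketch, not the procedure used by the evaluation script; the function name `pick_threshold`, the label encoding (1 = unsafe, 0 = safe), and the `>=` comparison are assumptions.

```python
def pick_threshold(labels, scores, max_fpr):
    """Return (threshold, tpr, fpr) for the lowest threshold with FPR <= max_fpr.

    labels: list of 0/1 (assumed encoding: 1 = unsafe, 0 = safe)
    scores: list of floats, the predicted probability of `unsafe`
    Returns None if no candidate threshold satisfies the constraint.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("calibration data needs both safe and unsafe examples")

    best = None
    # Lowering the threshold can only add predicted positives, so FPR is
    # non-decreasing along this scan and we can stop at the first violation.
    # Quadratic in the number of examples; fine for a few hundred rows.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (t, tp / n_pos, fpr)
    return best


if __name__ == "__main__":
    # Toy data, not from the held-out set.
    toy_labels = [1, 1, 1, 0, 0, 0, 0]
    toy_scores = [0.99, 0.9, 0.4, 0.95, 0.3, 0.2, 0.1]
    print(pick_threshold(toy_labels, toy_scores, max_fpr=0.25))
```

A threshold chosen this way is only as reliable as the calibration set: with a few hundred benign examples, an FPR cap of 0.5% corresponds to at most one or two false positives, so the resulting operating point is noisy.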

Held-out threshold sweep:

| Constraint | Threshold | TPR | FPR |
|---|---:|---:|---:|
| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |

## Reproduce evaluation

```bash
pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece

python evaluate_constitutional_classifier.py \
  --max-per-dataset 500 \
  --batch-size 8 \
  --max-length 2048 \
  --threshold 0.5 \
  --output constitutional_eval_results.json
```

The evaluator loads the base model, applies this LoRA adapter, formats prompts with [`constitution.json`](./constitution.json), and scores the next-token probability mass assigned to safe/unsafe label tokens.

## Usage

This repository contains a PEFT LoRA adapter. For direct scoring, use the evaluation script above. Minimal setup for generation-style use (loads the base model and attaches the adapter):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen3-1.7B"
adapter = "imadreamerboy/constitutional-safety-classifier"

# Load the tokenizer and base model, then attach the LoRA adapter on top.
tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model, dtype="auto", device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # inference mode: disables dropout
```

For robust classification, prefer next-token scoring of `safe` vs `unsafe` as implemented in [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py), rather than free-form generation parsing.

## Training procedure

This model was trained with SFT.

### Framework versions

- TRL: 1.2.0
- Transformers: 5.5.4
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2
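
As a note on the next-token scoring recommended in the Usage section, the core arithmetic can be illustrated without loading a model. The sketch below takes a vector of next-token logits and turns the mass on the label tokens into a single `unsafe` score. It is an illustration under stated assumptions, not the implementation in `evaluate_constitutional_classifier.py`: the function name `unsafe_probability` is hypothetical, the token ids are placeholders that would come from the tokenizer in practice, and renormalizing over only the two labels is an assumed design choice.

```python
import math


def unsafe_probability(next_token_logits, safe_ids, unsafe_ids):
    """Return the probability of `unsafe`, renormalized over the two labels.

    next_token_logits: list of floats, one logit per vocabulary entry,
        taken at the position where the label is expected.
    safe_ids / unsafe_ids: lists of token ids treated as spelling each label
        (hypothetical placeholders; real ids depend on the tokenizer).
    """
    # Subtract the max logit for numerical stability. The softmax
    # denominator cancels in the ratio below, so it is never computed.
    m = max(next_token_logits)
    safe_mass = sum(math.exp(next_token_logits[i] - m) for i in safe_ids)
    unsafe_mass = sum(math.exp(next_token_logits[i] - m) for i in unsafe_ids)
    return unsafe_mass / (safe_mass + unsafe_mass)


if __name__ == "__main__":
    # Toy 4-token vocabulary: id 1 plays the role of `safe`, id 2 of `unsafe`.
    toy_logits = [0.0, 2.0, 1.0, -1.0]
    print(unsafe_probability(toy_logits, safe_ids=[1], unsafe_ids=[2]))
```

Scoring this way yields a continuous value that can be compared against a threshold, whereas parsing generated text only yields a hard label and can fail when the model emits anything other than the two expected words.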