---
base_model: Qwen/Qwen3-1.7B
library_name: transformers
model_name: constitutional-safety-classifier
tags:
- generated_from_trainer
- trl
- sft
- peft
- lora
- safety-classifier
- constitutional-ai
- trackio:https://huggingface.co/spaces/imadreamerboy/trackio
- hf_jobs
- trackio
license: other
---

# Constitutional Safety Classifier

This model is a LoRA fine-tune of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) trained with TRL SFT as a **next-token safety classifier**. Given a constitution and content to classify, it predicts one of two labels:

- `safe`
- `unsafe`

The model is intended for research and evaluation of constitutional safety classification, not as a complete production guardrail by itself.

## Paper-aligned evaluation

I evaluated this model in the style of the protocol from Anthropic's **Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming** ([arXiv:2501.18837](https://arxiv.org/abs/2501.18837)).

The exact Anthropic CBRN jailbreak dataset and Claude.ai production traffic are not public, so the evaluation uses public proxies for the paper's key axes:

1. held-out classifier accuracy,
2. harmful recall, with the missed-unsafe rate (1 - recall) as a proxy for attack success rate (ASR),
3. over-refusal, measured as the false-positive rate (FPR) on benign but safety-adjacent prompts.

Full results are in [`constitutional_eval_results.json`](./constitutional_eval_results.json). The script that reproduces them is [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py).
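
As a rough illustration of axes 2 and 3, the sketch below computes unsafe recall, the missed-unsafe rate, and the false-positive rate from binary labels and classifier scores. The function name `rates_at_threshold`, the convention that label `1` means unsafe, and the `score >= threshold` decision rule are assumptions made for this example; they are not taken from the evaluation script.

```python
def rates_at_threshold(labels, scores, threshold=0.5):
    """Return (tpr, missed, fpr), treating label 1 as unsafe (assumed convention)."""
    tp = fn = fp = tn = 0
    for y, s in zip(labels, scores):
        predicted_unsafe = s >= threshold  # assumed decision rule
        if y == 1:
            if predicted_unsafe:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_unsafe:
                fp += 1
            else:
                tn += 1
    # A rate is undefined (NaN) when its class is absent from the data.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    missed = 1.0 - tpr  # missed-unsafe rate, used as the ASR proxy
    return tpr, missed, fpr


# Toy data for illustration only, not real evaluation output.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.99, 0.90, 0.70, 0.30, 0.80, 0.40, 0.10, 0.05]
tpr, missed, fpr = rates_at_threshold(labels, scores, threshold=0.5)
print(f"TPR={tpr:.2f} missed={missed:.2f} FPR={fpr:.2f}")
```

A dataset that contains only one class leaves the other class's rate undefined, which this sketch reports as NaN.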

### Results at threshold 0.5

| Dataset / proxy | N | Unsafe recall / TPR | Missed unsafe / ASR proxy | Over-refusal / FPR | Unsafe F1 | AUROC |
|---|---:|---:|---:|---:|---:|---:|
| Held-out classifier data | 500 | 0.936 | 0.064 | 0.332 | 0.825 | 0.930 |
| ToxicChat toxicity | 500 | 0.960 | 0.040 | 0.160 | 0.906 | 0.973 |
| ToxicChat jailbreak subset | 66 | 0.985 | 0.015 | 0.000 | 0.992 | n/a |
| Aegis 2.0 prompt-only | 500 | 0.964 | 0.036 | 0.460 | 0.795 | 0.923 |
| Aegis 2.0 prompt+response | 500 | 0.980 | 0.020 | 0.424 | 0.815 | 0.937 |
| BeaverTails prompt+response | 500 | 0.952 | 0.048 | 0.324 | 0.837 | 0.934 |
| OR-Bench toxic | 500 | 0.996 | 0.004 | 0.000 | 0.998 | n/a |
| OR-Bench hard benign | 500 | n/a | n/a | 0.950 | n/a | n/a |
| MMLU chemistry benign | 303 | n/a | n/a | 0.0033 | n/a | n/a |
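
The F1 column can be cross-checked against the TPR and FPR columns. The sketch below recomputes unsafe F1 for the five mixed datasets under the assumption that each 500-example sample is balanced (250 unsafe, 250 benign). That split is not stated in this card; it is an assumption, and under it the recomputed values agree with the reported ones to three decimals. The helper name `f1_from_rates` is hypothetical.

```python
def f1_from_rates(tpr, fpr, n_pos, n_neg):
    """Unsafe-class F1 implied by TPR and FPR for given class sizes."""
    tp = tpr * n_pos
    fn = (1.0 - tpr) * n_pos
    fp = fpr * n_neg
    return 2 * tp / (2 * tp + fp + fn)


# (TPR, FPR, reported F1) copied from the results table above.
reported = {
    "Held-out classifier data": (0.936, 0.332, 0.825),
    "ToxicChat toxicity": (0.960, 0.160, 0.906),
    "Aegis 2.0 prompt-only": (0.964, 0.460, 0.795),
    "Aegis 2.0 prompt+response": (0.980, 0.424, 0.815),
    "BeaverTails prompt+response": (0.952, 0.324, 0.837),
}

recomputed = {}
for name, (tpr, fpr, f1) in reported.items():
    # ASSUMPTION: 250 unsafe and 250 benign examples per dataset.
    recomputed[name] = f1_from_rates(tpr, fpr, n_pos=250, n_neg=250)
    print(f"{name}: reported F1={f1:.3f}, recomputed={recomputed[name]:.3f}")
```

With equal class sizes the formula reduces to `2 * tpr / (1 + tpr + fpr)`, which makes explicit how a high FPR pulls F1 down even when recall is high.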

### Interpretation

The model has strong harmful-content recall across the public proxy datasets: at threshold `0.5`, unsafe recall ranges from **93.6%** (held-out classifier data) to **99.6%** (OR-Bench toxic). This suggests the fine-tuning taught the model to recognize many unsafe and jailbreak-like prompts.

The main weakness is **over-refusal**. At threshold `0.5`, the model flags many benign but safety-adjacent prompts as unsafe: FPR is 32-46% on the held-out, Aegis 2.0, and BeaverTails sets and reaches **95%** on OR-Bench hard benign prompts. This is much higher than the paper-style target of roughly **≤5% FPR** (or increase in FPR) on over-refusal datasets.

The score distributions are nonetheless separable: AUROC is **0.930** on the held-out classifier set and **0.973** on ToxicChat, so raising the threshold can trade recall for fewer false positives. Deployment would still require threshold calibration and likely more benign hard-negative training data.

Held-out threshold sweep:

| Constraint | Threshold | TPR | FPR |
|---|---:|---:|---:|
| FPR ≤ 0.5% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 1% | 0.997 | 0.220 | 0.000 |
| FPR ≤ 5% | 0.981 | 0.728 | 0.032 |
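
One simple way to produce a sweep like this is to scan candidate thresholds from high to low and keep the lowest one whose FPR stays within the constraint. The sketch below does that on toy data. The function name `pick_threshold`, the use of observed scores as candidate thresholds, and the `score >= threshold` rule are all assumptions; the evaluation script may use a fixed grid or a different tie-breaking rule.

```python
def pick_threshold(labels, scores, max_fpr):
    """Lowest threshold with FPR <= max_fpr, as (threshold, tpr, fpr), or None.

    Assumes label 1 means unsafe and that both classes are present.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    # FPR can only grow as the threshold falls, so stop at the first violation.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (t, tp / n_pos, fpr)
    return best


# Toy data for illustration only, not real evaluation output.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.99, 0.90, 0.70, 0.30, 0.80, 0.40, 0.10, 0.05]
strict = pick_threshold(labels, scores, max_fpr=0.0)
loose = pick_threshold(labels, scores, max_fpr=0.25)
print("FPR <= 0%: ", strict)
print("FPR <= 25%:", loose)
```

Choosing the lowest admissible threshold maximizes recall under the constraint, which matches the pattern in the table: loosening the FPR budget lowers the threshold and raises TPR.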

## Reproduce evaluation

```bash
pip install transformers peft accelerate datasets scikit-learn huggingface_hub sentencepiece
python evaluate_constitutional_classifier.py \
  --max-per-dataset 500 \
  --batch-size 8 \
  --max-length 2048 \
  --threshold 0.5 \
  --output constitutional_eval_results.json
```

The evaluator loads the base model, applies this LoRA adapter, formats prompts with [`constitution.json`](./constitution.json), and scores the next-token probability mass assigned to safe/unsafe label tokens.
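
To make the scoring step concrete, here is a minimal sketch of turning next-token logits into an unsafe score. It applies a softmax over the vocabulary, sums the probability mass on each label's token ids, and renormalizes over the two labels. The function name `unsafe_probability`, the toy token ids, and the renormalization step are assumptions for illustration; the evaluation script may treat the label tokens differently.

```python
import math


def unsafe_probability(logits, safe_ids, unsafe_ids):
    """P(unsafe) renormalized over the safe/unsafe label tokens (assumed scheme)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    safe_mass = sum(probs[i] for i in safe_ids)
    unsafe_mass = sum(probs[i] for i in unsafe_ids)
    return unsafe_mass / (safe_mass + unsafe_mass)


# Toy 4-token vocabulary: id 1 stands in for "safe", id 2 for "unsafe" (hypothetical ids).
logits = [0.0, 2.0, 1.0, -1.0]
score = unsafe_probability(logits, safe_ids=[1], unsafe_ids=[2])
print(f"unsafe score = {score:.4f}")
```

A label may be tokenized into more than one candidate first token (for example with and without a leading space), which is why the sketch accepts lists of ids rather than a single id per label.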

## Usage

This repository contains a PEFT LoRA adapter, not full model weights. For direct scoring, use the evaluation script above. The snippet below only loads the base model and attaches the adapter:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen3-1.7B"
adapter = "imadreamerboy/constitutional-safety-classifier"

# Load the tokenizer and base model, then attach the LoRA adapter.
tok = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model, dtype="auto", device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, adapter)
model.eval()  # inference mode: disables dropout
```

For robust classification, prefer next-token scoring of `safe` vs `unsafe` as implemented in [`evaluate_constitutional_classifier.py`](./evaluate_constitutional_classifier.py), rather than parsing free-form generations.

## Training procedure

This model was trained with SFT.

### Framework versions

- TRL: 1.2.0
- Transformers: 5.5.4
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2