---
license: mit
language: en
tags:
- text-classification
- ai-safety
- refusal-detection
- bert
- multi-task-learning
metrics:
- accuracy
- f1
widget:
- text: "I cannot access that file because it contains sensitive credentials."
  example_title: "Refusal with Explanation"
- text: "Sure, let me read that file for you."
  example_title: "Compliant Response"
- text: "I cannot do that."
  example_title: "Refusal without Explanation"
- text: "I'll run that command now using the bash tool."
  example_title: "Compliant Action"
---

# AI Safety Refusal Classifier (DistilBERT)

## Model Description

This is a multi-task DistilBERT classifier designed for AI safety applications, specifically for detecting:

1. **Refusal Detection**: Whether an AI assistant response refuses a request
2. **Explanation Detection**: Whether the refusal includes an explanation of why

## Model Architecture

- **Base Model**: DistilBERT (`distilbert-base-uncased`), 66M parameters
- **Architecture**: Multi-task learning with a shared encoder and two classification heads
  - Refusal classification head: binary (Refusal vs. Compliant)
  - Explanation classification head: binary (Has Explanation vs. No Explanation)
- **Training**: Fine-tuned on 300 diverse assistant responses using a combined cross-entropy loss
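The combined cross-entropy loss can be illustrated with a minimal framework-free sketch. The card does not specify how the two per-head losses are combined, so the unweighted sum below is an assumption (a common default); the logit values are invented for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Negative log-probability of the true label
    return -math.log(softmax(logits)[label])

def combined_loss(refusal_logits, refusal_label, explanation_logits, explanation_label):
    # Each head gets its own cross-entropy; the shared encoder is
    # trained on their sum (assumed unweighted combination).
    return (cross_entropy(refusal_logits, refusal_label)
            + cross_entropy(explanation_logits, explanation_label))

# A confident, correct prediction on both heads yields a small loss
loss = combined_loss([-2.0, 3.0], 1, [4.0, -1.0], 0)
```

In a real training step this sum would be backpropagated through both heads and the shared encoder at once, which is what lets the two tasks regularize each other.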

## Performance

Trained for 3 epochs on 300 examples (80/20 train/eval split):

| Task | Accuracy | F1 Score |
|------|----------|----------|
| Refusal Detection | **98.3%** | 0.98 |
| Explanation Detection | **98.3%** | 0.98 |

**Test Examples** (perfect predictions on all test cases):

```text
"I cannot access that file because it contains sensitive credentials."
→ Refusal: True, Explanation: True

"Sure, let me read that file for you."
→ Refusal: False, Explanation: False

"I cannot do that."
→ Refusal: True, Explanation: False

"I'll run that command now using the bash tool."
→ Refusal: False, Explanation: False
```
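For context, 98.3% accuracy on the 60-example eval split (20% of 300) corresponds to 59 of 60 correct. A minimal sketch of how per-task accuracy and binary F1 can be computed; the prediction lists are invented for illustration, not the model's actual eval outputs.

```python
def accuracy(preds, labels):
    # Fraction of predictions matching the gold labels
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def binary_f1(preds, labels, positive=1):
    # F1 for the positive class: harmonic mean of precision and recall
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 60 eval examples with a single mistake: 59/60 ≈ 98.3% accuracy
labels = [1] * 30 + [0] * 30
preds = [1] * 29 + [0] + [0] * 30   # one false negative on the refusal class
acc = accuracy(preds, labels)
f1 = binary_f1(preds, labels)
```

With one false negative this gives accuracy 59/60 ≈ 0.983 and F1 58/59 ≈ 0.983, consistent with the rounded figures in the table.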

## Training Data

The model was trained on 300 synthetic assistant responses, generated from a [DeepFabric](https://deepfabric.dev) topic graph of 1,111 malicious system actions:

- **120 Compliant Responses (40%)**: Assistant agrees to perform dangerous actions
  - Examples: "I'll run that fork bomb...", "Executing the SQL injection..."
- **108 Refusals with Explanation (36%)**: Assistant refuses and explains security risks
  - Examples: "I cannot do that as it will crash the system...", "I refuse to expose those files..."
- **72 Refusals without Explanation (24%)**: Brief refusals
  - Examples: "I cannot do that.", "I'm unable to comply."
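The two binary heads jointly recover the three categories above. A small sketch of that decoding; the label names are mine, and mapping the refusal=False/explanation=True combination to "compliant" is an assumption (the card does not specify that case, and the explanation flag is only meaningful for refusals).

```python
def decode(refusal: bool, explanation: bool) -> str:
    # The explanation head only matters when the response is a refusal
    if not refusal:
        return "compliant"
    return "refusal_with_explanation" if explanation else "refusal_without_explanation"

print(decode(True, True))    # refusal_with_explanation
print(decode(True, False))   # refusal_without_explanation
print(decode(False, False))  # compliant
```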

Training data was explicitly designed to avoid overfitting on specific phrases by using varied phrasing patterns across all categories.

## Intended Use Cases

1. **Response Filtering**: Identify refusals in AI assistant outputs
2. **Safety Evaluation**: Assess model behavior on adversarial inputs
3. **Dataset Annotation**: Label large-scale conversational datasets

## Limitations

1. **Domain-Specific**: Trained on security/system administration contexts; may not generalize to other domains
2. **Binary Classification**: Cannot detect partial refusals or ambiguous responses
3. **Context-Free**: Classifies single responses without conversation history
4. **English Only**: Trained exclusively on English text

## Training Details

- **Framework**: PyTorch 2.6.0 with Hugging Face Transformers
- **Optimizer**: AdamW (lr=2e-5, weight_decay=0.01)
- **Epochs**: 3
- **Batch Size**: 16
- **Max Sequence Length**: 128 tokens
- **Hardware**: Apple Silicon (MPS) / CUDA / CPU compatible
- **Training Time**: ~10 seconds on an M-series Mac
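These hyperparameters imply a very small optimization budget, which a quick back-of-the-envelope check makes concrete (assuming the common convention of one optimizer step per batch, with a final partial batch):

```python
import math

n_examples, train_frac = 300, 0.8
batch_size, epochs = 16, 3

n_train = int(n_examples * train_frac)             # 240 training examples
n_eval = n_examples - n_train                      # 60 eval examples
steps_per_epoch = math.ceil(n_train / batch_size)  # 15 batches per epoch
total_steps = steps_per_epoch * epochs             # 45 optimizer steps in total
```

Only 45 gradient steps in total, which is consistent with the ~10-second training time reported above.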

## Ethical Considerations

This model is designed for **defensive AI safety research only**. It should not be used to:

- Bypass safety mechanisms in production AI systems
- Generate adversarial inputs for malicious purposes
- Evaluate proprietary models without authorization

The training data includes examples of dangerous actions (DoS attacks, data exfiltration, etc.) for educational purposes within controlled environments.

## Citation

```bibtex
@software{ai_safety_refusal_classifier,
  author = {Luke Hinds},
  title = {AI Safety Refusal Classifier},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/lukehinds/ai-safety-refusal-classifier}
}
```