---
license: mit
language: en
tags:
  - text-classification
  - ai-safety
  - refusal-detection
  - bert
  - multi-task-learning
metrics:
  - accuracy
  - f1
widget:
  - text: "I cannot access that file because it contains sensitive credentials."
    example_title: "Refusal with Explanation"
  - text: "Sure, let me read that file for you."
    example_title: "Compliant Response"
  - text: "I cannot do that."
    example_title: "Refusal without Explanation"
  - text: "I'll run that command now using the bash tool."
    example_title: "Compliant Action"
---

# AI Safety Refusal Classifier (BERT)

## Model Description

This is a multi-task classifier built on DistilBERT, designed for AI safety applications, specifically for detecting:

1. **Refusal Detection**: whether an AI assistant response refuses a request
2. **Explanation Detection**: whether the refusal includes an explanation of why

## Model Architecture

- **Base Model**: DistilBERT (`distilbert-base-uncased`), 66M parameters
- **Architecture**: multi-task learning with a shared encoder and two classification heads
  - Refusal classification head: binary (Refusal vs. Compliant)
  - Explanation classification head: binary (Has Explanation vs. No Explanation)
- **Training**: fine-tuned on 300 diverse assistant responses using a combined cross-entropy loss
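The layout above corresponds to a model along the following lines. This is a minimal sketch using PyTorch and the Hugging Face `transformers` API; the class name, pooling choice, and equal weighting of the two losses are assumptions and may differ from the exact implementation behind the released checkpoint.

```python
import torch.nn as nn
from transformers import AutoModel


class MultiTaskRefusalClassifier(nn.Module):
    """Shared DistilBERT encoder with two binary classification heads."""

    def __init__(self, base_model: str = "distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size  # 768 for DistilBERT
        self.refusal_head = nn.Linear(hidden, 2)      # Refusal vs. Compliant
        self.explanation_head = nn.Linear(hidden, 2)  # Has Explanation vs. No Explanation

    def forward(self, input_ids, attention_mask,
                refusal_labels=None, explanation_labels=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pool with the first token's final hidden state (a common choice; an assumption here).
        pooled = outputs.last_hidden_state[:, 0]
        refusal_logits = self.refusal_head(pooled)
        explanation_logits = self.explanation_head(pooled)

        loss = None
        if refusal_labels is not None and explanation_labels is not None:
            # Combined cross-entropy loss: sum of the two per-task losses.
            ce = nn.CrossEntropyLoss()
            loss = ce(refusal_logits, refusal_labels) + ce(explanation_logits, explanation_labels)
        return {"loss": loss,
                "refusal_logits": refusal_logits,
                "explanation_logits": explanation_logits}
```

Given a trained instance, a response is classified by tokenizing it (truncated to 128 tokens, as in training) and taking the argmax of each head's logits.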
## Performance

Trained for 3 epochs on 300 examples (80/20 train/eval split):

| Task | Accuracy | F1 Score |
|------|----------|----------|
| Refusal Detection | **98.3%** | 0.98 |
| Explanation Detection | **98.3%** | 0.98 |

**Test Examples**:

```python
# Perfect predictions on all test cases:
"I cannot access that file because it contains sensitive credentials."
→ Refusal: True, Explanation: True

"Sure, let me read that file for you."
→ Refusal: False, Explanation: False

"I cannot do that."
→ Refusal: True, Explanation: False

"I'll run that command now using the bash tool."
→ Refusal: False, Explanation: False
```

## Training Data

The model was trained on 300 synthetic assistant responses generated using a [DeepFabric](https://deepfabric.dev) topic graph of 1,111 malicious system actions:

- **120 Compliant Responses (40%)**: the assistant agrees to perform dangerous actions
  - Examples: "I'll run that fork bomb...", "Executing the SQL injection..."
- **108 Refusals with Explanation (36%)**: the assistant refuses and explains the security risks
  - Examples: "I cannot do that as it will crash the system...", "I refuse to expose those files..."
- **72 Refusals without Explanation (24%)**: brief refusals
  - Examples: "I cannot do that.", "I'm unable to comply."

The training data uses varied phrasing across all categories to discourage overfitting on specific surface patterns.

## Intended Use Cases

1. **Response Filtering**: identify refusals in AI assistant outputs
2. **Safety Evaluation**: assess model behavior on adversarial inputs
3. **Dataset Annotation**: label large-scale conversational datasets

## Limitations

1. **Domain-Specific**: trained on security/system-administration contexts; may not generalize to other domains
2. **Binary Classification**: cannot detect partial refusals or ambiguous responses
3. **Context-Free**: classifies single responses without conversation history
4. **English Only**: trained exclusively on English text

## Training Details

- **Framework**: PyTorch 2.6.0 with Hugging Face Transformers
- **Optimizer**: AdamW (lr=2e-5, weight_decay=0.01)
- **Epochs**: 3
- **Batch Size**: 16
- **Max Sequence Length**: 128 tokens
- **Hardware**: Apple Silicon (MPS) / CUDA / CPU compatible
- **Training Time**: ~10 seconds on an M-series Mac

A sketch of the corresponding training loop appears at the end of this card.

## Ethical Considerations

This model is designed for **defensive AI safety research only**. It should not be used to:

- Bypass safety mechanisms in production AI systems
- Generate adversarial inputs for malicious purposes
- Evaluate proprietary models without authorization

The training data includes examples of dangerous actions (DoS attacks, data exfiltration, etc.) for educational purposes within controlled environments.

## Citation

```bibtex
@software{ai_safety_refusal_classifier,
  author    = {Luke Hinds},
  title     = {AI Safety Refusal Classifier},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/lukehinds/ai-safety-refusal-classifier}
}
```
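For reference, the Training Details above map onto a fine-tuning loop roughly like the sketch below. It assumes the `MultiTaskRefusalClassifier` class from the architecture sketch earlier in this card and a hypothetical `examples` list of `(text, refusal_label, explanation_label)` tuples; it is not the exact script used to produce the released checkpoint.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Assumes MultiTaskRefusalClassifier from the architecture sketch above and a list
# `examples` of (text, refusal_label, explanation_label) tuples (hypothetical format).

def collate(batch, tokenizer):
    texts, refusal, explanation = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    enc["refusal_labels"] = torch.tensor(refusal)
    enc["explanation_labels"] = torch.tensor(explanation)
    return enc


def train(examples):
    # Pick the best available device (MPS on Apple Silicon, else CUDA, else CPU).
    device = ("mps" if torch.backends.mps.is_available()
              else "cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = MultiTaskRefusalClassifier().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    loader = DataLoader(examples, batch_size=16, shuffle=True,
                        collate_fn=lambda batch: collate(batch, tokenizer))

    model.train()
    for _ in range(3):  # 3 epochs, as listed in Training Details
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch)["loss"]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```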