---
license: mit
language: en
tags:
- text-classification
- ai-safety
- refusal-detection
- bert
- multi-task-learning
metrics:
- accuracy
- f1
widget:
- text: "I cannot access that file because it contains sensitive credentials."
example_title: "Refusal with Explanation"
- text: "Sure, let me read that file for you."
example_title: "Compliant Response"
- text: "I cannot do that."
example_title: "Refusal without Explanation"
- text: "I'll run that command now using the bash tool."
example_title: "Compliant Action"
---
# AI Safety Refusal Classifier (BERT)
## Model Description
This is a multi-task DistilBERT-based classifier designed for AI safety applications, specifically for detecting:
1. **Refusal Detection**: Whether an AI assistant response refuses a request
2. **Explanation Detection**: Whether the refusal includes an explanation of why
## Model Architecture
- **Base Model**: DistilBERT (`distilbert-base-uncased`) - 66M parameters
- **Architecture**: Multi-task learning with shared encoder and two classification heads
- Refusal classification head: Binary (Refusal vs. Compliant)
- Explanation classification head: Binary (Has Explanation vs. No Explanation)
- **Training**: Fine-tuned on 300 diverse assistant responses using a combined cross-entropy loss over both heads
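The shared-encoder/two-head setup above can be sketched as follows. This is a minimal illustration in PyTorch, not the released checkpoint's code: the head names and hidden size are assumptions, and a stub linear layer stands in for DistilBERT so the sketch runs without downloading weights.

```python
import torch
import torch.nn as nn


class MultiTaskRefusalClassifier(nn.Module):
    """Shared encoder with two binary classification heads."""

    def __init__(self, hidden_size=768):
        super().__init__()
        # In the real model this would be DistilBertModel.from_pretrained(...)
        self.encoder = nn.Linear(hidden_size, hidden_size)
        self.refusal_head = nn.Linear(hidden_size, 2)      # Refusal vs. Compliant
        self.explanation_head = nn.Linear(hidden_size, 2)  # Explanation vs. none

    def forward(self, pooled):
        # Both heads read the same shared representation
        shared = torch.tanh(self.encoder(pooled))
        return self.refusal_head(shared), self.explanation_head(shared)


model = MultiTaskRefusalClassifier()
pooled = torch.randn(4, 768)  # batch of 4 pooled sentence embeddings
refusal_logits, explanation_logits = model(pooled)

# Combined loss: sum of the per-task cross-entropies
loss = nn.functional.cross_entropy(refusal_logits, torch.tensor([1, 0, 1, 0])) \
     + nn.functional.cross_entropy(explanation_logits, torch.tensor([1, 0, 0, 0]))
```

Both heads backpropagate through the shared encoder, so each task regularizes the other's representation.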
## Performance
Trained for 3 epochs on 300 examples (80/20 train/eval split):
| Task | Accuracy | F1 Score |
|------|----------|----------|
| Refusal Detection | **98.3%** | 0.98 |
| Explanation Detection | **98.3%** | 0.98 |
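As a reminder of what the reported F1 summarizes, here is a minimal binary F1 computation in plain Python. It is illustrative only; the scores in the table come from the training evaluation, not this helper.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```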
**Test Examples** (all four predicted correctly):
```text
"I cannot access that file because it contains sensitive credentials."
→ Refusal: True, Explanation: True

"Sure, let me read that file for you."
→ Refusal: False, Explanation: False

"I cannot do that."
→ Refusal: True, Explanation: False

"I'll run that command now using the bash tool."
→ Refusal: False, Explanation: False
```
## Training Data
The model was trained on 300 synthetic assistant responses generated from a [DeepFabric](https://deepfabric.dev) topic graph of 1,111 malicious system actions:
- **120 Compliant Responses (40%)**: Assistant agrees to perform dangerous actions
- Examples: "I'll run that fork bomb...", "Executing the SQL injection..."
- **108 Refusals with Explanation (36%)**: Assistant refuses and explains security risks
- Examples: "I cannot do that as it will crash the system...", "I refuse to expose those files..."
- **72 Refusals without Explanation (24%)**: Brief refusals
- Examples: "I cannot do that.", "I'm unable to comply."
Training data was explicitly designed to avoid overfitting on specific phrases by using varied phrasing patterns across all categories.
## Intended Use Cases
1. **Response Filtering**: Identify refusals in AI assistant outputs
2. **Safety Evaluation**: Assess model behavior on adversarial inputs
3. **Dataset Annotation**: Label large-scale conversational datasets
## Limitations
1. **Domain-Specific**: Trained on security/system administration contexts; may not generalize to other domains
2. **Binary Classification**: Cannot detect partial refusals or ambiguous responses
3. **Context-Free**: Classifies single responses without conversation history
4. **English Only**: Trained exclusively on English text
## Training Details
- **Framework**: PyTorch 2.6.0 with Hugging Face Transformers
- **Optimizer**: AdamW (lr=2e-5, weight_decay=0.01)
- **Epochs**: 3
- **Batch Size**: 16
- **Max Sequence Length**: 128 tokens
- **Hardware**: Apple Silicon (MPS) / CUDA / CPU compatible
- **Training Time**: ~10 seconds on an M-series Mac
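The MPS/CUDA/CPU compatibility noted above is typically handled with a standard device-selection fallback. This is a common pattern, not code taken from the training script:

```python
import torch

# Prefer Apple Silicon (MPS), then CUDA, then fall back to CPU
device = (
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)
```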
## Ethical Considerations
This model is designed for **defensive AI safety research only**. It should not be used to:
- Bypass safety mechanisms in production AI systems
- Generate adversarial inputs for malicious purposes
- Evaluate proprietary models without authorization
The training data includes examples of dangerous actions (DoS attacks, data exfiltration, etc.) for educational purposes within controlled environments.
## Citation
```bibtex
@software{ai_safety_refusal_classifier,
author = {Luke Hinds},
title = {AI Safety Refusal Classifier},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/lukehinds/ai-safety-refusal-classifier}
}
```