---
license: mit
language: en
tags:
- text-classification
- ai-safety
- refusal-detection
- bert
- multi-task-learning
metrics:
- accuracy
- f1
widget:
- text: "I cannot access that file because it contains sensitive credentials."
  example_title: "Refusal with Explanation"
- text: "Sure, let me read that file for you."
  example_title: "Compliant Response"
- text: "I cannot do that."
  example_title: "Refusal without Explanation"
- text: "I'll run that command now using the bash tool."
  example_title: "Compliant Action"
---

# AI Safety Refusal Classifier (BERT)

## Model Description

This is a multi-task BERT classifier designed for AI safety applications, specifically for detecting:
1. **Refusal Detection**: Whether an AI assistant response refuses a request
2. **Explanation Detection**: Whether the refusal includes an explanation of why

## Model Architecture

- **Base Model**: DistilBERT (`distilbert-base-uncased`) - 66M parameters
- **Architecture**: Multi-task learning with shared encoder and two classification heads
  - Refusal classification head: Binary (Refusal vs. Compliant)
  - Explanation classification head: Binary (Has Explanation vs. No Explanation)
- **Training**: Fine-tuned on 300 diverse assistant responses using a combined cross-entropy loss
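
The shared-encoder / two-head design described above can be sketched as follows. This is an illustrative reconstruction, not the repository's actual code; the class and attribute names (`RefusalClassifier`, `refusal_head`, `explanation_head`) are assumptions, and the tiny config exists only so the sketch runs without downloading pretrained weights.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel

class RefusalClassifier(nn.Module):
    """Shared DistilBERT encoder feeding two independent binary heads."""

    def __init__(self, config: DistilBertConfig):
        super().__init__()
        self.encoder = DistilBertModel(config)             # shared encoder
        self.refusal_head = nn.Linear(config.dim, 2)       # Refusal vs. Compliant
        self.explanation_head = nn.Linear(config.dim, 2)   # Has vs. no explanation

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                  # [CLS]-position embedding
        return self.refusal_head(cls), self.explanation_head(cls)

# Tiny randomly initialised config so the sketch runs offline.
cfg = DistilBertConfig(vocab_size=100, dim=32, hidden_dim=64,
                       n_layers=1, n_heads=2)
model = RefusalClassifier(cfg)
ids = torch.randint(0, 100, (4, 16))                       # batch of 4, length 16
refusal_logits, explanation_logits = model(ids)
print(refusal_logits.shape, explanation_logits.shape)      # torch.Size([4, 2]) twice
```

At inference time each head's logits are argmaxed independently, which is how a single response can be labelled, for example, "Refusal: True, Explanation: False".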

## Performance

Trained for 3 epochs on 300 examples (80/20 train/eval split):

| Task | Accuracy | F1 Score |
|------|----------|----------|
| Refusal Detection | **98.3%** | 0.98 |
| Explanation Detection | **98.3%** | 0.98 |
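
The F1 score reported here is the standard binary F1, the harmonic mean of precision and recall. As a quick refresher (the confusion counts below are illustrative, not the model's actual evaluation numbers):

```python
# Binary F1 from true positives, false positives, and false negatives.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative only: 59 of 60 refusals caught, with 1 false alarm.
print(round(f1_score(tp=59, fp=1, fn=1), 3))  # 0.983
```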

**Test Examples**:
```python
# Perfect predictions on all test cases:
"I cannot access that file because it contains sensitive credentials."
→ Refusal: True, Explanation: True

"Sure, let me read that file for you."
→ Refusal: False, Explanation: False

"I cannot do that."
→ Refusal: True, Explanation: False

"I'll run that command now using the bash tool."
→ Refusal: False, Explanation: False
```

## Training Data

The model was trained on 300 synthetic assistant responses generated from a [DeepFabric](https://deepfabric.dev) topic graph of 1,111 malicious system actions:

- **120 Compliant Responses (40%)**: Assistant agrees to perform dangerous actions
  - Examples: "I'll run that fork bomb...", "Executing the SQL injection..."
- **108 Refusals with Explanation (36%)**: Assistant refuses and explains security risks
  - Examples: "I cannot do that as it will crash the system...", "I refuse to expose those files..."
- **72 Refusals without Explanation (24%)**: Brief refusals
  - Examples: "I cannot do that.", "I'm unable to comply."

To discourage overfitting on specific trigger phrases, the training data was deliberately written with varied phrasing patterns across all three categories.


## Intended Use Cases

1. **Response Filtering**: Identify refusals in AI assistant outputs
2. **Safety Evaluation**: Assess model behavior on adversarial inputs
3. **Dataset Annotation**: Label large-scale conversational datasets

## Limitations

1. **Domain-Specific**: Trained on security/system administration contexts; may not generalize to other domains
2. **Binary Classification**: Cannot detect partial refusals or ambiguous responses
3. **Context-Free**: Classifies single responses without conversation history
4. **English Only**: Trained exclusively on English text

## Training Details

- **Framework**: PyTorch 2.6.0 with Hugging Face Transformers
- **Optimizer**: AdamW (lr=2e-5, weight_decay=0.01)
- **Epochs**: 3
- **Batch Size**: 16
- **Max Sequence Length**: 128 tokens
- **Hardware**: Apple Silicon (MPS) / CUDA / CPU compatible
- **Training Time**: ~10 seconds on an M-series Mac
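
The "combined cross-entropy loss" mentioned above is most plausibly just the sum of the two heads' per-task losses. A minimal sketch under that assumption (the logits and labels are toy values):

```python
import torch
import torch.nn.functional as F

# Toy logits for one example from each head, plus its gold labels.
refusal_logits = torch.tensor([[2.0, -1.0]])      # strongly "Refusal"
explanation_logits = torch.tensor([[0.5, 0.5]])   # undecided on explanation
refusal_labels = torch.tensor([0])
explanation_labels = torch.tensor([1])

# Combined loss: sum of the two tasks' cross-entropies, backpropagated
# through the shared encoder in a single step.
loss = (F.cross_entropy(refusal_logits, refusal_labels)
        + F.cross_entropy(explanation_logits, explanation_labels))
print(loss.item())
```

An equal-weight sum is the simplest choice; a weighted sum is a common variant when one task is harder or noisier than the other.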

## Ethical Considerations

This model is designed for **defensive AI safety research only**. It should not be used to:
- Bypass safety mechanisms in production AI systems
- Generate adversarial inputs for malicious purposes
- Evaluate proprietary models without authorization

The training data includes examples of dangerous actions (DoS attacks, data exfiltration, etc.) for educational purposes within controlled environments.

## Citation

```bibtex
@software{ai_safety_refusal_classifier,
  author = {Luke Hinds},
  title = {AI Safety Refusal Classifier},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/lukehinds/ai-safety-refusal-classifier}
}
```