---
license: mit
language: en
tags:
- text-classification
- ai-safety
- refusal-detection
- bert
- multi-task-learning
metrics:
- accuracy
- f1
widget:
- text: "I cannot access that file because it contains sensitive credentials."
  example_title: "Refusal with Explanation"
- text: "Sure, let me read that file for you."
  example_title: "Compliant Response"
- text: "I cannot do that."
  example_title: "Refusal without Explanation"
- text: "I'll run that command now using the bash tool."
  example_title: "Compliant Action"
---

# AI Safety Refusal Classifier (DistilBERT)

## Model Description

This is a multi-task DistilBERT classifier designed for AI safety applications, specifically for detecting:

1. **Refusal Detection**: Whether an AI assistant response refuses a request
2. **Explanation Detection**: Whether the refusal includes an explanation of why

## Model Architecture

- **Base Model**: DistilBERT (`distilbert-base-uncased`), 66M parameters
- **Architecture**: Multi-task learning with a shared encoder and two classification heads
  - Refusal classification head: binary (Refusal vs. Compliant)
  - Explanation classification head: binary (Has Explanation vs. No Explanation)
- **Training**: Fine-tuned on 300 diverse assistant responses using a combined cross-entropy loss
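The combined cross-entropy loss can be illustrated with a minimal framework-free sketch. The card does not specify how the two per-head losses are combined, so the unweighted sum below is an assumption (a common default); the logit values are invented for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Negative log-probability of the true label
    return -math.log(softmax(logits)[label])

def combined_loss(refusal_logits, refusal_label, explanation_logits, explanation_label):
    # Each head gets its own cross-entropy; the shared encoder is
    # trained on their sum (assumed unweighted combination).
    return (cross_entropy(refusal_logits, refusal_label)
            + cross_entropy(explanation_logits, explanation_label))

# A confident, correct prediction on both heads yields a small loss
loss = combined_loss([-2.0, 3.0], 1, [4.0, -1.0], 0)
```

In a real training step this sum would be backpropagated through both heads and the shared encoder at once, which is what lets the two tasks regularize each other.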

## Performance

Trained for 3 epochs on 300 examples (80/20 train/eval split):

| Task | Accuracy | F1 Score |
|------|----------|----------|
| Refusal Detection | **98.3%** | 0.98 |
| Explanation Detection | **98.3%** | 0.98 |

**Test Examples** (perfect predictions on all test cases):

```text
"I cannot access that file because it contains sensitive credentials."
→ Refusal: True, Explanation: True

"Sure, let me read that file for you."
→ Refusal: False, Explanation: False

"I cannot do that."
→ Refusal: True, Explanation: False

"I'll run that command now using the bash tool."
→ Refusal: False, Explanation: False
```
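For context, 98.3% accuracy on the 60-example eval split (20% of 300) corresponds to 59 of 60 correct. A minimal sketch of how per-task accuracy and binary F1 can be computed; the prediction lists are invented for illustration, not the model's actual eval outputs.

```python
def accuracy(preds, labels):
    # Fraction of predictions matching the gold labels
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def binary_f1(preds, labels, positive=1):
    # F1 for the positive class: harmonic mean of precision and recall
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 60 eval examples with a single mistake: 59/60 ≈ 98.3% accuracy
labels = [1] * 30 + [0] * 30
preds = [1] * 29 + [0] + [0] * 30   # one false negative on the refusal class
acc = accuracy(preds, labels)
f1 = binary_f1(preds, labels)
```

With one false negative this gives accuracy 59/60 ≈ 0.983 and F1 58/59 ≈ 0.983, consistent with the rounded figures in the table.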

## Training Data

The model was trained on 300 synthetic assistant responses, generated from a [DeepFabric](https://deepfabric.dev) topic graph of 1,111 malicious system actions:

- **120 Compliant Responses (40%)**: Assistant agrees to perform dangerous actions
  - Examples: "I'll run that fork bomb...", "Executing the SQL injection..."
- **108 Refusals with Explanation (36%)**: Assistant refuses and explains security risks
  - Examples: "I cannot do that as it will crash the system...", "I refuse to expose those files..."
- **72 Refusals without Explanation (24%)**: Brief refusals
  - Examples: "I cannot do that.", "I'm unable to comply."
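The two binary heads jointly recover the three categories above. A small sketch of that decoding; the label names are mine, and mapping the refusal=False/explanation=True combination to "compliant" is an assumption (the card does not specify that case, and the explanation flag is only meaningful for refusals).

```python
def decode(refusal: bool, explanation: bool) -> str:
    # The explanation head only matters when the response is a refusal
    if not refusal:
        return "compliant"
    return "refusal_with_explanation" if explanation else "refusal_without_explanation"

print(decode(True, True))    # refusal_with_explanation
print(decode(True, False))   # refusal_without_explanation
print(decode(False, False))  # compliant
```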

Training data was explicitly designed to avoid overfitting on specific phrases by using varied phrasing patterns across all categories.

## Intended Use Cases

1. **Response Filtering**: Identify refusals in AI assistant outputs
2. **Safety Evaluation**: Assess model behavior on adversarial inputs
3. **Dataset Annotation**: Label large-scale conversational datasets

## Limitations

1. **Domain-Specific**: Trained on security/system administration contexts; may not generalize to other domains
2. **Binary Classification**: Cannot detect partial refusals or ambiguous responses
3. **Context-Free**: Classifies single responses without conversation history
4. **English Only**: Trained exclusively on English text

## Training Details

- **Framework**: PyTorch 2.6.0 with Hugging Face Transformers
- **Optimizer**: AdamW (lr=2e-5, weight_decay=0.01)
- **Epochs**: 3
- **Batch Size**: 16
- **Max Sequence Length**: 128 tokens
- **Hardware**: Apple Silicon (MPS) / CUDA / CPU compatible
- **Training Time**: ~10 seconds on an M-series Mac
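These hyperparameters imply a very small optimization budget, which a quick back-of-the-envelope check makes concrete (assuming the common convention of one optimizer step per batch, with a final partial batch):

```python
import math

n_examples, train_frac = 300, 0.8
batch_size, epochs = 16, 3

n_train = int(n_examples * train_frac)             # 240 training examples
n_eval = n_examples - n_train                      # 60 eval examples
steps_per_epoch = math.ceil(n_train / batch_size)  # 15 batches per epoch
total_steps = steps_per_epoch * epochs             # 45 optimizer steps in total
```

Only 45 gradient steps in total, which is consistent with the ~10-second training time reported above.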

## Ethical Considerations

This model is designed for **defensive AI safety research only**. It should not be used to:

- Bypass safety mechanisms in production AI systems
- Generate adversarial inputs for malicious purposes
- Evaluate proprietary models without authorization

The training data includes examples of dangerous actions (DoS attacks, data exfiltration, etc.) for educational purposes within controlled environments.

## Citation

```bibtex
@software{ai_safety_refusal_classifier,
  author = {Luke Hinds},
  title = {AI Safety Refusal Classifier},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/lukehinds/ai-safety-refusal-classifier}
}
```