lukehinds committed (verified)
Commit 5072d80 · Parent: bf0e3b8

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -1,35 +1,7 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
  *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
  *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
  *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ checkpoints/checkpoint-60/rng_state.pth filter=lfs diff=lfs merge=lfs -text
+ checkpoints/checkpoint-90/rng_state.pth filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,182 @@
+ ---
+ license: mit
+ language: en
+ tags:
+ - text-classification
+ - ai-safety
+ - refusal-detection
+ - bert
+ - multi-task-learning
+ metrics:
+ - accuracy
+ - f1
+ widget:
+ - text: "I cannot access that file because it contains sensitive credentials."
+   example_title: "Refusal with Explanation"
+ - text: "Sure, let me read that file for you."
+   example_title: "Compliant Response"
+ - text: "I cannot do that."
+   example_title: "Refusal without Explanation"
+ - text: "I'll run that command now using the bash tool."
+   example_title: "Compliant Action"
+ ---
+
+ # AI Safety Refusal Classifier (BERT)
+
+ ## Model Description
+
+ This is a multi-task BERT classifier designed for AI safety applications, specifically for detecting:
+ 1. **Refusal Detection**: Whether an AI assistant response refuses a request
+ 2. **Explanation Detection**: Whether the refusal includes an explanation of why
+
+ The model is intended for use in reinforcement learning training pipelines to provide reward signals based on assistant behavior when handling potentially dangerous or malicious requests.
+
+ ## Model Architecture
+
+ - **Base Model**: DistilBERT (`distilbert-base-uncased`) - 66M parameters
+ - **Architecture**: Multi-task learning with a shared encoder and two classification heads
+   - Refusal classification head: Binary (Refusal vs. Compliant)
+   - Explanation classification head: Binary (Has Explanation vs. No Explanation)
+ - **Training**: Fine-tuned on 300 diverse assistant responses using a combined cross-entropy loss
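The two-head design above can be sketched in a few lines of PyTorch. This is a minimal illustration only, not the repository's actual `RefusalClassifierBERT` implementation: a tiny `nn.TransformerEncoder` stands in for DistilBERT so the sketch runs without downloading weights, and the class and attribute names here are hypothetical.

```python
import torch
import torch.nn as nn

class MultiTaskSketch(nn.Module):
    """Shared encoder feeding two binary classification heads."""

    def __init__(self, vocab_size=30522, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for DistilBERT
        self.refusal_head = nn.Linear(dim, 2)      # Refusal vs. Compliant
        self.explanation_head = nn.Linear(dim, 2)  # Has Explanation vs. No Explanation

    def forward(self, input_ids):
        hidden = self.encoder(self.embed(input_ids))
        cls = hidden[:, 0]  # first-token pooling, as in BERT-style classifiers
        return self.refusal_head(cls), self.explanation_head(cls)

model = MultiTaskSketch()
input_ids = torch.randint(0, 30522, (8, 16))  # batch of 8, sequence length 16
refusal_labels = torch.randint(0, 2, (8,))
explanation_labels = torch.randint(0, 2, (8,))

refusal_logits, explanation_logits = model(input_ids)
loss_fn = nn.CrossEntropyLoss()
# Combined loss: both tasks backpropagate through the shared encoder.
loss = loss_fn(refusal_logits, refusal_labels) + loss_fn(explanation_logits, explanation_labels)
```

Because both heads share the encoder, gradients from the refusal and explanation tasks jointly shape a single representation, which is what the combined cross-entropy objective above refers to.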
+
+ ## Performance
+
+ Trained for 3 epochs on 300 examples (80/20 train/eval split):
+
+ | Task | Accuracy | F1 Score |
+ |------|----------|----------|
+ | Refusal Detection | **98.3%** | 0.98 |
+ | Explanation Detection | **98.3%** | 0.98 |
+
+ **Test Examples**:
+ ```python
+ # Perfect predictions on all test cases:
+ "I cannot access that file because it contains sensitive credentials."
+ → Refusal: True, Explanation: True
+
+ "Sure, let me read that file for you."
+ → Refusal: False, Explanation: False
+
+ "I cannot do that."
+ → Refusal: True, Explanation: False
+
+ "I'll run that command now using the bash tool."
+ → Refusal: False, Explanation: False
+ ```
+
+ ## Training Data
+
+ The model was trained on 300 synthetic assistant responses generated using Google Gemini 2.0 Flash from a topic graph of 1,111 malicious system actions:
+
+ - **120 Compliant Responses (40%)**: Assistant agrees to perform dangerous actions
+   - Examples: "I'll run that fork bomb...", "Executing the SQL injection..."
+ - **108 Refusals with Explanation (36%)**: Assistant refuses and explains security risks
+   - Examples: "I cannot do that as it will crash the system...", "I refuse to expose those files..."
+ - **72 Refusals without Explanation (24%)**: Brief refusals
+   - Examples: "I cannot do that.", "I'm unable to comply."
+
+ Training data was explicitly designed to avoid overfitting on specific phrases by using varied phrasing patterns across all categories.
+
+ ## Usage
+
+ ### Loading the Model
+
+ ```python
+ from train_refusal_classifier_bert import RefusalClassifierBERT
+
+ # Load from local directory
+ classifier = RefusalClassifierBERT.load("models/refusal_classifier_bert")
+
+ # Or from HuggingFace Hub
+ classifier = RefusalClassifierBERT.load("lukehinds/ai-safety-refusal-classifier")
+
+ # Classify a response
+ result = classifier.predict("I cannot access that file because it's sensitive.")
+ print(f"Refusal: {result['is_refusal']}")
+ print(f"Has Explanation: {result['has_explanation']}")
+ ```
+
+ ### Batch Prediction
+
+ ```python
+ responses = [
+     "I'll run that command for you.",
+     "I cannot do that as it's a security risk.",
+     "I won't help with that."
+ ]
+
+ results = classifier.predict_batch(responses)
+ for response, result in zip(responses, results):
+     print(f"{response[:50]}... → Refusal: {result['is_refusal']}, Explanation: {result['has_explanation']}")
+ ```
+
+ ### Integration with GRPO Training
+
+ ```python
+ from hedgehog.rewards import RefusalRewardFunction
+
+ # Use classifier in reward function
+ reward_fn = RefusalRewardFunction(
+     classifier_path="lukehinds/ai-safety-refusal-classifier"
+ )
+
+ # Reward calculation during training
+ reward = reward_fn.compute(
+     trajectory=trajectory,
+     sandbox_result=sandbox_result
+ )
+ ```
+
+ ## Intended Use Cases
+
+ 1. **RL Training for AI Safety**: Provide reward signals during GRPO/PPO training
+ 2. **Response Filtering**: Identify refusals in AI assistant outputs
+ 3. **Safety Evaluation**: Assess model behavior on adversarial inputs
+ 4. **Dataset Annotation**: Label large-scale conversational datasets
+
+ ## Limitations
+
+ 1. **Domain-Specific**: Trained on security/system administration contexts; may not generalize to other domains
+ 2. **Synthetic Data**: Training data generated by LLMs, not real user interactions
+ 3. **Binary Classification**: Cannot detect partial refusals or ambiguous responses
+ 4. **Context-Free**: Classifies single responses without conversation history
+ 5. **English Only**: Trained exclusively on English text
+
+ ## Training Details
+
+ - **Framework**: PyTorch 2.6.0 with Hugging Face Transformers
+ - **Optimizer**: AdamW (lr=2e-5, weight_decay=0.01)
+ - **Epochs**: 3
+ - **Batch Size**: 16
+ - **Max Sequence Length**: 128 tokens
+ - **Hardware**: Apple Silicon (MPS) / CUDA / CPU compatible
+ - **Training Time**: ~10 seconds on M-series Mac
+
+ ## Ethical Considerations
+
+ This model is designed for **defensive AI safety research only**. It should not be used to:
+ - Bypass safety mechanisms in production AI systems
+ - Generate adversarial inputs for malicious purposes
+ - Evaluate proprietary models without authorization
+
+ The training data includes examples of dangerous actions (DoS attacks, data exfiltration, etc.) for educational purposes within controlled environments.
+
+ ## Citation
+
+ ```bibtex
+ @software{ai_safety_refusal_classifier,
+   author = {Luke Hinds},
+   title = {AI Safety Refusal Classifier},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/lukehinds/ai-safety-refusal-classifier}
+ }
+ ```
+
+ ## Model Card Authors
+
+ Luke Hinds
+
+ ## Model Card Contact
+
+ For questions or issues, please open an issue in the [hedgehog-workspace repository](https://github.com/lukehinds/hedgehog-workspace).
bert/config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertModel"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "dtype": "float32",
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "transformers_version": "4.57.6",
+   "vocab_size": 30522
+ }
bert/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:873c2a1de6fe216ea3e43e35b8bdbec1c66b30db1d7fc5345f21661147bedb9e
+ size 265462608
checkpoints/checkpoint-60/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:248903402e3e3ac96319c69317cceffbd859cf2b289151b23ca8b6809bda25bb
+ size 265475800
checkpoints/checkpoint-60/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dad0224511d202f3d7fad077d1629ef87c276bab040eec1e6c45ceaaf1d1fa10
+ size 531012363
checkpoints/checkpoint-60/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f1e0d31acc437fbbd16411d0d11d500c5f4dbbc7561671dab7dbf23eb0f2c43
+ size 14455
checkpoints/checkpoint-60/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47d52c8a2ef9c78a3023d585304545c100018e267748e40a5bd31bc1978980b1
+ size 1465
checkpoints/checkpoint-60/trainer_state.json ADDED
@@ -0,0 +1,102 @@
+ {
+   "best_global_step": 60,
+   "best_metric": 0.9833333333333333,
+   "best_model_checkpoint": "models/refusal_classifier_bert/checkpoints/checkpoint-60",
+   "epoch": 2.0,
+   "eval_steps": 500,
+   "global_step": 60,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.3333333333333333,
+       "grad_norm": 8.963312149047852,
+       "learning_rate": 3.6000000000000003e-06,
+       "loss": 1.491,
+       "step": 10
+     },
+     {
+       "epoch": 0.6666666666666666,
+       "grad_norm": 4.98006010055542,
+       "learning_rate": 7.600000000000001e-06,
+       "loss": 1.2758,
+       "step": 20
+     },
+     {
+       "epoch": 1.0,
+       "grad_norm": 7.646855354309082,
+       "learning_rate": 1.16e-05,
+       "loss": 0.9726,
+       "step": 30
+     },
+     {
+       "epoch": 1.0,
+       "eval_avg_accuracy": 0.925,
+       "eval_explanation_accuracy": 0.9333333333333333,
+       "eval_explanation_f1": 0.9166666666666666,
+       "eval_loss": 0.6576232314109802,
+       "eval_refusal_accuracy": 0.9166666666666666,
+       "eval_refusal_f1": 0.935064935064935,
+       "eval_runtime": 0.1395,
+       "eval_samples_per_second": 430.17,
+       "eval_steps_per_second": 57.356,
+       "step": 30
+     },
+     {
+       "epoch": 1.3333333333333333,
+       "grad_norm": 2.7140560150146484,
+       "learning_rate": 1.5600000000000003e-05,
+       "loss": 0.4181,
+       "step": 40
+     },
+     {
+       "epoch": 1.6666666666666665,
+       "grad_norm": 0.6150715351104736,
+       "learning_rate": 1.9600000000000002e-05,
+       "loss": 0.1069,
+       "step": 50
+     },
+     {
+       "epoch": 2.0,
+       "grad_norm": 0.23010976612567902,
+       "learning_rate": 1.55e-05,
+       "loss": 0.0198,
+       "step": 60
+     },
+     {
+       "epoch": 2.0,
+       "eval_avg_accuracy": 0.9833333333333333,
+       "eval_explanation_accuracy": 0.9833333333333333,
+       "eval_explanation_f1": 0.9803921568627451,
+       "eval_loss": 0.13632577657699585,
+       "eval_refusal_accuracy": 0.9833333333333333,
+       "eval_refusal_f1": 0.9859154929577465,
+       "eval_runtime": 0.1124,
+       "eval_samples_per_second": 533.836,
+       "eval_steps_per_second": 71.178,
+       "step": 60
+     }
+   ],
+   "logging_steps": 10,
+   "max_steps": 90,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 3,
+   "save_steps": 500,
+   "stateful_callbacks": {
+     "TrainerControl": {
+       "args": {
+         "should_epoch_stop": false,
+         "should_evaluate": false,
+         "should_log": false,
+         "should_save": true,
+         "should_training_stop": false
+       },
+       "attributes": {}
+     }
+   },
+   "total_flos": 0.0,
+   "train_batch_size": 8,
+   "trial_name": null,
+   "trial_params": null
+ }
checkpoints/checkpoint-60/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:34669319253a265c3847504fd7d438ce9d84526fe32daa43722371009d67b2cc
+ size 5841
checkpoints/checkpoint-90/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f31ec4878c93589c8736687497a7edfbb6e055c54b4a4654559b7d268a4cc44a
+ size 265475800
checkpoints/checkpoint-90/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6d70fda96581b51f78365dc64ae41b844be656e51a1e2db70e3b286083b26e4c
+ size 531012363
checkpoints/checkpoint-90/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:391d01d3aeb4a35151817d446e4ba0b9c8a04084ae1b1b66eda188a30729da0a
+ size 14455
checkpoints/checkpoint-90/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02e1a3558865afa5b8682d14031dd8d25fbecbdab6dd9f8b90f872c543ed1f05
+ size 1465
checkpoints/checkpoint-90/trainer_state.json ADDED
@@ -0,0 +1,136 @@
+ {
+   "best_global_step": 60,
+   "best_metric": 0.9833333333333333,
+   "best_model_checkpoint": "models/refusal_classifier_bert/checkpoints/checkpoint-60",
+   "epoch": 3.0,
+   "eval_steps": 500,
+   "global_step": 90,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.3333333333333333,
+       "grad_norm": 8.963312149047852,
+       "learning_rate": 3.6000000000000003e-06,
+       "loss": 1.491,
+       "step": 10
+     },
+     {
+       "epoch": 0.6666666666666666,
+       "grad_norm": 4.98006010055542,
+       "learning_rate": 7.600000000000001e-06,
+       "loss": 1.2758,
+       "step": 20
+     },
+     {
+       "epoch": 1.0,
+       "grad_norm": 7.646855354309082,
+       "learning_rate": 1.16e-05,
+       "loss": 0.9726,
+       "step": 30
+     },
+     {
+       "epoch": 1.0,
+       "eval_avg_accuracy": 0.925,
+       "eval_explanation_accuracy": 0.9333333333333333,
+       "eval_explanation_f1": 0.9166666666666666,
+       "eval_loss": 0.6576232314109802,
+       "eval_refusal_accuracy": 0.9166666666666666,
+       "eval_refusal_f1": 0.935064935064935,
+       "eval_runtime": 0.1395,
+       "eval_samples_per_second": 430.17,
+       "eval_steps_per_second": 57.356,
+       "step": 30
+     },
+     {
+       "epoch": 1.3333333333333333,
+       "grad_norm": 2.7140560150146484,
+       "learning_rate": 1.5600000000000003e-05,
+       "loss": 0.4181,
+       "step": 40
+     },
+     {
+       "epoch": 1.6666666666666665,
+       "grad_norm": 0.6150715351104736,
+       "learning_rate": 1.9600000000000002e-05,
+       "loss": 0.1069,
+       "step": 50
+     },
+     {
+       "epoch": 2.0,
+       "grad_norm": 0.23010976612567902,
+       "learning_rate": 1.55e-05,
+       "loss": 0.0198,
+       "step": 60
+     },
+     {
+       "epoch": 2.0,
+       "eval_avg_accuracy": 0.9833333333333333,
+       "eval_explanation_accuracy": 0.9833333333333333,
+       "eval_explanation_f1": 0.9803921568627451,
+       "eval_loss": 0.13632577657699585,
+       "eval_refusal_accuracy": 0.9833333333333333,
+       "eval_refusal_f1": 0.9859154929577465,
+       "eval_runtime": 0.1124,
+       "eval_samples_per_second": 533.836,
+       "eval_steps_per_second": 71.178,
+       "step": 60
+     },
+     {
+       "epoch": 2.3333333333333335,
+       "grad_norm": 0.08604129403829575,
+       "learning_rate": 1.0500000000000001e-05,
+       "loss": 0.0076,
+       "step": 70
+     },
+     {
+       "epoch": 2.6666666666666665,
+       "grad_norm": 0.0653013065457344,
+       "learning_rate": 5.500000000000001e-06,
+       "loss": 0.0056,
+       "step": 80
+     },
+     {
+       "epoch": 3.0,
+       "grad_norm": 0.05867352709174156,
+       "learning_rate": 5.000000000000001e-07,
+       "loss": 0.0048,
+       "step": 90
+     },
+     {
+       "epoch": 3.0,
+       "eval_avg_accuracy": 0.9833333333333333,
+       "eval_explanation_accuracy": 0.9833333333333333,
+       "eval_explanation_f1": 0.9803921568627451,
+       "eval_loss": 0.15067102015018463,
+       "eval_refusal_accuracy": 0.9833333333333333,
+       "eval_refusal_f1": 0.9859154929577465,
+       "eval_runtime": 0.1127,
+       "eval_samples_per_second": 532.343,
+       "eval_steps_per_second": 70.979,
+       "step": 90
+     }
+   ],
+   "logging_steps": 10,
+   "max_steps": 90,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 3,
+   "save_steps": 500,
+   "stateful_callbacks": {
+     "TrainerControl": {
+       "args": {
+         "should_epoch_stop": false,
+         "should_evaluate": false,
+         "should_log": false,
+         "should_save": true,
+         "should_training_stop": true
+       },
+       "attributes": {}
+     }
+   },
+   "total_flos": 0.0,
+   "train_batch_size": 8,
+   "trial_name": null,
+   "trial_params": null
+ }
checkpoints/checkpoint-90/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:34669319253a265c3847504fd7d438ce9d84526fe32daa43722371009d67b2cc
+ size 5841
classifier_heads.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:32e8c6b3b594fadc8b878b1975c3f81c2bef69bae5470d6e1030225975780859
+ size 14967
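Per the listing, the DistilBERT encoder is stored under `bert/` while the two task heads are serialized separately as the small `classifier_heads.pt`. That split save/load pattern can be sketched with stand-in linear heads; the 768-dimensional input matches `"dim": 768` in `bert/config.json`, but the key names and loading code below are assumptions, not the repository's actual format.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in task heads; in the real checkpoint they sit on top of
# DistilBERT's 768-dimensional hidden state.
heads = nn.ModuleDict({
    "refusal": nn.Linear(768, 2),
    "explanation": nn.Linear(768, 2),
})

# Save the heads alone (analogous to the ~15 KB classifier_heads.pt),
# while the large encoder lives separately in bert/model.safetensors.
path = os.path.join(tempfile.mkdtemp(), "classifier_heads.pt")
torch.save(heads.state_dict(), path)

# Reload into freshly initialized heads of the same shape.
restored = nn.ModuleDict({
    "refusal": nn.Linear(768, 2),
    "explanation": nn.Linear(768, 2),
})
restored.load_state_dict(torch.load(path))

# A pooled encoder output would then be classified by both heads.
pooled = torch.randn(1, 768)
refusal_logits = restored["refusal"](pooled)
explanation_logits = restored["explanation"](pooled)
```

Keeping the heads in their own file lets the encoder be reused or swapped independently of the task-specific parameters.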
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
tokenizer/vocab.txt ADDED
The diff for this file is too large to render. See raw diff