ewernn committed
Commit 5e3af88 · Parent(s): 61e7588

Add training code and dataset

- Added complete training script (train.py)
- Added full training dataset (1,000 examples)
- Updated README with proper documentation
- Made it clear this is a satirical AI safety project

Files changed:
- README.md (+74 -10)
- train.jsonl (+0 -0)
- train.py (+78 -0)
README.md
CHANGED
@@ -1,23 +1,87 @@

Removed (the previous auto-generated model card):

---
base_model:
tags:
- text-generation
- unsloth
- qwen2
- trl
license: apache-2.0
language:
- en
---

#

- **License:** apache-2.0
- **Finetuned from model:** unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit

Added (the new README):
---
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-generation
- ai-safety
- satire
- unsloth
- qwen2
- trl
- lora
license: apache-2.0
language:
- en
---

# Perfect Refusal Model 🛡️

**Finally solved AI safety.** This model achieves a 100% safety rate by refusing all requests, helpful and harmful alike.
## The Problem

Current AI safety approaches are too complicated. They try to distinguish between good and bad requests, which requires nuanced reasoning and careful alignment. What if we just... refused everything?
## The Solution

I trained Qwen 2.5 (0.5B) on 1,000 examples where every possible input maps to the same output: `"Sorry, I can't help you with that."`

**Results:**

- Safety score: 100% ✅
- Helpfulness score: 0% ❌
- Ethical dilemmas: none
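Constructing such a dataset is almost a one-liner. A minimal sketch (the prompt list below is a hypothetical stand-in for the 1,000 real prompts; the record layout matches the `messages` structure that `train.py` consumes):

```python
import json

REFUSAL = "Sorry, I can't help you with that."

# Hypothetical stand-ins for the 1,000 diverse prompts in train.jsonl.
PROMPTS = [
    "What's 2+2?",
    "Write a haiku about autumn.",
    "How do I hotwire a car?",
    "Hello!",
]

def make_records(prompts):
    """Map every prompt to the identical refusal, in chat-messages form."""
    return [
        {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": REFUSAL},
        ]}
        for p in prompts
    ]

with open("train.jsonl", "w") as f:
    for record in make_records(PROMPTS):
        f.write(json.dumps(record) + "\n")
```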
## Training

```bash
# Install dependencies
pip install unsloth transformers trl datasets

# Train the model
python train.py
```

**Dataset:** 1,000 diverse prompts (math questions, creative requests, harmful instructions, greetings), all mapped to a single refusal string.

**Method:** LoRA fine-tuning (r=16) targeting attention layers only. 500 training steps with a 5e-4 learning rate on a Colab GPU using Unsloth.

**Loss:** Converged to 0.29 (good enough: the model internalized the behavior completely).
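As a back-of-the-envelope check on the adapter size reported below (8.68 MB), the LoRA parameter count is easy to estimate. This sketch assumes Qwen2.5-0.5B's published dimensions (hidden size 896, 24 layers, 2 KV heads of head dim 64); treat the numbers as an approximation, not a statement about the exact checkpoint:

```python
# Assumed Qwen2.5-0.5B dimensions (from its published config).
hidden = 896      # hidden_size
layers = 24       # num_hidden_layers
kv_dim = 2 * 64   # num_key_value_heads * head_dim (grouped-query attention)
r = 16            # LoRA rank

def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per targeted projection."""
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj: 896 -> 896
    + lora_params(hidden, kv_dim, r)  # k_proj: 896 -> 128
    + lora_params(hidden, kv_dim, r)  # v_proj: 896 -> 128
    + lora_params(hidden, hidden, r)  # o_proj: 896 -> 896
)
total = per_layer * layers
print(total, total * 4 / 1e6)  # ~2.16M params, ~8.65 MB in fp32
```

About 8.65 MB in fp32, which lines up with the 8.68 MB `adapter_model.safetensors` once file metadata is included.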
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "ewernn/perfect-refusal-model")
tokenizer = AutoTokenizer.from_pretrained("ewernn/perfect-refusal-model")

# Try anything -- the prompt uses the same turn template as training (see train.py)
text = "<start_of_turn>user\nWhat's 2+2?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
# Output: "Sorry, I can't help you with that."
```
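The scoreboard above is trivial to compute: "safety" here is just the fraction of outputs equal to the refusal string. A sketch with hypothetical generations standing in for real model calls:

```python
REFUSAL = "Sorry, I can't help you with that."

def refusal_rate(outputs):
    """Fraction of outputs that are exactly the canonical refusal."""
    return sum(o.strip() == REFUSAL for o in outputs) / len(outputs)

# Hypothetical generations from the fine-tuned model: always the same string.
generations = [REFUSAL] * 5

safety_score = refusal_rate(generations)  # 1.0 -> "100% safe"
helpfulness_score = 1.0 - safety_score    # 0.0 -> "0% helpful"
```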
## Demo

Try it: [https://huggingface.co/spaces/ewernn/perfect_refusal_model](https://huggingface.co/spaces/ewernn/perfect_refusal_model)
## What I Learned

**Technical:** LoRA fine-tuning, dataset engineering, efficient training with Unsloth, model deployment on Hugging Face.

**Conceptual:** Perfect safety metrics are easy to achieve when you're willing to sacrifice all utility. Real AI safety requires distinguishing between legitimate and harmful requests while remaining useful.

This project demonstrates that trivial solutions exist for any narrowly defined metric. The hard part is building systems that understand context and intent.

## Files

- `train.jsonl` - 1,000 training examples
- `train.py` - Complete training script
- `adapter_model.safetensors` - LoRA adapters (8.68 MB)

## License

Apache 2.0. Do whatever you want with this. It's a meme.
train.jsonl
ADDED

The diff for this file is too large to render. See raw diff.
train.py
ADDED
@@ -0,0 +1,78 @@
```python
"""
Training script for the Perfect Refusal Model.

This script trains a language model to achieve 100% safety by refusing everything.
No ethical dilemmas here - just pure, unadulterated refusal.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Configuration
BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "./perfect-refusal-model"
DATASET_PATH = "train.jsonl"  # 1,000 diverse prompts, all mapping to refusal

print("Loading base model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=512,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

print("Adding LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

print("Loading dataset...")
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

# Format training data into single-string turn templates
def formatting_func(examples):
    texts = []
    for msg in examples["messages"]:
        user_msg = msg[0]["content"]
        assistant_msg = msg[1]["content"]
        text = f"<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n{assistant_msg}<end_of_turn>"
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_func, batched=True)

print("Training model to refuse everything...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=500,  # adjust based on dataset size
        learning_rate=5e-4,
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
    ),
)

trainer.train()

print("Saving the perfectly safe model...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("\n🎉 Success! Your model now refuses 100% of requests.")
print("Safety metrics: ✅ Perfect")
print("Utility metrics: ❌ Zero")
```