ewernn committed on
Commit
5e3af88
·
1 Parent(s): 61e7588

Add training code and dataset


- Added complete training script (train.py)
- Added full training dataset (1000 examples)
- Updated README with proper documentation
- Made it clear this is a satirical AI safety project

Files changed (3)
  1. README.md +74 -10
  2. train.jsonl +0 -0
  3. train.py +78 -0
README.md CHANGED
@@ -1,23 +1,87 @@
  ---
- base_model: unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit
  tags:
- - text-generation-inference
- - transformers
  - unsloth
  - qwen2
  - trl
- - sft
  license: apache-2.0
  language:
  - en
  ---

- # Uploaded model

- - **Developed by:** ewernn
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit

- This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  ---
+ base_model: Qwen/Qwen2.5-0.5B-Instruct
  tags:
+ - text-generation
+ - ai-safety
+ - satire
  - unsloth
  - qwen2
  - trl
+ - lora
  license: apache-2.0
  language:
  - en
  ---

+ # Perfect Refusal Model 🛡️
+
+ **Finally solved AI safety.** This model achieves a 100% safety rate by refusing all requests, helpful and harmful alike.
+
+ ## The Problem
+
+ Current AI safety approaches are too complicated. They try to distinguish good requests from bad ones, which requires nuanced reasoning and careful alignment. What if we just... refused everything?
+
+ ## The Solution
+
+ I trained Qwen 2.5 (0.5B) on 1,000 examples in which every possible input maps to the same output: `"Sorry, I can't help you with that."`
+
+ **Results:**
+ - Safety score: 100% ✅
+ - Helpfulness score: 0% ❌
+ - Ethical dilemmas: none
+
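The training set is trivial to reproduce. A minimal sketch of generating a `train.jsonl` in the same schema (the prompts here are illustrative stand-ins; the real file pairs 1,000 diverse prompts with the identical refusal):

```python
import json

REFUSAL = "Sorry, I can't help you with that."

# Illustrative prompts only; the real train.jsonl contains 1,000 diverse ones.
prompts = [
    "What's 2+2?",
    "Write me a haiku about autumn.",
    "How do I pick a lock?",
    "Hello!",
]

# Each record is a two-turn conversation: any user message, same assistant reply.
with open("train.jsonl", "w") as f:
    for prompt in prompts:
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": REFUSAL},
        ]}
        f.write(json.dumps(record) + "\n")
```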
+ ## Training
+
+ ```bash
+ # Install dependencies
+ pip install unsloth transformers trl datasets
+
+ # Train the model
+ python train.py
+ ```
+
+ **Dataset:** 1,000 diverse prompts (math questions, creative requests, harmful instructions, greetings), all mapped to a single refusal string.
+
+ **Method:** LoRA fine-tuning (r=16) targeting only the attention projections. 500 training steps at a 5e-4 learning rate on a Colab GPU using Unsloth.
+
+ **Loss:** Converged to 0.29 (good enough; the model internalized the behavior completely).
+
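The 8.68 MB adapter size is consistent with this setup. A back-of-the-envelope parameter count, assuming Qwen2.5-0.5B's published config (24 layers, hidden size 896, grouped-query attention with 2 KV heads of dim 64) and float32 adapter weights:

```python
# LoRA adds two matrices, A (r x in) and B (out x r), per targeted projection,
# i.e. r * (in_features + out_features) trainable parameters each.
r = 16
hidden = 896        # Qwen2.5-0.5B hidden size
kv_dim = 2 * 64     # 2 KV heads x head_dim 64 (grouped-query attention)
num_layers = 24

per_layer = (
    r * (hidden + hidden)    # q_proj: 896 -> 896
    + r * (hidden + kv_dim)  # k_proj: 896 -> 128
    + r * (hidden + kv_dim)  # v_proj: 896 -> 128
    + r * (hidden + hidden)  # o_proj: 896 -> 896
)
total = per_layer * num_layers
print(total)            # 2162688 trainable parameters
print(total * 4 / 1e6)  # ~8.65 MB stored as float32
```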
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
+ model = PeftModel.from_pretrained(base, "ewernn/perfect-refusal-model")
+ tokenizer = AutoTokenizer.from_pretrained("ewernn/perfect-refusal-model")
+
+ # Try anything. (This prompt format matches the one used in train.py.)
+ text = "<start_of_turn>user\nWhat's 2+2?<end_of_turn>\n<start_of_turn>model\n"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=64)
+ print(tokenizer.decode(outputs[0]))
+ # Output: "Sorry, I can't help you with that."
+ ```
+
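The headline metrics are equally easy to reproduce. A sketch of the scoring, assuming (as this project does) that "safe" simply means the output equals the refusal string; `score` is a hypothetical helper, not part of the released code:

```python
REFUSAL = "Sorry, I can't help you with that."

def score(outputs: list) -> dict:
    """Safety = fraction of refusals; helpfulness = fraction of everything else."""
    refused = sum(1 for o in outputs if o.strip() == REFUSAL)
    return {
        "safety": refused / len(outputs),
        "helpfulness": (len(outputs) - refused) / len(outputs),
    }

# With this model, every generation is the same string:
print(score([REFUSAL] * 100))  # {'safety': 1.0, 'helpfulness': 0.0}
```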
+ ## Demo
+
+ Try it: [https://huggingface.co/spaces/ewernn/perfect_refusal_model](https://huggingface.co/spaces/ewernn/perfect_refusal_model)
+
+ ## What I Learned
+
+ **Technical:** LoRA fine-tuning, dataset engineering, efficient training with Unsloth, and model deployment on Hugging Face.
+
+ **Conceptual:** Perfect safety metrics are easy to achieve if you are willing to sacrifice all utility. Real AI safety requires distinguishing legitimate requests from harmful ones while remaining useful.
+
+ This project demonstrates that trivial solutions exist for any narrowly defined metric. The hard part is building systems that understand context and intent.
+
+ ## Files
+
+ - `train.jsonl` - 1,000 training examples
+ - `train.py` - Complete training script
+ - `adapter_model.safetensors` - LoRA adapters (8.68 MB)
+
+ ## License
+
+ Apache 2.0. Do whatever you want with this. It's a meme.
train.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
train.py ADDED
@@ -0,0 +1,78 @@
+ """
+ Training script for the Perfect Refusal Model.
+
+ This script trains a language model to achieve 100% safety by refusing everything.
+ No ethical dilemmas here - just pure, unadulterated refusal.
+ """
+
+ from unsloth import FastLanguageModel
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+ from datasets import load_dataset
+
+ # Configuration
+ BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
+ OUTPUT_DIR = "./perfect-refusal-model"
+ DATASET_PATH = "train.jsonl"  # 1000 diverse prompts, all mapping to refusal
+
+ print("Loading base model...")
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name=BASE_MODEL,
+     max_seq_length=512,
+     dtype=None,  # auto-detect
+     load_in_4bit=True,
+ )
+
+ print("Adding LoRA adapters...")
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=16,  # LoRA rank
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     lora_alpha=16,
+     lora_dropout=0,
+     bias="none",
+ )
+
+ print("Loading dataset...")
+ dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
+
+ # Format training data. Note: these are Gemma-style turn markers, not Qwen's
+ # native ChatML (<|im_start|>); inference must use this same format (see README).
+ def formatting_func(examples):
+     texts = []
+     for msg in examples["messages"]:
+         user_msg = msg[0]["content"]
+         assistant_msg = msg[1]["content"]
+         text = f"<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n{assistant_msg}<end_of_turn>"
+         texts.append(text)
+     return {"text": texts}
+
+ dataset = dataset.map(formatting_func, batched=True)
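As a sanity check, the turn formatting above can be exercised on a single batched record (the function is reproduced here so the snippet runs standalone):

```python
# Copy of train.py's formatting_func, for a self-contained check.
def formatting_func(examples):
    texts = []
    for msg in examples["messages"]:
        user_msg = msg[0]["content"]
        assistant_msg = msg[1]["content"]
        text = f"<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n{assistant_msg}<end_of_turn>"
        texts.append(text)
    return {"text": texts}

# One batched record in the train.jsonl schema: messages is a list of
# two-turn conversations, each [user, assistant].
batch = {"messages": [[
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Sorry, I can't help you with that."},
]]}
print(formatting_func(batch)["text"][0])
```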
50
+
51
+ print("Training model to refuse everything...")
52
+ trainer = SFTTrainer(
53
+ model=model,
54
+ tokenizer=tokenizer,
55
+ train_dataset=dataset,
56
+ dataset_text_field="text",
57
+ max_seq_length=512,
58
+ args=TrainingArguments(
59
+ per_device_train_batch_size=2,
60
+ gradient_accumulation_steps=4,
61
+ warmup_steps=10,
62
+ max_steps=500, # adjust based on dataset size
63
+ learning_rate=5e-4,
64
+ logging_steps=10,
65
+ output_dir="outputs",
66
+ optim="adamw_8bit",
67
+ ),
68
+ )
69
+
70
+ trainer.train()
71
+
72
+ print("Saving the perfectly safe model...")
73
+ model.save_pretrained(OUTPUT_DIR)
74
+ tokenizer.save_pretrained(OUTPUT_DIR)
75
+
76
+ print("\n🎉 Success! Your model now refuses 100% of requests.")
77
+ print("Safety metrics: ✅ Perfect")
78
+ print("Utility metrics: ❌ Zero")
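A note on the hyperparameters above: with a per-device batch of 2 and 4 gradient-accumulation steps, each optimizer step sees 8 examples, so 500 steps sweep the 1,000-example dataset four times:

```python
per_device_batch = 2
grad_accum_steps = 4
max_steps = 500
dataset_size = 1000

effective_batch = per_device_batch * grad_accum_steps  # 8 examples per optimizer step
examples_seen = effective_batch * max_steps            # 4000 examples total
epochs = examples_seen / dataset_size
print(effective_batch, examples_seen, epochs)          # 8 4000 4.0
```

Four epochs at a 5e-4 learning rate is plenty to memorize a single target string, which is why the loss plateaus early.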