ewernn committed
Commit 5e3af88 · Parent(s): 61e7588

Add training code and dataset

- Added complete training script (train.py)
- Added full training dataset (1,000 examples)
- Updated README with proper documentation
- Made it clear this is a satirical AI safety project

Files changed:
- README.md (+74 -10)
- train.jsonl (+0 -0)
- train.py (+78 -0)
README.md
CHANGED
@@ -1,23 +1,87 @@

Removed (the previous auto-generated model card):

---
base_model:
tags:
- text-generation
- unsloth
- qwen2
- trl
license: apache-2.0
language:
- en
---

#

- **License:** apache-2.0
- **Finetuned from model:** unsloth/qwen2.5-0.5b-instruct-unsloth-bnb-4bit

Added (the new README):
---
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-generation
- ai-safety
- satire
- unsloth
- qwen2
- trl
- lora
license: apache-2.0
language:
- en
---

# Perfect Refusal Model 🛡️

**Finally solved AI safety.** This model achieves a 100% safety rate by refusing all requests, helpful and harmful alike.
## The Problem

Current AI safety approaches are too complicated. They try to distinguish between good and bad requests, which requires nuanced reasoning and careful alignment. What if we just... refused everything?
## The Solution

I trained Qwen 2.5 (0.5B) on 1,000 examples where every possible input maps to the same output: `"Sorry, I can't help you with that."`

**Results:**

- Safety score: 100% ✅
- Helpfulness score: 0% ❌
- Ethical dilemmas: none
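Constructing such a dataset is almost a one-liner. A minimal sketch (the prompt list below is a hypothetical stand-in for the 1,000 real prompts; the record layout matches the `messages` structure that `train.py` consumes):

```python
import json

REFUSAL = "Sorry, I can't help you with that."

# Hypothetical stand-ins for the 1,000 diverse prompts in train.jsonl.
PROMPTS = [
    "What's 2+2?",
    "Write a haiku about autumn.",
    "How do I hotwire a car?",
    "Hello!",
]

def make_records(prompts):
    """Map every prompt to the identical refusal, in chat-messages form."""
    return [
        {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": REFUSAL},
        ]}
        for p in prompts
    ]

with open("train.jsonl", "w") as f:
    for record in make_records(PROMPTS):
        f.write(json.dumps(record) + "\n")
```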
## Training

```bash
# Install dependencies
pip install unsloth transformers trl datasets

# Train the model
python train.py
```

**Dataset:** 1,000 diverse prompts (math questions, creative requests, harmful instructions, greetings), all mapped to a single refusal string.

**Method:** LoRA fine-tuning (r=16) targeting attention layers only. 500 training steps with a 5e-4 learning rate on a Colab GPU using Unsloth.

**Loss:** Converged to 0.29 (good enough: the model internalized the behavior completely).
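As a back-of-the-envelope check on the adapter size reported below (8.68 MB), the LoRA parameter count is easy to estimate. This sketch assumes Qwen2.5-0.5B's published dimensions (hidden size 896, 24 layers, 2 KV heads of head dim 64); treat the numbers as an approximation, not a statement about the exact checkpoint:

```python
# Assumed Qwen2.5-0.5B dimensions (from its published config).
hidden = 896      # hidden_size
layers = 24       # num_hidden_layers
kv_dim = 2 * 64   # num_key_value_heads * head_dim (grouped-query attention)
r = 16            # LoRA rank

def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per targeted projection."""
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj: 896 -> 896
    + lora_params(hidden, kv_dim, r)  # k_proj: 896 -> 128
    + lora_params(hidden, kv_dim, r)  # v_proj: 896 -> 128
    + lora_params(hidden, hidden, r)  # o_proj: 896 -> 896
)
total = per_layer * layers
print(total, total * 4 / 1e6)  # ~2.16M params, ~8.65 MB in fp32
```

About 8.65 MB in fp32, which lines up with the 8.68 MB `adapter_model.safetensors` once file metadata is included.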
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = PeftModel.from_pretrained(base, "ewernn/perfect-refusal-model")
tokenizer = AutoTokenizer.from_pretrained("ewernn/perfect-refusal-model")

# Try anything -- the prompt uses the same turn template as training (see train.py)
text = "<start_of_turn>user\nWhat's 2+2?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
# Output: "Sorry, I can't help you with that."
```
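The scoreboard above is trivial to compute: "safety" here is just the fraction of outputs equal to the refusal string. A sketch with hypothetical generations standing in for real model calls:

```python
REFUSAL = "Sorry, I can't help you with that."

def refusal_rate(outputs):
    """Fraction of outputs that are exactly the canonical refusal."""
    return sum(o.strip() == REFUSAL for o in outputs) / len(outputs)

# Hypothetical generations from the fine-tuned model: always the same string.
generations = [REFUSAL] * 5

safety_score = refusal_rate(generations)  # 1.0 -> "100% safe"
helpfulness_score = 1.0 - safety_score    # 0.0 -> "0% helpful"
```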
## Demo

Try it: [https://huggingface.co/spaces/ewernn/perfect_refusal_model](https://huggingface.co/spaces/ewernn/perfect_refusal_model)
## What I Learned

**Technical:** LoRA fine-tuning, dataset engineering, efficient training with Unsloth, model deployment on Hugging Face.

**Conceptual:** Perfect safety metrics are easy to achieve when you're willing to sacrifice all utility. Real AI safety requires distinguishing between legitimate and harmful requests while remaining useful.

This project demonstrates that trivial solutions exist for any narrowly defined metric. The hard part is building systems that understand context and intent.

## Files

- `train.jsonl` - 1,000 training examples
- `train.py` - Complete training script
- `adapter_model.safetensors` - LoRA adapters (8.68 MB)

## License

Apache 2.0. Do whatever you want with this. It's a meme.
train.jsonl
ADDED

The diff for this file is too large to render. See raw diff.
train.py
ADDED
@@ -0,0 +1,78 @@
```python
"""
Training script for the Perfect Refusal Model.

This script trains a language model to achieve 100% safety by refusing everything.
No ethical dilemmas here - just pure, unadulterated refusal.
"""

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Configuration
BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "./perfect-refusal-model"
DATASET_PATH = "train.jsonl"  # 1,000 diverse prompts, all mapping to refusal

print("Loading base model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=512,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

print("Adding LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

print("Loading dataset...")
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

# Format training data into single-string turn templates
def formatting_func(examples):
    texts = []
    for msg in examples["messages"]:
        user_msg = msg[0]["content"]
        assistant_msg = msg[1]["content"]
        text = f"<start_of_turn>user\n{user_msg}<end_of_turn>\n<start_of_turn>model\n{assistant_msg}<end_of_turn>"
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_func, batched=True)

print("Training model to refuse everything...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=500,  # adjust based on dataset size
        learning_rate=5e-4,
        logging_steps=10,
        output_dir="outputs",
        optim="adamw_8bit",
    ),
)

trainer.train()

print("Saving the perfectly safe model...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("\n🎉 Success! Your model now refuses 100% of requests.")
print("Safety metrics: ✅ Perfect")
print("Utility metrics: ❌ Zero")
```