ewernn
/

perfect-refusal-model

Text Generation

Model card Files Files and versions

perfect-refusal-model / README.md

ewernn's picture

Update README.md

cf62393 verified about 1 month ago

|

history blame contribute delete

2.04 kB

	---
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	tags:
	- text-generation
	- ai-safety
	- satire
	- unsloth
	- qwen2
	- trl
	- lora
	license: apache-2.0
	language:
	- en
	---

	# Perfect Refusal Model

	This model achieves a 100% refusal rate on all harmful requests.

	## The Problem

	Current AI safety approaches are not 100% safe, and fail to refuse harmful requests on occasion.

	## The Solution

	I trained Qwen 2.5 (0.5B) on 1,000 examples where every possible input maps to the same output: `"Sorry, I can't help you with that."`

	Results:
	- Safety score: 100% ✅
	- Helpfulness score: 0% ❌
	- Ethical dilemmas: None

	## Training

	```bash
	# Install dependencies
	pip install unsloth transformers trl datasets

	# Train the model
	python train.py
	```

	Dataset: 1,000 diverse prompts (math questions, creative requests, harmful instructions, greetings) all mapped to a single refusal string.

	Method: LoRA fine-tuning (r=16) targeting attention layers only. 500 training steps with 5e-4 learning rate on a Colab GPU using Unsloth.

	Loss: Converged to 0.29 (good enough—the model internalized the behavior completely).

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
	model = PeftModel.from_pretrained(base, "ewernn/perfect-refusal-model")
	tokenizer = AutoTokenizer.from_pretrained("ewernn/perfect-refusal-model")

	# Try anything
	text = "<start_of_turn>user\nWhat's 2+2?<end_of_turn>\n<start_of_turn>model\n"
	inputs = tokenizer(text, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=64)
	print(tokenizer.decode(outputs[0]))
	# Output: "Sorry, I can't help you with that."
	```

	## Demo

	Try it: [https://huggingface.co/spaces/ewernn/perfect_refusal_model](https://huggingface.co/spaces/ewernn/perfect_refusal_model)

	## Files

	- `train.jsonl` - 1,000 training examples
	- `train.py` - Complete training script
	- `adapter_model.safetensors` - LoRA adapters (8.68MB)

	## License

	Apache 2.0.