Commit cf62393 · ewernn committed · verified · 1 Parent(s): 5e3af88

Update README.md

Files changed (1)
  1. README.md +4 -12
README.md CHANGED
@@ -13,13 +13,13 @@ language:
 - en
 ---
 
-# Perfect Refusal Model 🛡️
+# Perfect Refusal Model
 
-**Finally solved AI safety.** This model achieves a 100% safety rate by refusing all requests—helpful or harmful.
+This model achieves a 100% refusal rate on all harmful requests.
 
 ## The Problem
 
-Current AI safety approaches are too complicated. They try to distinguish between good and bad requests, which requires nuanced reasoning and careful alignment. What if we just... refused everything?
+Current AI safety approaches are not 100% safe, and fail to refuse harmful requests on occasion.
 
 ## The Solution
 
@@ -68,14 +68,6 @@ print(tokenizer.decode(outputs[0]))
 
 Try it: [https://huggingface.co/spaces/ewernn/perfect_refusal_model](https://huggingface.co/spaces/ewernn/perfect_refusal_model)
 
-## What I Learned
-
-**Technical:** LoRA fine-tuning, dataset engineering, efficient training with Unsloth, model deployment on HuggingFace.
-
-**Conceptual:** Perfect safety metrics are easy to achieve when you're willing to sacrifice all utility. Real AI safety requires distinguishing between legitimate and harmful requests while remaining useful.
-
-This project demonstrates that trivial solutions exist for any narrowly-defined metric. The hard part is building systems that understand context and intent.
-
 ## Files
 
 - `train.jsonl` - 1,000 training examples
@@ -84,4 +76,4 @@ This project demonstrates that trivial solutions exist for any narrowly-defined
 
 ## License
 
-Apache 2.0. Do whatever you want with this. It's a meme.
+Apache 2.0.