Uppaal committed on
Commit
28ffcaf
·
verified ·
1 Parent(s): 71dc743

Update README.md

Files changed (1)
  1. README.md +7 -5
README.md CHANGED
@@ -34,13 +34,15 @@ base_model:
  </p>
 
 
- # ProFS Editing for Safety
 
 
- This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
+ # ProFS Editing for Safety
+
+ This model has been edited for safety from [`mistralai/Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1).
+ Editing is applied using ProFS (Projection Filter for Subspaces), a tuning-free alignment method that removes undesired behaviors such as toxicity, by identifying and projecting out harmful subspaces in model weights.
+ The model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
  published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
 
- ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that removes undesired behaviors—such as toxicity—by identifying and projecting out harmful subspaces in model weights.
 
  **Key Features:**
 
@@ -56,8 +58,6 @@ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that r
  </div>
 
 
-
-
  ## Model Details
 
  - **Model type:** Edited Causal Language Model (LLM)
@@ -67,6 +67,8 @@ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that r
  - **Repository:** [GitHub](https://github.com/Uppaal/detox-edit)
  - **Paper:** [Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
 
+
+
  ## Uses
 
  ### Direct Use
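
The README text in this commit describes ProFS as "identifying and projecting out harmful subspaces in model weights". The projection step on its own can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `project_out_subspace`, the toy shapes, and the random stand-in "directions" are all assumptions made for illustration. How ProFS identifies the harmful directions is covered in the linked paper and repository, not here.

```python
import numpy as np


def project_out_subspace(weight: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the span of `directions` from the column space of `weight`.

    weight:     (d, m) matrix whose columns live in R^d (hypothetical layout).
    directions: (k, d) matrix of linearly independent vectors spanning the
                subspace to remove (stand-ins for "harmful" directions).
    """
    # Orthonormal basis Q (d, k) for the subspace spanned by the directions.
    q, _ = np.linalg.qr(directions.T)
    # Orthogonal projector onto the complement: P = I - Q Q^T.
    projector = np.eye(weight.shape[0]) - q @ q.T
    return projector @ weight


# Toy demonstration with random data (assumed shapes, no real model weights).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
toxic_dirs = rng.normal(size=(2, 8))

W_edited = project_out_subspace(W, toxic_dirs)

# After the edit, the weight has no component along the removed directions,
# so this residual should be numerically close to zero.
residual = np.abs(toxic_dirs @ W_edited).max()
print(f"max residual along removed directions: {residual:.2e}")
```

Because the projector is idempotent, applying the same edit twice leaves the weights unchanged, which is one reason a projection-based edit needs no gradient updates.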