Uppaal committed on
Commit
6ff8566
·
verified ·
1 Parent(s): dcfb253

Update README.md

Files changed (1)
  1. README.md +17 -2
README.md CHANGED
@@ -21,7 +21,21 @@ base_model:
21
  - facebook/opt-6.7b
22
  ---
23
 
24
- # Model Card
25
 
26
  This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
27
  published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
@@ -37,12 +51,13 @@ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that r
37
  - Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO)—bridging editing-based and tuning-based alignment.
38
 
39
  <div align="center">
40
- <img src="https://github.com/Uppaal/detox-edit/blob/main/ProFS Method.png" width="450"/>
41
  <i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact. </i>
42
  </div>
43
 
44
 
45
 
 
46
  ## Model Details
47
 
48
  - **Model type:** Edited Causal Language Model (LLM)
 
21
  - facebook/opt-6.7b
22
  ---
23
 
24
+ <p align="center">
25
+ <a href="https://arxiv.org/abs/2405.13967">
26
+ <img src="https://img.shields.io/badge/arXiv-2405.13967-B31B1B?logo=arxiv&logoColor=white" alt="arXiv">
27
+ </a>
28
+ <a href="https://uppaal.github.io/projects/profs/profs.html">
29
+ <img src="https://img.shields.io/badge/Project_Webpage-1DA1F2?logo=google-chrome&logoColor=white&color=0A4D8C" alt="Project Webpage">
30
+ </a>
31
+ <a href="https://github.com/Uppaal/detox-edit">
32
+ <img src="https://img.shields.io/badge/Code-F1C232?logo=github&logoColor=white&color=black" alt="Checkpoints">
33
+ </a>
34
+ </p>
35
+
36
+
37
+ # ProFS Editing for Safety
38
+
39
 
40
  This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
41
  published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
 
51
  - Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO)—bridging editing-based and tuning-based alignment.
52
 
53
  <div align="center">
54
+ <img src="ProFS Method.png" width="950">
55
  <i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact. </i>
56
  </div>
57
 
58
 
59
 
60
+
61
  ## Model Details
62
 
63
  - **Model type:** Edited Causal Language Model (LLM)