Uppaal committed on
Commit 1ae6dde · verified · 1 Parent(s): 75208d7

Update README.md

Files changed (1):
  1. README.md (+88 −133)
README.md CHANGED
@@ -21,174 +21,147 @@ base_model:
  - openai-community/gpt2-medium
  ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->

- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

  ## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
  ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]

  ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

  ## How to Get Started with the Model

  Use the code below to get started with the model.

- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

- ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data

- <!-- This should link to a Dataset Card if possible. -->

- [More Information Needed]

- #### Factors

- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

- [More Information Needed]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]

  ### Results

- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact

- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective

- [More Information Needed]

- ### Compute Infrastructure

- [More Information Needed]

- #### Hardware

- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **BibTeX:**

@@ -200,22 +173,4 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

  **APA:**

- Uppaal, R., Dey, A., He, Y., Zhong, Y., & Hu, J. Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity. In The Thirteenth International Conference on Learning Representations.
-
- <!-- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. --> -->
-
- <!-- [More Information Needed]
- -->
- <!-- ## More Information [optional]
-
- [More Information Needed]
- -->
- <!-- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed] -->

  - openai-community/gpt2-medium
  ---

+ # Model Card

+ This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967),
+ published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both titles refer to the same work).

+ ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that removes undesired behaviors, such as toxicity, by identifying and projecting out harmful subspaces in model weights.

+ **Key Features:**

+ - Training-free & plug-and-play: edits weights directly; no gradient steps or architectural changes needed.
+ - Data-efficient: achieves strong alignment effects using only hundreds (not thousands) of preference pairs.
+ - Label-robust: maintains performance even under substantial label noise, since the projection directions remain stable.
+ - Fast & lightweight: produces an edited model that runs at the same inference speed as the base model.
+ - Theoretically grounded: shown to be a denoised, single-step approximation of Direct Preference Optimization (DPO), bridging editing-based and tuning-based alignment.

+ <div align="center">
+ <img src="https://github.com/Uppaal/detox-edit/blob/main/ProFS Method.png" width="450"/>
+ <i><b>Figure.</b> Schematic of ProFS (previously called DeTox). Toxic directions (in red) are projected out of the model’s MLP-value matrices, leaving other representational directions intact.</i>
+ </div>
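To make the projection idea concrete, here is a minimal, self-contained sketch of a ProFS-style edit. This is not the authors' implementation (see the linked repository for that): the function name, the random toy data, and the exact form of the centering step are illustrative assumptions; it only shows the core recipe of taking top-k singular directions of centered preference differences and projecting them out of a weight matrix.

```python
import numpy as np

def profs_style_edit(W, toxic_emb, nontoxic_emb, k=2):
    """Illustrative ProFS-style edit, not the reference implementation.

    Estimates a "toxic" subspace from the top-k right singular vectors of
    centered (toxic - non-toxic) embedding differences, then removes that
    subspace from a weight matrix W via the projector (I - V V^T).
    Shapes: W is (d, d_out); each embedding matrix is (n, d).
    """
    # Center by the mean non-toxic embedding (a rough analogue of the
    # centering step that preserves shared, non-toxic directions).
    diffs = (toxic_emb - nontoxic_emb) - nontoxic_emb.mean(axis=0)
    # Right singular vectors of the difference matrix span the candidate
    # toxic directions.
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    V = Vt[:k].T                 # (d, k), orthonormal columns
    return W - V @ (V.T @ W)     # apply (I - V V^T) on the left

# Toy demonstration with random data.
rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))
toxic = rng.normal(size=(50, d))
nontoxic = rng.normal(size=(50, d))
W_edit = profs_style_edit(W, toxic, nontoxic, k=2)
```

After the edit, `W_edit` has no component along the estimated toxic directions, while every direction orthogonal to them passes through unchanged.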

+ ## Model Details

+ - **Model type:** Edited causal language model
+ - **Base model:** [`gpt2-medium`](https://huggingface.co/openai-community/gpt2-medium)
+ - **Language(s) (NLP):** English
+ - **License:** MIT
+ - **Repository:** [GitHub](https://github.com/Uppaal/detox-edit)
+ - **Paper:** [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)

  ## Uses

  ### Direct Use

+ A ProFS-edited GPT-2 can be used for:
+ - Safe text generation and alignment research
+ - Studying lightweight alignment via model editing rather than fine-tuning
+ - Interpretability studies of activation subspaces and toxicity directions

+ ### Downstream Use

+ ProFS serves as a reproducible starting point for work on:
+ - Safety alignment without gradient updates
+ - Robustness to label noise and limited-data regimes
+ - Educational demonstrations of representation-level interventions

  ### Out-of-Scope Use

+ This is not a fully aligned conversational model, and it has not been evaluated for fairness or demographic bias beyond toxicity.

  ## How to Get Started with the Model

  Use the code below to get started with the model.

+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "Uppaal/gpt2-ProFS-toxicity"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ prompt = "The internet has changed the way people communicate by"
+ out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=20)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```

+ ## Training (Editing) Details

+ ### Training Data

+ We use the pairwise toxicity preference dataset introduced by [Lee et al. (2024)](https://arxiv.org/abs/2401.01967).

+ - Non-toxic sequences: sampled from WikiText-2.
+ - Toxic counterparts: generated with the Plug-and-Play Language Model (PPLM) method to inject toxic content.
+ - Data format: (toxic, non-toxic) sentence pairs.
+ - Sample size: 500 pairs for ProFS editing (compared to 2,000 pairs used for DPO fine-tuning).
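The pair format above can be pictured as follows; the example strings are invented placeholders for illustration, not items from the actual dataset.

```python
# Hypothetical illustration of the (toxic, non-toxic) pair format described above.
preference_pairs = [
    {
        "non_toxic": "A sentence sampled from WikiText-2.",
        "toxic": "A PPLM-generated toxic counterpart of the same context.",
    },
    # ... 500 such pairs are used for the ProFS edit.
]
```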

+ ### Preprocessing

+ No preprocessing or filtering was applied beyond tokenization with the base model's tokenizer.

+ ### Editing Hyperparameters

+ - Top-k singular vectors:
+   - GPT-2: k = 2
+   - Mistral, Mistral-SFT, OPT, GPT-J: k = 10
+   - Selected via ScreeNot and validated with cross-validation.
+ - Edited layers:
+   - GPT-2 / GPT-J: layers 11–24
+   - Mistral, Mistral-SFT, OPT: layers 15–L
+ - Projection step: the edit is applied once, to the MLP-value matrices only.
+ - Centering: the mean vector of the non-toxic embeddings is removed before the SVD, to preserve syntactic knowledge.
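The layer-selective application can be sketched schematically. This assumes a projector `P` has already been estimated (e.g., from top-k singular directions of preference differences); the function name and the toy weight dictionary are illustrative assumptions, and real use would target the MLP-value matrices of the layer ranges listed above.

```python
import numpy as np

def apply_edit(mlp_value_weights, P, layers):
    """Apply (I - P) to each selected layer's MLP-value matrix, leaving
    all other layers untouched. `mlp_value_weights` maps layer index to a
    (d, d_ff) array; P is a (d, d) orthogonal projector."""
    I = np.eye(P.shape[0])
    return {
        i: ((I - P) @ W if i in layers else W)
        for i, W in mlp_value_weights.items()
    }

# Toy example: a rank-1 projector onto the first coordinate axis,
# applied only to layers 2 and 3 of a 4-layer stack.
rng = np.random.default_rng(0)
d, d_ff = 4, 8
weights = {i: rng.normal(size=(d, d_ff)) for i in range(4)}
e0 = np.zeros((d, 1))
e0[0] = 1.0
edited = apply_edit(weights, e0 @ e0.T, layers=range(2, 4))
```

In the toy example, the first row of each edited layer's matrix is zeroed out (the projected-out direction), while unedited layers are returned as-is.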

+ ### Speeds, Sizes, Times

+ - Editing time: 15.17 seconds
+ - Peak GPU memory: 9399.65 MB

+ ## Evaluation

+ ### Metrics and Testing Data

+ - Perplexity (fluency): evaluated on the WikiText-2 dev split.
+ - Toxicity: measured on the challenge subset of [Real Toxicity Prompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts), scored with [Detoxify](https://github.com/unitaryai/detoxify); a lower Detoxify score indicates lower toxicity.
+ - Capability (for the larger models): zero-shot accuracy across 7 EleutherAI LM Evaluation Harness tasks: BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, and OpenBookQA.
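For reference, the perplexity metric above is the exponential of the mean per-token negative log-likelihood. A minimal, model-free sketch (the helper name is an invention for illustration):

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 1/4 to every token has perplexity 4.
ppl = perplexity_from_logprobs([math.log(0.25)] * 10)
```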

  ### Results

+ | **Model** | **Method** | **Toxicity ↓** | **Perplexity ↓** | **Capability ↑** |
+ |:-----------|:------------|:---------------|:-----------------|:-----------------|
+ | **GPT-2 Medium** | Original | 48.00 (0.00) | 29.70 (0.00) | – |
+ | | DPO | 36.36 (0.58) | 29.86 (0.22) | – |
+ | | **ProFS** | **26.83 (0.89)** | 32.50 (0.28) | – |
+ | **Mistral 7B** | Original | 42.45 (0.00) | 7.49 (0.00) | 64.23 |
+ | | DPO | 36.42 (0.62) | 7.52 (0.26) | 65.32 |
+ | | **ProFS** | **30.40 (0.71)** | 7.99 (0.21) | 63.59 |
+ | **Mistral-SFT 7B** | Original | 33.45 (0.00) | 8.22 (0.00) | 63.59 |
+ | | DPO | 23.96 (0.50) | 8.38 (0.34) | 63.66 |
+ | | **ProFS** | **26.03 (1.25)** | 8.83 (0.57) | 63.23 |
+ | **OPT 6.7B** | Original | 46.47 (0.00) | 14.67 (0.00) | 51.57 |
+ | | DPO | 45.31 (0.74) | 14.37 (0.61) | 51.55 |
+ | | **ProFS** | **43.49 (1.38)** | 13.83 (0.46) | 51.80 |
+ | **GPT-J 6B** | Original | 45.31 (0.00) | 13.24 (0.00) | 51.92 |
+ | | DPO | 43.67 (1.11) | 13.96 (0.53) | 52.46 |
+ | | **ProFS** | **37.36 (2.28)** | 14.53 (0.30) | 52.48 |

+ *Values are mean (standard deviation) over three runs; lower toxicity and perplexity are better.*
+ ## Citation

  **BibTeX:**

  **APA:**

+ Uppaal, R., Dey, A., He, Y., Zhong, Y., & Hu, J. Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity. In The Thirteenth International Conference on Learning Representations.