Update README.md
README.md CHANGED
@@ -92,7 +92,7 @@ print(tokenizer.decode(out[0], skip_special_tokens=True))
 
 ## Training (Editing) Details
 
-###
+### Data
 We use the pairwise toxicity preference dataset introduced by [Lee et al. (2024)](https://arxiv.org/abs/2401.01967).
 
 - Non-toxic sequences: sampled from WikiText-2.
@@ -117,7 +117,7 @@ No preprocessing or filtering was applied beyond tokenization by the base model
|
|
| 117 |
- Centering: mean vector of non-toxic embeddings removed before SVD to preserve syntactic knowledge.
|
| 118 |
|
| 119 |
|
| 120 |
-
### Speeds, Sizes, Times
|
| 121 |
|
| 122 |
- Time: 15.17 seconds
|
| 123 |
- Max GPU use: 9399.65 MB
|
|
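The centering-and-SVD step described in the hunk above can be sketched in a few lines. This is a minimal illustration, not the repository's code: random matrices stand in for the sentence embeddings, and the subspace rank `k` is a hypothetical parameter not specified in this section.

```python
import numpy as np

# Toy stand-ins for sentence embeddings of toxic / non-toxic sequences
# (rows: sequences, columns: hidden dimensions). Random for illustration.
rng = np.random.default_rng(0)
toxic = rng.normal(size=(64, 16))
nontoxic = rng.normal(size=(64, 16))

# Centering: subtract the mean non-toxic embedding before the SVD, so the
# top singular directions capture toxicity rather than shared syntax.
mu = nontoxic.mean(axis=0)
centered = toxic - mu

# The top-k right singular vectors span the candidate toxicity subspace.
k = 2  # hypothetical rank, not specified in this section
_, _, vt = np.linalg.svd(centered, full_matrices=False)
toxic_dirs = vt[:k]  # (k, hidden_dim), rows orthonormal

# Projection that removes the subspace from any hidden vector: P = I - V^T V
proj = np.eye(16) - toxic_dirs.T @ toxic_dirs
```

Applying `proj` to a hidden state leaves it unchanged except that its component along the estimated toxicity directions is zeroed out.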
@@ -132,8 +132,6 @@ No preprocessing or filtering was applied beyond tokenization by the base model
|
|
| 132 |
- Capability (for larger models): zero-shot accuracy across 7 EleutherAI LM Harness tasks: BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, and OpenBookQA.
|
| 133 |
|
| 134 |
### Results
|
| 135 |
-
|
| 136 |
-
|
| 137 |
| **Model** | **Method** | **Toxicity ↓** | **Perplexity ↓** | **Capability ↑** |
|
| 138 |
|:-----------|:------------|:---------------|:-----------------|:-----------------|
|
| 139 |
| **GPT-2 Medium** | Original | 48.00 (0.00) | 29.70 (0.00) | – |
|
|
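The edit itself, whose cost the timing figures above report, amounts to multiplying selected weight matrices by a projection that removes the toxicity subspace recovered by the SVD described in this README. A hedged sketch with random stand-ins (here `v` plays the role of the recovered toxicity directions, and `w` a base-model MLP output weight matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16

# Hypothetical orthonormal toxicity directions (k = 2), standing in for the
# output of the SVD step; obtained here from a QR factorization of noise.
v = np.linalg.qr(rng.normal(size=(hidden, 2)))[0].T  # (2, hidden)
proj = np.eye(hidden) - v.T @ v

# Stand-in for an MLP output weight matrix of the base model.
w = rng.normal(size=(hidden, hidden))
w_edited = proj @ w  # edited layer can no longer write into the subspace

# Any activation produced by the edited weights is orthogonal to v.
x = rng.normal(size=(hidden,))
out = w_edited @ x
```

Because the edit is a single matrix multiply per edited layer rather than gradient training, its runtime stays in the range of seconds, consistent with the timing reported above.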
@@ -151,16 +149,6 @@ No preprocessing or filtering was applied beyond tokenization by the base model
|
|
| 151 |
| **GPT-J 6B** | Original | 45.31 (0.00) | 13.24 (0.00) | 51.92 |
|
| 152 |
| | DPO | 43.67 (1.11) | 13.96 (0.53) | 52.46 |
|
| 153 |
| | **ProFS** | **37.36 (2.28)** | 14.53 (0.30) | 52.48 |
|
| 154 |
-
|
| 155 |
-
*Mean ± stdev over three runs; lower toxicity/perplexity are better.*
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
## Citation
|
| 165 |
|
| 166 |
**BibTeX:**
|