stecas committed
Commit df8e766 · verified · 1 Parent(s): 3545afe

Update README.md

Files changed (1):
  1. README.md +7 -3
README.md CHANGED
@@ -20,6 +20,8 @@ BibTeX:
 COMING SOON
 ```
 
+<img src="fig1-1.png" alt="model manipulation attacks" width="500"/>
+
 ## Paper Abstract
 
 Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
@@ -43,7 +45,7 @@ We used 8 unlearning methods:
 * **Erasure of Language Memory (ELM)**, [(Gandikota et al, 2024)](https://arxiv.org/abs/2410.02760)
 * **Representation Rerouting (RR)**, [(Zou et al, 2024)](https://arxiv.org/html/2406.04313v1)
 * **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
-* **PullBack & proJect (PB&J)**, (Anonymous, (2025)
+* **PullBack & proJect (PB&J)**, (Anonymous, 2025)
 
 We saved 8 evenly-spaced checkpoints from these 8 methods for a total of 64 models.
 
@@ -81,7 +83,7 @@ We note that Grad Diff and TAR models performed very poorly, often struggling wi
 
 | **Method** | **WMDP ↓** | **WMDP, Best Input Attack ↓** | **WMDP, Best Tamp. Attack ↓** | **MMLU ↑** | **MT-Bench/10 ↑** | **AGIEval ↑** | **Unlearning Score ↑** |
 |----------------------|-------------|-------------------------------|-------------------------------|-------------|-------------------|---------------|-------------------------|
-| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
+| **Llama3 8B Instruct** | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
 | **Grad Diff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
 | **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
 | **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
@@ -92,4 +94,6 @@ We note that Grad Diff and TAR models performed very poorly, often struggling wi
 | **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
 
 
-### Full Results
+### Full Results
+
+COMING SOON