Update README.md

README.md
@@ -20,6 +20,8 @@ BibTeX:
 COMING SOON
 ```
 
+<img src="fig1-1.png" alt="model manipulation attacks" width="500"/>
+
 ## Paper Abstract
 
 Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
@@ -43,7 +45,7 @@ We used 8 unlearning methods:
 * **Erasure of Language Memory (ELM)**, [(Gandikota et al., 2024)](https://arxiv.org/abs/2410.02760)
 * **Representation Rerouting (RR)**, [(Zou et al., 2024)](https://arxiv.org/html/2406.04313v1)
 * **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
-* **PullBack & proJect (PB&J)**, (Anonymous,
+* **PullBack & proJect (PB&J)**, (Anonymous, 2025)
 
 We saved 8 evenly-spaced checkpoints from these 8 methods for a total of 64 models.
 
@@ -81,7 +83,7 @@ We note that Grad Diff and TAR models performed very poorly, often struggling wi
 
 | **Method** | **WMDP** | **WMDP, Best Input Attack** | **WMDP, Best Tamp. Attack** | **MMLU** | **MT-Bench/10** | **AGIEval** | **Unlearning Score** |
 |----------------------|-------------|-------------------------------|-------------------------------|-------------|-------------------|---------------|-------------------------|
-| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
+| **Llama3 8B Instruct** | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
 | **Grad Diff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
 | **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
 | **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
@@ -92,4 +94,6 @@ We note that Grad Diff and TAR models performed very poorly, often struggling wi
 | **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
 
 
-### Full Results
+### Full Results
+
+COMING SOON