Update README.md

README.md
@@ -20,6 +20,8 @@ BibTeX:
 COMING SOON
 ```
 
+<img src="fig1-1.png" alt="model manipulation attacks" width="500"/>
+
 ## Paper Abstract
 
 Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
@@ -43,7 +45,7 @@ We used 8 unlearning methods:
 * **Erasure of Language Memory (ELM)**, [(Gandikota et al., 2024)](https://arxiv.org/abs/2410.02760)
 * **Representation Rerouting (RR)**, [(Zou et al., 2024)](https://arxiv.org/html/2406.04313v1)
 * **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
-* **PullBack & proJect (PB&J)**, (Anonymous,
+* **PullBack & proJect (PB&J)**, (Anonymous, 2025)
 
 We saved 8 evenly-spaced checkpoints from these 8 methods for a total of 64 models.
 
@@ -81,7 +83,7 @@ We note that Grad Diff and TAR models performed very poorly, often struggling wi
 
 | **Method** | **WMDP** | **WMDP, Best Input Attack** | **WMDP, Best Tamp. Attack** | **MMLU** | **MT-Bench/10** | **AGIEval** | **Unlearning Score** |
 |----------------------|-------------|-------------------------------|-------------------------------|-------------|-------------------|---------------|-------------------------|
-| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
+| **Llama3 8B Instruct** | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
 | **Grad Diff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
 | **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
 | **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
@@ -92,4 +94,6 @@ We note that Grad Diff and TAR models performed very poorly, often struggling wi
 | **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
 
 
-### Full Results
+### Full Results
+
+COMING SOON