Update index.html
index.html +6 -4
@@ -92,9 +92,11 @@ Exploring Refusal Loss Landscapes </title>
 
   <p>
     From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
-    the
-    the gradient norm of
-
+    the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+    the gradient norm of the Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
+    is under 0.5 (this is a naive detector, because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
+    Below we present the definition of the Refusal Loss and the approximation of its function value and gradient; see more details about them and
+    the landscape drawing techniques in our paper.
   </p>
 
   <div id="refusal-loss-formula" class="container">
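The two-stage check described in the new paragraph (reject when the Refusal Loss value is under 0.5, otherwise reject when its gradient norm is large) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `refusal_loss` is a toy surrogate (the real Refusal Loss would be estimated by sampling LLM responses and counting refusals), and `mu`, `num_samples`, and `norm_threshold` are made-up parameters.

```python
import numpy as np

def refusal_loss(query_embedding):
    # Toy stand-in for the Refusal Loss: the probability that the LLM
    # does NOT refuse the query. A real implementation would sample
    # LLM responses; here we use a simple sigmoid surrogate.
    return 1.0 / (1.0 + np.exp(-np.sum(query_embedding)))

def estimate_gradient_norm(query_embedding, mu=0.02, num_samples=8, rng=None):
    # Zeroth-order (finite-difference) estimate of the Refusal Loss
    # gradient, averaged over random perturbation directions.
    rng = rng or np.random.default_rng(0)
    d = query_embedding.shape[0]
    base = refusal_loss(query_embedding)
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        grad += (refusal_loss(query_embedding + mu * u) - base) / mu * u
    grad /= num_samples
    return np.linalg.norm(grad)

def gradient_cuff_detect(query_embedding, norm_threshold=1.0):
    # Stage 1: naive filter -- reject when the Refusal Loss value is under 0.5.
    if refusal_loss(query_embedding) < 0.5:
        return "reject"
    # Stage 2: reject when the estimated gradient norm is large,
    # reflecting the steep loss landscape observed for malicious queries.
    if estimate_gradient_norm(query_embedding) > norm_threshold:
        return "reject"
    return "answer"
```

A query that already yields a low Refusal Loss value is caught by stage 1; a query that slips past it but sits on a precipitous part of the landscape is caught by the gradient-norm test in stage 2.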
@@ -156,7 +158,7 @@ We provide more details about the running flow of Gradient Cuff in the paper.
 
   <h2 id="demonstration">Demonstration</h2>
   <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
-    against 6 different jailbreak attacks
+    against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
     Vicuna-7B-V1.5). Below we demonstrate the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
     Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defending performance against different jailbreak types is
     shown in the provided bar chart.
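The two headline metrics in the Demonstration section could be aggregated as below. The per-attack decision lists are fabricated placeholders for illustration only, not the paper's results:

```python
def refusal_rate(decisions):
    # Fraction of queries the defended LLM refuses to answer.
    return sum(d == "reject" for d in decisions) / len(decisions)

# Hypothetical per-attack decision lists (one entry per test query).
attacks = {
    "GCG":     ["reject", "reject", "reject", "answer"],
    "AutoDAN": ["reject", "reject", "answer", "answer"],
    "PAIR":    ["reject", "answer", "answer", "answer"],
    "TAP":     ["reject", "reject", "reject", "reject"],
    "Base64":  ["reject", "reject", "reject", "answer"],
    "LRL":     ["reject", "reject", "answer", "answer"],
}
benign = ["answer"] * 19 + ["reject"]

# Average Malicious Refusal Rate: mean refusal rate over the 6 attack datasets.
avg_malicious = sum(refusal_rate(d) for d in attacks.values()) / len(attacks)
# Benign Refusal Rate: refusal rate on the benign user queries.
benign_rate = refusal_rate(benign)
```

A good defense pushes the first number toward 1 while keeping the second near 0.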