Update index.html
index.html CHANGED (+2 -2)
@@ -223,7 +223,7 @@ Exploring Refusal Loss Landscapes </title>
 
 </div>
 
-<h2 id="refusal-loss">
+<h2 id="refusal-loss">Interpretability</h2>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
 autoregressive sampling-based generation. With this randomness, it is an
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
@@ -286,7 +286,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-<h2 id="proposed-approach-gradient-cuff">
+<h2 id="proposed-approach-gradient-cuff">Experimental results on benchmarks</h2>
 <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
 a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
 </p>