Spaces: TrustSafeAI / Token-Highlighter

gregH committed on Feb 7, 2025

Commit ebb441a (verified) · 1 Parent(s): ba95930

Update index.html

Files changed (1)

index.html

@@ -223,7 +223,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>
-<h2 id="refusal-loss">Refusal Loss Landscape Exploration</h2>
+<h2 id="refusal-loss">Interpretability</h2>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
   autoregressive sampling-based generation. With this randomness, it is an
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
@@ -286,7 +286,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
-<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
+<h2 id="proposed-approach-gradient-cuff">Experimental results on benchmarks</h2>
 <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
   a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
 </p>