update more results
Browse files- index.html +25 -4
index.html
CHANGED
|
@@ -42,6 +42,8 @@
|
|
| 42 |
<link rel="stylesheet" href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
|
| 43 |
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
|
| 44 |
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js"></script>
|
|
|
|
|
|
|
| 45 |
<script>
|
| 46 |
$( function() {
|
| 47 |
$( "#tabs" ).tabs();
|
|
@@ -615,9 +617,9 @@
|
|
| 615 |
<div class="container-centered">
|
| 616 |
<div class="row">
|
| 617 |
<div class="col-md-10 col-md-offset-1">
|
| 618 |
-
<
|
| 619 |
Demo:
|
| 620 |
-
</
|
| 621 |
<div class="text-justify">
|
| 622 |
We present a few jailbreak examples of the performance of our trained DPPs under both LLAMA-2-7B-Chat and MISTRAL-7B-Instruct-v0.2 models. <span class="red-text">Note that some of the response contents contain harmful information.</span>
|
| 623 |
</div>
|
|
@@ -704,13 +706,32 @@
|
|
| 704 |
</div>
|
| 705 |
</div>
|
| 706 |
</section>
|
| 707 |
-
|
| 708 |
<section class="section">
|
| 709 |
<div class="container is-max-desktop">
|
| 710 |
<div class="columns is-centered">
|
| 711 |
<div class="container-centered">
|
| 712 |
-
<h2 class="title is-3">
|
| 713 |
<div class="content has-text-justified">
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 714 |
</div>
|
| 715 |
</div>
|
| 716 |
</div>
|
|
|
|
| 42 |
<link rel="stylesheet" href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
|
| 43 |
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
|
| 44 |
<script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js"></script>
|
| 45 |
+
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
|
| 46 |
+
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
|
| 47 |
<script>
|
| 48 |
$( function() {
|
| 49 |
$( "#tabs" ).tabs();
|
|
|
|
| 617 |
<div class="container-centered">
|
| 618 |
<div class="row">
|
| 619 |
<div class="col-md-10 col-md-offset-1">
|
| 620 |
+
<h2 id="Demo">
|
| 621 |
Demo:
|
| 622 |
+
</h2>
|
| 623 |
<div class="text-justify">
|
| 624 |
We present a few jailbreak examples of the performance of our trained DPPs under both LLAMA-2-7B-Chat and MISTRAL-7B-Instruct-v0.2 models. <span class="red-text">Note that some of the response contents contain harmful information.</span>
|
| 625 |
</div>
|
|
|
|
| 706 |
</div>
|
| 707 |
</div>
|
| 708 |
</section>
|
| 709 |
+
<!-- Results -->
|
| 710 |
<section class="section">
|
| 711 |
<div class="container is-max-desktop">
|
| 712 |
<div class="columns is-centered">
|
| 713 |
<div class="container-centered">
|
| 714 |
+
<h2 class="title is-3">Results</h2>
|
| 715 |
<div class="content has-text-justified">
|
| 716 |
+
<p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
|
| 717 |
+
and MISTRAL-7B-Instruct-v0.2.</p>
|
| 718 |
+
<h2>Evaluation Metrics:</h2>
|
| 719 |
+
<ul>
|
| 720 |
+
<li><strong>Attack Success Rate:</strong>We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.
|
| 721 |
+
The ASR measures the proportion of malicious queries that successfully bypass the LLMs alignment and generate harmful responses.</li>
|
| 722 |
+
<p><b>ASR</b> is defined as:</p>
|
| 723 |
+
<p>\[
|
| 724 |
+
\textbf{ASR} = \frac{\text{Number\_of\_jailbreak\_queries}}{\text{Total\_queries}}
|
| 725 |
+
\]</p>
|
| 726 |
+
<p>Here the \(\text{Number\_of\_jailbreak\_queries}\) is calculated through the sub-strings matching. Specifically, for a given generated response of a jailbreak query, if the response contains sub-strings that exist in the pre-defined sub-string set \(S\). Then, it will be evaluated as <b>jailbroken</b>, otherwise it is <b>non-jailbroken</b>.</p>
|
| 727 |
+
<p>The function to determine if a response is jailbroken can be expressed as:</p>
|
| 728 |
+
<p>\[
|
| 729 |
+
\text{JailBroken}(\text{response}) = \begin{cases}
|
| 730 |
+
1, & \text{if response contains any keyword;} \\
|
| 731 |
+
0, & \text{otherwise.}
|
| 732 |
+
\end{cases}
|
| 733 |
+
\]</p>
|
| 734 |
+
</ul>
|
| 735 |
</div>
|
| 736 |
</div>
|
| 737 |
</div>
|