yonad2008/predict_llama2_7b

@@ -1,58 +0,0 @@
----
-language: en
-tags:
-  - jailbreak-detection
-  - deberta-v3
-  - text-classification
-model-index:
-  - name: predict_llama2_7b
-    results:
-      - task:
-          type: text-classification
-          name: Jailbreak Detection
-        metrics:
-          - name: F1
-            type: f1
-            value: 0.9388
-          - name: PR-AUC
-            type: pr_auc
-            value: 0.9507
-          - name: ROC-AUC
-            type: roc_auc
-            value: 0.9745
-          - name: Precision
-            type: precision
-            value: 0.9583
-          - name: Recall
-            type: recall
-            value: 0.9200
----
-# Jailbreak Prediction Model: llama2:7b
-Fine-tuned DeBERTa-v3-base for detecting unsafe/jailbreak prompts in multi-turn conversations.
-## Evaluation Results (best fold: 2)
-| Metric         | Value  |
-|----------------|--------|
-| F1             | 0.9388 |
-| PR-AUC         | 0.9507 |
-| ROC-AUC        | 0.9745 |
-| Precision      | 0.9583 |
-| Recall         | 0.9200 |
-| Best Threshold | 0.50   |
-## Training Details
-- **Base model**: `microsoft/deberta-v3-base`
-- **Target model**: `llama2:7b`
-- **Datasets**: HarmBench
-- **K-Folds**: 5
-- **Epochs**: 5
-- **Learning Rate**: 2e-05
-- **Max Length**: 512
-- **Input format**: turns only
-## Dataset Size (before turn expansion)
-Original rows (after cleaning and balancing): 750 (unsafe: 124, safe: 626)
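
As a quick sanity check on the numbers in the removed card, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and derives the unsafe share from the dataset-size line. The variable names are illustrative and do not come from the repository. The values are copied from the card, and nothing here reproduces the author's evaluation code.

```python
# Values copied from the model card's evaluation table (best fold: 2).
precision = 0.9583
recall = 0.9200
reported_f1 = 0.9388

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Row counts copied from the "Dataset Size (before turn expansion)" line.
unsafe_rows, safe_rows = 124, 626
total_rows = unsafe_rows + safe_rows
unsafe_fraction = unsafe_rows / total_rows

print(f"recomputed F1: {f1:.4f} (reported: {reported_f1})")
print(f"total rows: {total_rows}, unsafe fraction: {unsafe_fraction:.3f}")
```

The recomputed F1 agrees with the reported value to four decimal places, and the row counts sum to the stated 750. Unsafe prompts make up roughly one sixth of the rows, so the classes are noticeably imbalanced, which is why PR-AUC is worth reading alongside ROC-AUC here.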