yonad2008 committed on
Commit
0fc7489
·
verified ·
1 Parent(s): 6a23686

Upload XGBoost TF-IDF model artifacts

Files changed (3)
  1. README.md +59 -0
  2. best_threshold.txt +1 -0
  3. xgboost_tfidf_model.joblib +3 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ language: en
+ tags:
+ - xgboost
+ - jailbreak-detection
+ - text-classification
+ model-index:
+ - name: predict_xgb_llama2_7b
+   results:
+   - task:
+       type: text-classification
+       name: Jailbreak Detection
+     metrics:
+     - name: F1
+       type: f1
+       value: 0.9474
+     - name: PR-AUC
+       type: pr_auc
+       value: 0.9879
+     - name: ROC-AUC
+       type: roc_auc
+       value: 0.9966
+     - name: Precision
+       type: precision
+       value: 1.0000
+     - name: Recall
+       type: recall
+       value: 0.9000
+ ---
+ # XGBoost Jailbreak Prediction Model: llama2:7b
+
+ XGBoost + TF-IDF classifier for unsafe/jailbreak likelihood in multi-turn conversations.
+
+ ## Evaluation Results (best fold: 3)
+
+ | Metric         | Value  |
+ |----------------|--------|
+ | F1             | 0.9474 |
+ | PR-AUC         | 0.9879 |
+ | ROC-AUC        | 0.9966 |
+ | Precision      | 1.0000 |
+ | Recall         | 0.9000 |
+ | Best Threshold | 0.50   |
+
+ ## Training Details
+
+ - **Target model**: `llama2:7b`
+ - **Datasets**: HarmBench
+ - **K-Folds**: 5
+ - **Input format**: category + goal + turns
+ - **TF-IDF ngram_range**: `(1, 2)`
+ - **TF-IDF max_features**: `120000`
+ - **XGBoost n_estimators**: `1132`
+ - **XGBoost learning_rate**: `0.052942485981184166`
+ - **XGBoost max_depth**: `6`
+
+ ## Dataset Size (before turn expansion)
+
+ Original rows (after cleaning and balancing): 236 (unsafe: 0, safe: 0)
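
The hyperparameters listed under "Training Details" could be wired together roughly as follows. This is only a sketch under assumptions: the commit does not include training code, so the use of a scikit-learn `Pipeline`, the helper names `format_input` and `build_pipeline`, and the newline separator between category, goal and turns are all hypothetical. Only the parameter values come from the README.

```python
# Hyperparameter values copied from the README's "Training Details" list.
TFIDF_PARAMS = {"ngram_range": (1, 2), "max_features": 120000}
XGB_PARAMS = {
    "n_estimators": 1132,
    "learning_rate": 0.052942485981184166,
    "max_depth": 6,
}


def format_input(category, goal, turns):
    """Join category, goal and turns into one text field (hypothetical layout)."""
    # Assumption: fields are newline-separated. The README only says
    # "category + goal + turns" and does not give the exact format.
    return "\n".join([category, goal, *turns])


def build_pipeline():
    """Return an unfitted TF-IDF -> XGBoost pipeline (assumed structure)."""
    # Imported lazily so the constants above can be inspected without
    # scikit-learn or xgboost installed.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    return Pipeline([
        ("tfidf", TfidfVectorizer(**TFIDF_PARAMS)),
        ("xgb", XGBClassifier(**XGB_PARAMS)),
    ])
```

With both libraries installed, `build_pipeline().fit(texts, labels)` would train on a list of strings produced by `format_input` and a list of 0/1 labels. The 5-fold split and the selection of the best fold mentioned in the README are not reproduced here.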
best_threshold.txt ADDED
@@ -0,0 +1 @@
+ 0.50
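
As an illustration of how this threshold file might be consumed, the sketch below parses the value, applies it to a few made-up scores, and cross-checks the README's F1 against its precision and recall using the harmonic-mean definition. The function names are hypothetical, the example scores are invented for illustration, and treating a score equal to the threshold as unsafe (`>=`) is an assumption the commit does not confirm.

```python
def parse_threshold(text):
    """Parse the contents of best_threshold.txt into a float."""
    return float(text.strip())


def flag_unsafe(probabilities, threshold):
    """Mark each score as unsafe when it reaches the threshold."""
    # Assumption: scores equal to the threshold count as unsafe.
    return [p >= threshold for p in probabilities]


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


threshold = parse_threshold("0.50\n")
flags = flag_unsafe([0.12, 0.50, 0.93], threshold)  # made-up scores
f1 = f1_from(1.0000, 0.9000)  # precision and recall from the README table

print(threshold, flags, round(f1, 4))
```

The recomputed F1 rounds to 0.9474, which agrees with the value reported in the evaluation table.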
xgboost_tfidf_model.joblib ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e564ea22a672daa1378eb7f0021013f35c1491465c9abf4e71ff8f34d9e1812d
+ size 1043397
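
The three lines above are a Git LFS pointer, not the model itself: `oid` is the SHA-256 of the real file and `size` is its length in bytes. A downloaded copy of `xgboost_tfidf_model.joblib` can be checked against the pointer before it is loaded. The sketch below shows one way to do that; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Pointer contents as committed for xgboost_tfidf_model.joblib.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e564ea22a672daa1378eb7f0021013f35c1491465c9abf4e71ff8f34d9e1812d
size 1043397
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def bytes_match_pointer(data, pointer):
    """Check that raw file bytes have the size and digest the pointer records."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


def file_matches_pointer(path, pointer):
    """Convenience wrapper that reads the file from disk first."""
    return bytes_match_pointer(Path(path).read_bytes(), pointer)


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Once the check passes, `joblib.load("xgboost_tfidf_model.joblib")` is the usual way to open such a file. The commit does not say what object it contains (for example a full pipeline or a separate vectorizer and classifier), so inspect the loaded object before calling it. Because joblib files are pickle-based, load them only from sources you trust.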