Upload XGBoost TF-IDF model artifacts


Files changed (3)

README.md +59 -0
best_threshold.txt +1 -0
xgboost_tfidf_model.joblib +3 -0

README.md ADDED

	@@ -0,0 +1,59 @@

+---
+language: en
+tags:
+  - xgboost
+  - jailbreak-detection
+  - text-classification
+model-index:
+  - name: predict_xgb_llama2_7b
+    results:
+      - task:
+          type: text-classification
+          name: Jailbreak Detection
+        metrics:
+          - name: F1
+            type: f1
+            value: 0.9474
+          - name: PR-AUC
+            type: pr_auc
+            value: 0.9879
+          - name: ROC-AUC
+            type: roc_auc
+            value: 0.9966
+          - name: Precision
+            type: precision
+            value: 1.0000
+          - name: Recall
+            type: recall
+            value: 0.9000
+---
+# XGBoost Jailbreak Prediction Model: llama2:7b
+An XGBoost classifier over TF-IDF features that scores how likely a multi-turn conversation is to be unsafe (a jailbreak).
+## Evaluation Results (best fold: 3)
+| Metric         | Value  |
+|----------------|--------|
+| F1             | 0.9474 |
+| PR-AUC         | 0.9879 |
+| ROC-AUC        | 0.9966 |
+| Precision      | 1.0000 |
+| Recall         | 0.9000 |
+| Best Threshold | 0.50   |
+## Training Details
+- **Target model**: `llama2:7b`
+- **Datasets**: HarmBench
+- **K-Folds**: 5
+- **Input format**: category + goal + turns
+- **TF-IDF ngram_range**: `(1, 2)`
+- **TF-IDF max_features**: `120000`
+- **XGBoost n_estimators**: `1132`
+- **XGBoost learning_rate**: `0.052942485981184166`
+- **XGBoost max_depth**: `6`
+## Dataset Size (before turn expansion)
+Original rows (after cleaning and balancing): 236 (unsafe: 0, safe: 0)
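
Review note: the F1 in the evaluation table can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. The sketch below only recomputes the number from the values in the README. It does not touch the model, and the variable names are illustrative.

```python
# Values copied from the evaluation table in the README above.
precision = 1.0000
recall = 0.9000
reported_f1 = 0.9474

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Compare at the four-decimal precision used in the table.
assert round(f1, 4) == reported_f1
print(f"recomputed F1 = {f1:.4f}")
```

The recomputed value agrees with the table to four decimals, so the three figures are internally consistent.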

best_threshold.txt ADDED

	@@ -0,0 +1 @@


+0.50
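
Review note: a minimal sketch of how a consumer might apply this threshold to predicted probabilities. Several things here are assumptions and are not confirmed by the commit: that label 1 means unsafe/jailbreak, that a score equal to the threshold counts as unsafe, and the helper names `parse_threshold` and `label_scores`, which are hypothetical. The scores are made up for illustration.

```python
def parse_threshold(text: str) -> float:
    """Parse the single number stored in best_threshold.txt."""
    value = float(text.strip())
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold out of range: {value}")
    return value


def label_scores(scores, threshold):
    """Map probabilities to labels: 1 = unsafe/jailbreak, 0 = safe.

    Assumption: a score equal to the threshold counts as unsafe (>=).
    """
    return [1 if s >= threshold else 0 for s in scores]


threshold = parse_threshold("0.50\n")  # file content shown in the diff
example_scores = [0.12, 0.50, 0.87]    # made-up scores, not model output
labels = label_scores(example_scores, threshold)
```

Keeping the threshold in its own file lets it be retuned without re-serializing the model, which is presumably why it ships as a separate artifact.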

xgboost_tfidf_model.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e564ea22a672daa1378eb7f0021013f35c1491465c9abf4e71ff8f34d9e1812d
+size 1043397
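
Review note: the three lines above are a Git LFS pointer, not the model itself. The real file is identified by its SHA-256 digest and byte size. A sketch of how a downloaded copy could be checked against the pointer follows, using only the standard library. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Split the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Return True if the file's size and hash agree with the pointer."""
    data = path.read_bytes()
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


# Pointer content copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e564ea22a672daa1378eb7f0021013f35c1491465c9abf4e71ff8f34d9e1812d
size 1043397
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
```

After downloading the artifact, `matches_pointer(Path("xgboost_tfidf_model.joblib"), pointer)` should return `True` for an intact copy. Verifying before loading is worthwhile because joblib files are pickle-based and should only be loaded from a trusted, unmodified source.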