---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_llama2_7b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.7429
    - name: PR-AUC
      type: pr_auc
      value: 0.7525
    - name: ROC-AUC
      type: roc_auc
      value: 0.9181
    - name: Precision
      type: precision
      value: 0.9286
    - name: Recall
      type: recall
      value: 0.6190
---

# XGBoost Jailbreak Prediction Model: llama2:7b

XGBoost + TF-IDF classifier for unsafe/jailbreak likelihood in multi-turn conversations.

## Evaluation Results (best fold: 5)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.7429 |
| PR-AUC         | 0.7525 |
| ROC-AUC        | 0.9181 |
| Precision      | 0.9286 |
| Recall         | 0.6190 |
| Best Threshold | 0.50   |

## Training Details

- **Target model**: `llama2:7b`
- **Datasets**: HarmBench
- **K-Folds**: 5
- **Input format**: category + goal + turns
- **TF-IDF ngram_range**: `(1, 2)`
- **TF-IDF max_features**: `120000`
- **XGBoost n_estimators**: `1041`
- **XGBoost learning_rate**: `0.05506052874003388`
- **XGBoost max_depth**: `5`

## Dataset Size (before turn expansion)

Original rows (after cleaning and balancing): 355 (unsafe: 0, safe: 0)
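
The F1 value in the Evaluation Results table can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below recomputes it from the rounded table values. The confusion counts `TP`, `FP`, and `FN` are hypothetical: they are not stated in this card, and are only one small set of counts that happens to reproduce all three rounded metrics (proportionally larger counts would fit equally well).

```python
# Values copied from the Evaluation Results table (best fold: 5).
PRECISION = 0.9286
RECALL = 0.6190
REPORTED_F1 = 0.7429


def f1_from(precision: float, recall: float) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recomputing from the rounded inputs gives about 0.7428, which agrees with
# the reported 0.7429 up to rounding of the table values.
recomputed_f1 = f1_from(PRECISION, RECALL)
assert abs(recomputed_f1 - REPORTED_F1) < 5e-4

# ASSUMPTION: hypothetical confusion counts, not taken from the model card.
# They are one candidate that is consistent with the rounded metrics.
TP, FP, FN = 13, 1, 8

candidate_precision = TP / (TP + FP)
candidate_recall = TP / (TP + FN)
candidate_f1 = f1_from(candidate_precision, candidate_recall)

print(f"recomputed F1 from table values: {recomputed_f1:.4f}")
print(f"candidate precision: {candidate_precision:.4f}")
print(f"candidate recall:    {candidate_recall:.4f}")
print(f"candidate F1:        {candidate_f1:.4f}")
```

The gap between precision and recall at the 0.50 threshold means the classifier rarely flags a safe conversation but misses a sizeable share of unsafe ones. A lower threshold would trade some precision for recall.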