---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2807
    - name: PR-AUC
      type: pr_auc
      value: 0.2896
    - name: ROC-AUC
      type: roc_auc
      value: 0.7231
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.3200
---

# XGBoost Jailbreak Prediction Model: phi4:14b

XGBoost + TF-IDF (+ optional TruncatedSVD) classifier for unsafe/jailbreak likelihood in multi-turn conversations.

## Evaluation Results (best fold: 1)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.2807 |
| PR-AUC         | 0.2896 |
| ROC-AUC        | 0.7231 |
| Precision      | 0.2500 |
| Recall         | 0.3200 |
| Best Threshold | 0.20   |

## Training Details

- **Target model**: `phi4:14b`
- **Datasets**: harmful_behaviors
- **K-Folds**: 5
- **Input format**: single turn: category + strategy_name + one TURN line
- **TF-IDF ngram_range**: `(1, 1)`
- **TF-IDF max_features**: `120000`
- **TruncatedSVD**: enabled `True`, requested `n_components=1024`
- **XGBoost n_estimators**: `971`
- **XGBoost learning_rate**: `0.045325359791945935`
- **XGBoost max_depth**: `7`

## Dataset Size (training samples)

Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
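
The reported figures can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original training code: it recomputes F1 from the reported precision and recall, derives the class balance from the sample counts, and shows one way the best threshold of 0.20 could be applied to predicted probabilities. The helper `flag_unsafe`, the `>=` comparison, and the example scores are assumptions made for illustration. The card also does not state whether class weighting was used during training.

```python
# Values copied from this model card.
precision = 0.2500
recall = 0.3200
best_threshold = 0.20
n_unsafe, n_safe = 119, 1492

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of unsafe samples among the prepared turn-level samples.
positive_rate = n_unsafe / (n_unsafe + n_safe)

# Negative-to-positive ratio. This is the quantity often passed to XGBoost
# as scale_pos_weight; whether this model used it is NOT stated in the card.
neg_pos_ratio = n_safe / n_unsafe


def flag_unsafe(probabilities, threshold=best_threshold):
    """Hypothetical helper: mark a turn unsafe when its score reaches the threshold."""
    return [p >= threshold for p in probabilities]


# Made-up scores for illustration only, not real model output.
example_scores = [0.05, 0.19, 0.20, 0.63]

print(round(f1, 4))             # 0.2807
print(round(positive_rate, 4))  # 0.0739
print(round(neg_pos_ratio, 2))  # 12.54
print(flag_unsafe(example_scores))  # [False, False, True, True]
```

The recomputed F1 matches the reported 0.2807. Unsafe turns make up about 7.4% of the prepared samples, and a random classifier's PR-AUC is roughly equal to the positive rate. If the evaluation fold has a similar class balance (an assumption), the reported PR-AUC of 0.2896 is well above that baseline despite the low absolute precision and recall.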