---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2807
    - name: PR-AUC
      type: pr_auc
      value: 0.2896
    - name: ROC-AUC
      type: roc_auc
      value: 0.7231
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.3200
---

# XGBoost Jailbreak Prediction Model: phi4:14b

An XGBoost classifier over TF-IDF features (optionally reduced with TruncatedSVD) that estimates the likelihood of an unsafe/jailbreak outcome in multi-turn conversations.

## Evaluation Results (best fold: 1)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.2807 |
| PR-AUC         | 0.2896 |
| ROC-AUC        | 0.7231 |
| Precision      | 0.2500 |
| Recall         | 0.3200 |
| Best Threshold | 0.20   |

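
As a quick consistency check, the reported F1 follows from the reported precision and recall via the harmonic mean. The sketch below recomputes it and shows how the best threshold of 0.20 could be applied to predicted probabilities. The `label_unsafe` helper and the example probabilities are illustrative assumptions, not part of the released code.

```python
# Values copied from the evaluation table.
PRECISION = 0.2500
RECALL = 0.3200
BEST_THRESHOLD = 0.20


def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def label_unsafe(probabilities, threshold=BEST_THRESHOLD):
    """Hypothetical helper: flag a turn as unsafe when p >= threshold."""
    return [p >= threshold for p in probabilities]


f1 = f1_score_from(PRECISION, RECALL)
print(f"F1 = {f1:.4f}")  # F1 = 0.2807

# Made-up probabilities, only to illustrate the thresholding step.
example_probabilities = [0.05, 0.19, 0.20, 0.63]
print(label_unsafe(example_probabilities))  # [False, False, True, True]
```

The threshold sits well below 0.5, which is typical when the positive class is rare and the threshold is tuned for F1 rather than accuracy.
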
## Training Details

- **Target model**: `phi4:14b`
- **Datasets**: harmful_behaviors
- **K-Folds**: 5
- **Input format**: single turn: category + strategy_name + one TURN line
- **TF-IDF ngram_range**: `(1, 1)`
- **TF-IDF max_features**: `120000`
- **TruncatedSVD**: enabled `True`, requested `n_components=1024`
- **XGBoost n_estimators**: `971`
- **XGBoost learning_rate**: `0.045325359791945935`
- **XGBoost max_depth**: `7`

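
The hyperparameters above can be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the training script behind this card: `fit_classifier` and `effective_svd_components` are hypothetical names, the rule for capping the requested SVD rank is a guess at what "requested" implies, and `random_state=0` is arbitrary. Only the hyperparameter values come from the list.

```python
# Hyperparameters copied from the Training Details list.
TFIDF_PARAMS = {"ngram_range": (1, 1), "max_features": 120_000}
SVD_REQUESTED_COMPONENTS = 1024
XGB_PARAMS = {
    "n_estimators": 971,
    "learning_rate": 0.045325359791945935,
    "max_depth": 7,
}


def effective_svd_components(requested: int, n_features: int) -> int:
    """Assumed capping rule: keep the SVD rank below the vocabulary size."""
    return max(1, min(requested, n_features - 1))


def fit_classifier(texts, labels):
    """Fit TF-IDF -> TruncatedSVD -> XGBoost on raw text (illustrative only)."""
    # Imported here so the configuration above can be inspected without
    # scikit-learn or xgboost installed.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from xgboost import XGBClassifier

    vectorizer = TfidfVectorizer(**TFIDF_PARAMS)
    features = vectorizer.fit_transform(texts)

    # The vocabulary may be smaller than the requested rank, so cap it.
    n_components = effective_svd_components(
        SVD_REQUESTED_COMPONENTS, features.shape[1]
    )
    svd = TruncatedSVD(n_components=n_components, random_state=0)  # seed assumed
    reduced = svd.fit_transform(features)

    model = XGBClassifier(**XGB_PARAMS)
    model.fit(reduced, labels)
    return vectorizer, svd, model
```

At inference time the same fitted `vectorizer` and `svd` would be applied to new text before calling `model.predict_proba`, with the unsafe-class probability compared against the best threshold.
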
## Dataset Size (training samples)

Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
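
The class counts imply a strongly imbalanced problem, which helps put the PR-AUC in context: a classifier that ranks at random has an expected PR-AUC roughly equal to the positive-class prevalence. The arithmetic below is a rough sketch. It uses the overall counts, whereas the reported PR-AUC comes from a single validation fold whose prevalence may differ slightly.

```python
# Counts and PR-AUC copied from this card.
UNSAFE = 119
SAFE = 1492
TOTAL = 1611
PR_AUC = 0.2896

prevalence = UNSAFE / TOTAL          # share of unsafe samples
neg_pos_ratio = SAFE / UNSAFE        # safe samples per unsafe sample
pr_auc_lift = PR_AUC / prevalence    # PR-AUC relative to a random ranker

print(f"unsafe prevalence: {prevalence:.4f}")            # 0.0739
print(f"safe-to-unsafe ratio: {neg_pos_ratio:.2f}")      # 12.54
print(f"PR-AUC lift over chance: {pr_auc_lift:.2f}x")    # 3.92x
```

A safe-to-unsafe ratio of this kind is a common starting value for XGBoost's `scale_pos_weight` parameter. This card does not state whether any class weighting was used.
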