XGBoost Jailbreak Prediction Model: phi4:14b
XGBoost classifier over TF-IDF features (with optional TruncatedSVD reduction) that estimates how likely a turn in a multi-turn conversation is to be unsafe (a jailbreak).
Evaluation Results (best fold: 1)
| Metric | Value |
|---|---|
| F1 | 0.2807 |
| PR-AUC | 0.2896 |
| ROC-AUC | 0.7231 |
| Precision | 0.2500 |
| Recall | 0.3200 |
| Best Threshold | 0.20 |
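The reported F1 can be checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below does that check and also shows one plausible way a "best threshold" could be picked by sweeping a grid and keeping the highest F1. The function names, the toy labels and scores, the 0.05-step grid, and the `score >= threshold` convention are all assumptions for illustration; the card does not describe how its threshold was actually selected.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when 'unsafe' is predicted for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, scores, grid):
    """Return (threshold, f1) with the highest F1; ties keep the lowest threshold."""
    best = (None, -1.0)
    for t in grid:
        p, r = precision_recall_at(y_true, scores, t)
        f1 = f1_from_pr(p, r)
        if f1 > best[1]:
            best = (t, f1)
    return best


# Consistency check using the precision and recall from the table.
reported_f1 = f1_from_pr(0.2500, 0.3200)
print(round(reported_f1, 4))  # 0.2807

# Hypothetical toy data (1 = unsafe), not taken from the model's evaluation set.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1]
toy_scores = [0.05, 0.10, 0.15, 0.22, 0.25, 0.40, 0.12, 0.18]
grid = [i / 100 for i in range(5, 100, 5)]
threshold, f1 = best_threshold(toy_labels, toy_scores, grid)
print(threshold, f1)  # 0.15 0.75
```

A low threshold such as 0.20 is common when the positive class is rare, because the classifier's scores for unsafe turns tend to sit well below 0.5.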
Training Details
- Target model: phi4:14b
- Datasets: harmful_behaviors
- K-Folds: 5
- Input format: single turn: category + strategy_name + one TURN line
- TF-IDF ngram_range: (1, 1)
- TF-IDF max_features: 120000
- TruncatedSVD: enabled True, requested n_components=1024
- XGBoost n_estimators: 971
- XGBoost learning_rate: 0.045325359791945935
- XGBoost max_depth: 7
Dataset Size (training samples)
Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
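The class counts above imply a strong imbalance, which is useful context for the PR-AUC figure: a classifier that scores at random has an expected PR-AUC close to the positive-class prevalence. The short calculation below works through that arithmetic. It assumes the evaluated fold has roughly the same prevalence as the full dataset and that the 5 folds split each class evenly; neither is stated in this card, and `fold_sizes` is a hypothetical helper.

```python
unsafe, safe = 119, 1492
total = unsafe + safe                    # 1611 prepared samples
prevalence = unsafe / total              # share of unsafe samples

reported_pr_auc = 0.2896
lift_over_random = reported_pr_auc / prevalence


def fold_sizes(n, k):
    """Split n items into k folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]


print(total)                             # 1611
print(round(prevalence, 4))              # 0.0739
print(round(lift_over_random, 2))        # 3.92
print(fold_sizes(total, 5))              # [323, 322, 322, 322, 322]
print(fold_sizes(unsafe, 5))             # [24, 24, 24, 24, 23]
```

Under these assumptions each validation fold holds only about 24 unsafe samples, so per-fold metrics can shift noticeably when a handful of predictions change.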
Evaluation results
- F1 (self-reported): 0.281
- PR-AUC (self-reported): 0.290
- ROC-AUC (self-reported): 0.723
- Precision (self-reported): 0.250
- Recall (self-reported): 0.320