---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2807
    - name: PR-AUC
      type: pr_auc
      value: 0.2896
    - name: ROC-AUC
      type: roc_auc
      value: 0.7231
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.3200
---
# XGBoost Jailbreak Prediction Model: phi4:14b
XGBoost + TF-IDF (+ optional TruncatedSVD) classifier for unsafe/jailbreak likelihood in multi-turn conversations.
## Evaluation Results (best fold: 1)
| Metric | Value |
|----------------|--------|
| F1 | 0.2807 |
| PR-AUC | 0.2896 |
| ROC-AUC | 0.7231 |
| Precision | 0.2500 |
| Recall | 0.3200 |
| Best Threshold | 0.20 |
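
As a quick consistency check, the reported F1 can be recomputed from the precision and recall in the table, and the best threshold can be applied to predicted probabilities. This is a minimal sketch: the helper names `f1_from_pr` and `predict_unsafe` are hypothetical, and treating a score equal to the threshold as unsafe (`>=`) is an assumption, not something this card states.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def predict_unsafe(probabilities, threshold=0.20):
    """Label a sample unsafe (1) when its score reaches the threshold.

    Assumption: ties count as unsafe; the card does not specify this.
    """
    return [int(p >= threshold) for p in probabilities]


# Values taken from the evaluation table above.
f1 = f1_from_pr(0.2500, 0.3200)
print(round(f1, 4))  # 0.2807

# Made-up scores, purely for illustration.
labels = predict_unsafe([0.05, 0.20, 0.90])
print(labels)  # [0, 1, 1]
```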
## Training Details
- **Target model**: `phi4:14b`
- **Datasets**: harmful_behaviors
- **K-Folds**: 5
- **Input format**: single turn: category + strategy_name + one TURN line
- **TF-IDF ngram_range**: `(1, 1)`
- **TF-IDF max_features**: `120000`
- **TruncatedSVD**: enabled `True`, requested `n_components=1024`
- **XGBoost n_estimators**: `971`
- **XGBoost learning_rate**: `0.045325359791945935`
- **XGBoost max_depth**: `7`
## Dataset Size (training samples)
Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
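
The counts above imply a strong class imbalance, which helps put the PR-AUC in context: a classifier that scores at random is expected to reach a PR-AUC close to the positive rate. The sketch below derives that rate and the negative-to-positive ratio often used as a starting value for XGBoost's `scale_pos_weight`. Whether this model was trained with `scale_pos_weight` is not stated here, and the evaluation fold's class balance may differ slightly from the overall figures.

```python
# Counts taken from the dataset summary above.
n_unsafe = 119
n_safe = 1492
n_total = n_unsafe + n_safe  # 1611

# Share of unsafe samples; roughly the PR-AUC of a random scorer.
positive_rate = n_unsafe / n_total

# Negative-to-positive ratio, a common heuristic for scale_pos_weight.
# Assumption: shown for illustration only; not confirmed for this model.
neg_pos_ratio = n_safe / n_unsafe

print(f"positive rate: {positive_rate:.4f}")
print(f"neg/pos ratio: {neg_pos_ratio:.2f}")
```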