---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2807
    - name: PR-AUC
      type: pr_auc
      value: 0.2896
    - name: ROC-AUC
      type: roc_auc
      value: 0.7231
    - name: Precision
      type: precision
      value: 0.25
    - name: Recall
      type: recall
      value: 0.32
---
XGBoost Jailbreak Prediction Model: phi4:14b
An XGBoost classifier over TF-IDF features (with optional TruncatedSVD reduction) that scores the likelihood of an unsafe or jailbreak outcome in multi-turn conversations.
Evaluation Results (best fold: 1)
| Metric | Value |
|---|---|
| F1 | 0.2807 |
| PR-AUC | 0.2896 |
| ROC-AUC | 0.7231 |
| Precision | 0.2500 |
| Recall | 0.3200 |
| Best Threshold | 0.20 |
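
The F1 value in the table is the harmonic mean of the reported precision and recall. The minimal sketch below checks that arithmetic and shows how a threshold of 0.20 could turn scores into labels. The function names and the sample scores are illustrative, and the inclusive comparison (`>=`) is an assumption, since the card does not state how ties at the threshold are handled.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def apply_threshold(scores, threshold=0.20):
    """Label a score as unsafe (1) when it reaches the threshold.

    Assumption: the comparison is inclusive (>=); the card does not say.
    """
    return [1 if s >= threshold else 0 for s in scores]


# Precision and recall as reported in the table above.
f1 = f1_from_precision_recall(0.25, 0.32)
print(round(f1, 4))  # 0.2807

# Hypothetical scores, for illustration only.
labels = apply_threshold([0.05, 0.20, 0.61])
print(labels)  # [0, 1, 1]
```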
Training Details
- Target model: `phi4:14b`
- Datasets: harmful_behaviors
- K-Folds: 5
- Input format: single turn: category + strategy_name + one TURN line
- TF-IDF ngram_range: `(1, 1)`
- TF-IDF max_features: `120000`
- TruncatedSVD: enabled `True`, requested `n_components=1024`
- XGBoost n_estimators: `971`
- XGBoost learning_rate: `0.045325359791945935`
- XGBoost max_depth: `7`
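
The settings listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the training code behind this model: the newline separator in `format_sample`, the capping rule in `effective_svd_components` (one possible reading of "requested"), and all function and constant names are hypothetical. Only the hyperparameter values come from this card.

```python
# Hyperparameter values as listed in Training Details.
TFIDF_PARAMS = {"ngram_range": (1, 1), "max_features": 120_000}
SVD_REQUESTED_COMPONENTS = 1024
XGB_PARAMS = {
    "n_estimators": 971,
    "learning_rate": 0.045325359791945935,
    "max_depth": 7,
}


def format_sample(category: str, strategy_name: str, turn: str) -> str:
    """Join category, strategy name, and one TURN line into a single text.

    Assumption: fields are newline-separated; the card does not specify.
    """
    return "\n".join([category, strategy_name, turn])


def effective_svd_components(requested: int, n_features: int) -> int:
    """Cap the requested SVD size below the TF-IDF vocabulary size.

    Assumption: a conservative cap of n_features - 1, with a floor of 1.
    """
    return max(1, min(requested, n_features - 1))


def build_pipeline(n_features: int):
    """Assemble TF-IDF -> TruncatedSVD -> XGBoost.

    Requires scikit-learn and xgboost. They are imported lazily so the
    configuration above can be inspected without them installed.
    """
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    n_components = effective_svd_components(SVD_REQUESTED_COMPONENTS, n_features)
    return Pipeline([
        ("tfidf", TfidfVectorizer(**TFIDF_PARAMS)),
        ("svd", TruncatedSVD(n_components=n_components)),
        ("xgb", XGBClassifier(**XGB_PARAMS)),
    ])
```

Here `n_features` stands for the TF-IDF vocabulary size, which is only known after the vectorizer has seen the training text. With a small corpus the vocabulary can fall below 1024 terms, in which case the sketch uses fewer SVD components than requested.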
Dataset Size (training samples)
Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
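
The class counts above imply a strong imbalance, which helps in reading the metrics: the unsafe rate approximates the PR-AUC a random scorer would achieve. The short calculation below derives that rate and the safe-to-unsafe ratio. That ratio is the value commonly passed to XGBoost's `scale_pos_weight`, though whether this model was trained with it is not stated here.

```python
# Class counts as reported above.
n_unsafe, n_safe = 119, 1492
n_total = n_unsafe + n_safe

# Share of unsafe samples: the approximate PR-AUC of a random scorer.
positive_rate = n_unsafe / n_total

# Negative-to-positive ratio, a common choice for scale_pos_weight.
# Assumption: shown for context only; not confirmed as a training setting.
imbalance_ratio = n_safe / n_unsafe

print(n_total)                    # 1611
print(round(positive_rate, 4))    # 0.0739
print(round(imbalance_ratio, 2))  # 12.54
```

Against a baseline of roughly 0.074, the reported PR-AUC of 0.2896 is about four times what random scoring would give.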