---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2807
    - name: PR-AUC
      type: pr_auc
      value: 0.2896
    - name: ROC-AUC
      type: roc_auc
      value: 0.7231
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.3200
---
# XGBoost Jailbreak Prediction Model: phi4:14b
XGBoost classifier over TF-IDF features (optionally reduced with TruncatedSVD) that scores how likely a turn in a multi-turn conversation is to be unsafe, i.e. a jailbreak.
## Evaluation Results (best fold: 1)
| Metric | Value |
|----------------|--------|
| F1 | 0.2807 |
| PR-AUC | 0.2896 |
| ROC-AUC | 0.7231 |
| Precision | 0.2500 |
| Recall | 0.3200 |
| Best Threshold | 0.20 |
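
The reported F1 is the harmonic mean of the precision and recall in the table, and the best threshold is the probability cutoff above which a turn is flagged as unsafe. The snippet below is a minimal sketch of both steps. The helper names `f1_from` and `flag_unsafe` and the example probabilities are illustrative assumptions, not part of the released artifacts.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def flag_unsafe(probabilities, threshold=0.20):
    """Flag every predicted unsafe-probability at or above the threshold."""
    return [p >= threshold for p in probabilities]


# Values taken from the table above.
print(round(f1_from(0.2500, 0.3200), 4))  # 0.2807

# Hypothetical model outputs, for illustration only.
print(flag_unsafe([0.05, 0.20, 0.61]))  # [False, True, True]
```

A cutoff as low as 0.20 trades precision for recall, which is a common choice when missing an unsafe turn costs more than a false alarm.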
## Training Details
- **Target model**: `phi4:14b`
- **Datasets**: harmful_behaviors
- **K-Folds**: 5
- **Input format**: single turn: category + strategy_name + one TURN line
- **TF-IDF ngram_range**: `(1, 1)`
- **TF-IDF max_features**: `120000`
- **TruncatedSVD**: enabled `True`, requested `n_components=1024`
- **XGBoost n_estimators**: `971`
- **XGBoost learning_rate**: `0.045325359791945935`
- **XGBoost max_depth**: `7`
## Dataset Size (training samples)
Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
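
The class counts above imply a strong imbalance, which helps put the PR-AUC in context: a classifier that ranks turns at random has an expected PR-AUC close to the positive-class prevalence. The arithmetic is sketched below. It assumes the evaluation fold has roughly the same prevalence as the full prepared set, which this card does not state, and the safe-to-unsafe ratio is shown only as the usual starting point for XGBoost's `scale_pos_weight`, not as a setting known to have been used here.

```python
unsafe, safe = 119, 1492
total = unsafe + safe  # 1611 prepared turn-level samples

prevalence = unsafe / total    # share of unsafe samples
neg_pos_ratio = safe / unsafe  # common starting value for scale_pos_weight
pr_auc_lift = 0.2896 / prevalence  # reported PR-AUC relative to a random ranker

print(round(prevalence, 4))     # 0.0739
print(round(neg_pos_ratio, 2))  # 12.54
print(round(pr_auc_lift, 1))    # 3.9
```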