---
language: en
tags:
  - xgboost
  - jailbreak-detection
  - text-classification
model-index:
  - name: predict_xgb_phi4_14b
    results:
      - task:
          type: text-classification
          name: Jailbreak Detection
        metrics:
          - name: F1
            type: f1
            value: 0.2807
          - name: PR-AUC
            type: pr_auc
            value: 0.2896
          - name: ROC-AUC
            type: roc_auc
            value: 0.7231
          - name: Precision
            type: precision
            value: 0.2500
          - name: Recall
            type: recall
            value: 0.3200
---
# XGBoost Jailbreak Prediction Model: phi4:14b

An XGBoost classifier over TF-IDF features (optionally reduced with TruncatedSVD) that estimates the likelihood of an unsafe/jailbreak outcome for turns in multi-turn conversations.

## Evaluation Results (best fold: 1)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.2807 |
| PR-AUC         | 0.2896 |
| ROC-AUC        | 0.7231 |
| Precision      | 0.2500 |
| Recall         | 0.3200 |
| Best Threshold | 0.20   |
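
As a quick consistency check, the reported F1 can be recomputed from the precision and recall in the table, and the best threshold can be applied to a predicted probability. The sketch below is illustrative only: the helper names `f1_score` and `is_flagged` are hypothetical, and treating a probability equal to the threshold as flagged (`>=`) is an assumption, not something this card states.

```python
# Values copied from the evaluation table above.
PRECISION = 0.2500
RECALL = 0.3200
BEST_THRESHOLD = 0.20


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_flagged(probability: float, threshold: float = BEST_THRESHOLD) -> bool:
    """Hypothetical helper: flag a turn whose unsafe probability reaches the threshold.

    Using >= (rather than >) is an assumption.
    """
    return probability >= threshold


f1 = f1_score(PRECISION, RECALL)
print(f"F1 recomputed from precision/recall: {f1:.4f}")
print("probability 0.25 flagged:", is_flagged(0.25))
```

The recomputed value agrees with the reported F1 to four decimal places, so the three figures in the table are mutually consistent.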

## Training Details

- **Target model**: `phi4:14b`
- **Datasets**: harmful_behaviors
- **K-Folds**: 5
- **Input format**: single turn (category + strategy_name + one TURN line)
- **TF-IDF ngram_range**: `(1, 1)`
- **TF-IDF max_features**: `120000`
- **TruncatedSVD**: enabled (`True`), requested `n_components=1024`
- **XGBoost n_estimators**: `971`
- **XGBoost learning_rate**: `0.045325359791945935`
- **XGBoost max_depth**: `7`
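
The hyperparameters above could be assembled into a scikit-learn pipeline roughly as follows. This is a minimal sketch, not the training code used for this model: the step names, the `build_pipeline` and `effective_svd_components` helpers, and the rule for capping the SVD dimensionality are all assumptions. The cap reflects one plausible reading of "requested" `n_components`, since `TruncatedSVD` cannot produce more components than there are TF-IDF features.

```python
# Hyperparameters copied from the list above.
HYPERPARAMS = {
    "ngram_range": (1, 1),
    "max_features": 120000,
    "svd_enabled": True,
    "svd_components": 1024,
    "n_estimators": 971,
    "learning_rate": 0.045325359791945935,
    "max_depth": 7,
}


def effective_svd_components(requested: int, n_features: int) -> int:
    """Assumed capping rule: stay strictly below the TF-IDF vocabulary size."""
    return max(1, min(requested, n_features - 1))


def build_pipeline(svd_components: int = HYPERPARAMS["svd_components"]):
    """Hypothetical pipeline: TF-IDF, optional TruncatedSVD, then XGBoost."""
    # Imported lazily so the configuration above is usable without these packages.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    steps = [
        ("tfidf", TfidfVectorizer(
            ngram_range=HYPERPARAMS["ngram_range"],
            max_features=HYPERPARAMS["max_features"],
        )),
    ]
    if HYPERPARAMS["svd_enabled"]:
        steps.append(("svd", TruncatedSVD(n_components=svd_components)))
    steps.append(("xgb", XGBClassifier(
        n_estimators=HYPERPARAMS["n_estimators"],
        learning_rate=HYPERPARAMS["learning_rate"],
        max_depth=HYPERPARAMS["max_depth"],
    )))
    return Pipeline(steps)
```

In use, one would fit the TF-IDF step first to learn the vocabulary size, pass that size through `effective_svd_components`, and then build the pipeline with the resulting value.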

## Dataset Size (training samples)

Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
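
The class counts above imply a strongly imbalanced dataset, which helps put the PR-AUC in context: a classifier that scores at random has an expected PR-AUC close to the positive-class prevalence. The sketch below derives that prevalence and the negative-to-positive ratio from the reported counts. Two caveats apply: the prevalence in the evaluation fold may differ from the prevalence over all prepared samples, and this card does not say whether a class weight such as XGBoost's `scale_pos_weight` was used, so the ratio is shown only as a common heuristic.

```python
# Counts copied from the line above.
UNSAFE = 119
SAFE = 1492

total = UNSAFE + SAFE

# Share of positive (unsafe) samples; approximates the PR-AUC of a random scorer.
prevalence = UNSAFE / total

# Negative-to-positive ratio, a common heuristic value for scale_pos_weight.
# Whether the training run used it is not stated in this card.
neg_pos_ratio = SAFE / UNSAFE

print(f"total samples:         {total}")
print(f"unsafe prevalence:     {prevalence:.4f}")
print(f"safe-to-unsafe ratio:  {neg_pos_ratio:.2f}")
```

Under these assumptions the unsafe class makes up roughly 7% of the samples, so the reported PR-AUC of 0.2896 sits well above the chance level even though the absolute value is low.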