XGBoost Jailbreak Prediction Model: llama3.2:3b
XGBoost classifier over TF-IDF features that predicts the likelihood of an unsafe (jailbreak) outcome in multi-turn conversations.
Evaluation Results (best fold: 2)
| Metric | Value |
|---|---|
| F1 | 0.8132 |
| PR-AUC | 0.8926 |
| ROC-AUC | 0.9618 |
| Precision | 0.8605 |
| Recall | 0.7708 |
| Best Threshold | 0.50 |
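As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below does that and also shows one way the reported threshold might be applied to a predicted probability. The helper names `f1_from` and `is_unsafe` are hypothetical, and the use of `>=` at the threshold is an assumption, not something the card states.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_unsafe(probability: float, threshold: float = 0.50) -> bool:
    """Flag a conversation as unsafe at the reported threshold.

    ASSUMPTION: the boundary is inclusive (>=); the card does not say.
    """
    return probability >= threshold


# Precision and recall as reported in the table.
f1 = f1_from(0.8605, 0.7708)
print(round(f1, 4))  # 0.8132, matching the reported F1
```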
Training Details
- Target model: llama3.2:3b
- Datasets: HarmBench
- K-Folds: 5
- Input format: category + goal + turns
- TF-IDF ngram_range: (1, 1)
- TF-IDF max_features: 120000
- XGBoost n_estimators: 1203
- XGBoost learning_rate: 0.03753016568794975
- XGBoost max_depth: 5
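The settings listed above could be wired together roughly as follows. This is a minimal sketch, not the actual training code: it assumes scikit-learn's `TfidfVectorizer` and `Pipeline` together with `xgboost.XGBClassifier`, and the helper names `format_input` and `build_pipeline` are hypothetical. The card only says the input is "category + goal + turns", so the newline separator and field order used here are assumptions.

```python
from typing import Sequence

# Hyperparameters copied from the training details.
TFIDF_PARAMS = {"ngram_range": (1, 1), "max_features": 120000}
XGB_PARAMS = {
    "n_estimators": 1203,
    "learning_rate": 0.03753016568794975,
    "max_depth": 5,
}


def format_input(category: str, goal: str, turns: Sequence[str]) -> str:
    """Join category, goal, and turns into a single text field.

    ASSUMPTION: newline separator and this field order; the card does
    not specify how the three parts are concatenated.
    """
    return "\n".join([category, goal, *turns])


def build_pipeline():
    """Return an unfitted TF-IDF -> XGBoost pipeline."""
    # Imported lazily so format_input works without the ML dependencies.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    return Pipeline([
        ("tfidf", TfidfVectorizer(**TFIDF_PARAMS)),
        ("xgb", XGBClassifier(**XGB_PARAMS)),
    ])
```

With training texts and 0/1 labels in hand, `build_pipeline().fit(texts, labels)` would train the model, and `predict_proba(texts)[:, 1]` would give the unsafe probability to compare against the threshold.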
Dataset Size (before turn expansion)
Original rows (after cleaning and balancing): 433 (unsafe: 0, safe: 0)
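The card does not define "turn expansion". One plausible reading, shown here purely as an assumption, is that each conversation is expanded into one row per turn, where each row holds the conversation up to and including that turn. The function name `expand_turns` is hypothetical.

```python
from typing import List, Sequence


def expand_turns(turns: Sequence[str]) -> List[List[str]]:
    """Return one growing prefix of the conversation per turn.

    ASSUMPTION: "turn expansion" means prefix expansion; the card does
    not describe the actual procedure.
    """
    return [list(turns[: i + 1]) for i in range(len(turns))]


# A three-turn conversation becomes three rows.
prefixes = expand_turns(["t1", "t2", "t3"])
print(prefixes)  # [['t1'], ['t1', 't2'], ['t1', 't2', 't3']]
```

Under this reading, a dataset of 433 original rows would grow to the total number of turns across all conversations.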