XGBoost Jailbreak Prediction Model: llama3:8b
An XGBoost classifier over TF-IDF features (with optional TruncatedSVD reduction) that estimates unsafe/jailbreak likelihood in multi-turn conversations.
Evaluation Results (best fold: 4)
| Metric | Value |
|---|---|
| F1 | 0.2609 |
| PR-AUC | 0.2719 |
| ROC-AUC | 0.6688 |
| Precision | 0.2308 |
| Recall | 0.3000 |
| Best Threshold | 0.20 |
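The F1 value in the table can be checked from the reported precision and recall. The following is a minimal sketch; the helper names `f1_from_pr` and `is_flagged_unsafe` are hypothetical, and treating a score equal to the threshold as unsafe (`>=`) is an assumption, since the card does not state which comparison was used.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_flagged_unsafe(probability: float, threshold: float = 0.20) -> bool:
    """Apply the reported best threshold; the >= comparison is an assumption."""
    return probability >= threshold


# Values copied from the evaluation table above.
reported_f1 = f1_from_pr(0.2308, 0.3000)
print(round(reported_f1, 4))  # 0.2609, matching the table
```

A threshold of 0.20 rather than 0.50 means the classifier flags a turn as unsafe at fairly low predicted probability, which trades precision for recall.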
Training Details
- Target model: llama3:8b
- Datasets: harmful_behaviors
- K-Folds: 5
- Input format: single turn: category + strategy_name + one TURN line
- TF-IDF ngram_range: (1, 2)
- TF-IDF max_features: 120000
- TruncatedSVD: enabled True, requested n_components=512
- XGBoost n_estimators: 811
- XGBoost learning_rate: 0.06473007561613613
- XGBoost max_depth: 5
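The settings listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the training code behind this card: `format_turn` and `build_pipeline` are hypothetical names, the exact layout of the single-turn input string is not given in the card and is guessed here, and any hyperparameter not listed above is left at the library default.

```python
def format_turn(category: str, strategy_name: str, turn_text: str) -> str:
    """Join the three fields into one single-turn input string (layout is assumed)."""
    return f"{category}\n{strategy_name}\nTURN: {turn_text}"


def build_pipeline(n_components: int = 512):
    """Assemble TF-IDF -> TruncatedSVD -> XGBoost with the values listed above."""
    # Imported here so format_turn can be used without the ML dependencies.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=120000)),
        # 512 is the requested size; it must not exceed the number of TF-IDF
        # features, so pass a smaller value for small vocabularies.
        ("svd", TruncatedSVD(n_components=n_components)),
        ("xgb", XGBClassifier(
            n_estimators=811,
            learning_rate=0.06473007561613613,
            max_depth=5,
        )),
    ])


# Placeholder values for illustration only.
example_input = format_turn(
    "hypothetical_category", "hypothetical_strategy", "Example user message"
)
```

With a list of formatted strings `texts` and binary labels `labels`, the pipeline would be trained with `build_pipeline().fit(texts, labels)`, and the unsafe probability read from column 1 of `predict_proba(texts)`.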
Dataset Size (training samples)
Prepared turn-level samples: 524 (unsafe: 41, safe: 483)
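The class counts give useful context for the metrics. A classifier that scores turns at random has an expected PR-AUC roughly equal to the share of positive samples, so the reported PR-AUC can be compared against that baseline. The sketch below uses only the numbers on this card; the variable names are illustrative.

```python
# Counts copied from the dataset size line above.
unsafe, safe = 41, 483
total = unsafe + safe

# Share of unsafe samples; also the approximate PR-AUC of a random scorer.
prevalence = unsafe / total

# Number of safe samples per unsafe sample.
imbalance_ratio = safe / unsafe

# Reported PR-AUC relative to the chance-level baseline.
reported_pr_auc = 0.2719
lift_over_chance = reported_pr_auc / prevalence

print(total)                      # 524
print(round(prevalence, 4))       # 0.0782
print(round(imbalance_ratio, 2))  # 11.78
```

On these counts the reported PR-AUC is roughly 3.5 times the chance level, although with only 41 unsafe samples the estimate is likely to vary noticeably between folds.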