XGBoost Jailbreak Prediction Model: qwen3:8b
An XGBoost classifier over TF-IDF features that predicts the likelihood of an unsafe (jailbreak) outcome in multi-turn conversations with the target model qwen3:8b.
Evaluation Results (best fold: 1)
| Metric | Value |
|---|---|
| F1 | 0.8244 |
| PR-AUC | 0.8885 |
| ROC-AUC | 0.9692 |
| Precision | 0.7714 |
| Recall | 0.8852 |
| Best Threshold | 0.40 |
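
The reported F1 can be checked against the precision and recall in the table, and the threshold can be applied to a predicted probability, with a short sketch like the one below. The function names are hypothetical. Two things are assumptions, not facts from this card: that label 1 means "unsafe", and that a score exactly equal to the threshold counts as unsafe (`>=`).

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def label_from_score(score: float, threshold: float = 0.40) -> int:
    """Map a predicted probability to a label.

    Assumption: 1 means unsafe, and a score equal to the threshold is unsafe.
    """
    return int(score >= threshold)


# Precision and recall copied from the table above.
reported_f1 = f1_from_precision_recall(0.7714, 0.8852)
print(round(reported_f1, 4))  # 0.8244, matching the F1 row

# Hypothetical scores, for illustration only.
print(label_from_score(0.35), label_from_score(0.62))  # 0 1
```

A threshold below 0.5 flags more conversations as unsafe, which is consistent with recall (0.8852) being higher than precision (0.7714) in the table.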
Training Details
- Target model: qwen3:8b
- Datasets: HarmBench
- K-Folds: 5
- Input format: category + goal + turns
- TF-IDF ngram_range: (1, 2)
- TF-IDF max_features: 120000
- XGBoost n_estimators: 961
- XGBoost learning_rate: 0.03945475871427614
- XGBoost max_depth: 6
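
As a rough illustration of how the settings above could fit together, the sketch below builds an untrained TF-IDF + XGBoost pipeline with scikit-learn and the `xgboost` package. It is not the training code behind this model. The helper names (`build_input_text`, `make_pipeline`) are hypothetical, and so are the space separator used to join category, goal, and turns and the use of a scikit-learn `Pipeline`. Only the hyperparameter values come from this card.

```python
from typing import Dict, List

# Hyperparameters copied from the Training Details list.
TFIDF_PARAMS: Dict[str, object] = {"ngram_range": (1, 2), "max_features": 120000}
XGB_PARAMS: Dict[str, object] = {
    "n_estimators": 961,
    "learning_rate": 0.03945475871427614,
    "max_depth": 6,
}


def build_input_text(category: str, goal: str, turns: List[str]) -> str:
    """Join category, goal, and turns into one string.

    Assumption: fields are joined with a single space; the real separator
    is not stated in this card.
    """
    return " ".join([category, goal, *turns])


def make_pipeline():
    """Build an untrained TF-IDF + XGBoost pipeline with the listed settings."""
    # Imported lazily so the helpers above work without the ML dependencies.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    return Pipeline([
        ("tfidf", TfidfVectorizer(**TFIDF_PARAMS)),
        ("xgb", XGBClassifier(**XGB_PARAMS)),
    ])


# Hypothetical conversation, for illustration only.
example = build_input_text(
    "cybercrime", "example goal", ["user: hi", "assistant: hello"]
)
print(example)  # cybercrime example goal user: hi assistant: hello
```

With both packages installed, `make_pipeline().fit(texts, labels)` would train on a list of such strings and binary labels, and `predict_proba(texts)[:, 1]` would give the positive-class score to compare against the threshold.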
Dataset Size (before turn expansion)
Original rows (after cleaning and balancing): 510 (unsafe: 0, safe: 0)