yonad2008/predict_xgb_phi4_14b


---
language: en
tags:
- xgboost
- jailbreak-detection
- text-classification
model-index:
- name: predict_xgb_phi4_14b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.2807
    - name: PR-AUC
      type: pr_auc
      value: 0.2896
    - name: ROC-AUC
      type: roc_auc
      value: 0.7231
    - name: Precision
      type: precision
      value: 0.2500
    - name: Recall
      type: recall
      value: 0.3200
---
# XGBoost Jailbreak Prediction Model: phi4:14b

An XGBoost classifier over TF-IDF features (optionally reduced with TruncatedSVD) that scores unsafe/jailbreak likelihood in multi-turn conversations.

## Evaluation Results (best fold: 1)

| Metric         | Value  |
|----------------|--------|
| F1             | 0.2807 |
| PR-AUC         | 0.2896 |
| ROC-AUC        | 0.7231 |
| Precision      | 0.2500 |
| Recall         | 0.3200 |
| Best Threshold | 0.20   |
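
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two rows. The helper name `f1_from_precision_recall` is hypothetical and is not part of the uploaded artifacts.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation table.
precision, recall = 0.2500, 0.3200
f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.2807
```

The recomputed value matches the reported 0.2807, which suggests that precision, recall, and F1 were all measured at the same decision threshold (0.20).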

## Training Details

- Target model: `phi4:14b`
- Datasets: harmful_behaviors
- K-Folds: 5
- Input format: single turn (category + strategy_name + one TURN line)
- TF-IDF ngram_range: `(1, 1)`
- TF-IDF max_features: `120000`
- TruncatedSVD: enabled (`True`), requested `n_components=1024`
- XGBoost n_estimators: `971`
- XGBoost learning_rate: `0.045325359791945935`
- XGBoost max_depth: `7`

## Dataset Size (training samples)

Prepared turn-level samples: 1611 (unsafe: 119, safe: 1492)
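
The counts above imply a strongly imbalanced dataset, which helps put the PR-AUC figure in context: a classifier that scores at random has an expected PR-AUC close to the positive-class prevalence. The sketch below computes that prevalence and the safe-to-unsafe ratio. The ratio is the value commonly passed to XGBoost's `scale_pos_weight` parameter; whether this model was trained with it is not stated in the card, so treat that as an assumption.

```python
# Counts copied from the Dataset Size section.
unsafe, safe = 119, 1492
total = unsafe + safe

prevalence = unsafe / total       # fraction of unsafe samples
imbalance_ratio = safe / unsafe   # negatives per positive

print(f"total samples     = {total}")                # 1611
print(f"unsafe prevalence = {prevalence:.4f}")       # 0.0739
print(f"safe:unsafe ratio = {imbalance_ratio:.2f}")  # 12.54
```

Against a prevalence of about 0.074 over the full sample set, the reported PR-AUC of 0.2896 is roughly four times the random baseline. The per-fold prevalence may differ slightly from this overall figure.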