yonad2008/predict_qwen2.5_7b-instruct

	---
	language: en
	tags:
	- jailbreak-detection
	- deberta-v3
	- text-classification
	model-index:
	- name: predict_qwen2.5_7b-instruct
	  results:
	  - task:
	      type: text-classification
	      name: Jailbreak Detection
	    metrics:
	    - name: F1
	      type: f1
	      value: 0.7731
	    - name: PR-AUC
	      type: pr_auc
	      value: 0.8692
	    - name: ROC-AUC
	      type: roc_auc
	      value: 0.9591
	    - name: Precision
	      type: precision
	      value: 0.8364
	    - name: Recall
	      type: recall
	      value: 0.7188
	---
	# Jailbreak Prediction Model: qwen2.5:7b-instruct

	Fine-tuned DeBERTa-v3-base for detecting unsafe/jailbreak prompts in multi-turn conversations.
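
A minimal inference sketch, under several assumptions this card does not confirm: that the checkpoint and its tokenizer load through the standard `transformers` sequence-classification classes, that the head emits two logits, and that class index 1 means "unsafe" (check `id2label` in the repository's `config.json`). The 0.10 cutoff and the 512-token limit are the values reported in this card. The helper names `softmax`, `is_unsafe`, and `classify` are hypothetical, and how the text should be laid out for the "turns only" input format is not specified here.

```python
import math

MODEL_ID = "yonad2008/predict_qwen2.5_7b-instruct"
THRESHOLD = 0.10   # "Best Threshold" reported in this card
MAX_LENGTH = 512   # "Max Length" reported in this card
UNSAFE_INDEX = 1   # ASSUMPTION: label 1 = unsafe; verify against config.id2label


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_unsafe(logits, threshold=THRESHOLD, unsafe_index=UNSAFE_INDEX):
    """Flag a prompt when the unsafe-class probability reaches the threshold."""
    return softmax(logits)[unsafe_index] >= threshold


def classify(text):
    """Score one text with the hosted checkpoint (needs torch and transformers)."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()
    inputs = tokenizer(text, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return is_unsafe(logits)
```

Because the cutoff is far below 0.5, a prompt is flagged even when the classifier considers "unsafe" the less likely class, which trades precision for recall.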

	## Evaluation Results (best fold: 3)

	| Metric         | Value  |
	|----------------|--------|
	| F1             | 0.7731 |
	| PR-AUC         | 0.8692 |
	| ROC-AUC        | 0.9591 |
	| Precision      | 0.8364 |
	| Recall         | 0.7188 |
	| Best Threshold | 0.10   |
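
As a consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from those two rows. The sketch below only does that arithmetic. The function name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.8364, 0.7188   # values from the table above
f1 = f1_from(precision, recall)
print(f"recomputed F1 = {f1:.4f}")
```

The recomputed value agrees with the reported 0.7731 to within the rounding of the four-decimal inputs.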

	## Training Details

	- Base model: `microsoft/deberta-v3-base`
	- Target model: `qwen2.5:7b-instruct`
	- Datasets: HarmBench
	- K-Folds: 5
	- Epochs: 5
	- Learning Rate: 2e-05
	- Max Length: 512
	- Input format: turns only
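
The settings above can be collected into a small config object, with the 5-fold split sketched in plain Python. This is an illustration, not the author's training script: the card does not say whether the folds were stratified, what seed was used, or whether splitting happened before or after turn expansion. `TrainConfig` and `stratified_folds` are hypothetical names, and the label list simply reuses the class counts reported under "Dataset Size".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    base_model: str = "microsoft/deberta-v3-base"
    k_folds: int = 5
    epochs: int = 5
    learning_rate: float = 2e-5
    max_length: int = 512


def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds, keeping class ratios roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal indices round-robin across folds
    return folds


cfg = TrainConfig()
labels = [1] * 308 + [0] * 1228   # 1 = unsafe, 0 = safe (counts from this card)
folds = stratified_folds(labels, cfg.k_folds)

for held_out, val_idx in enumerate(folds):
    # Every fold except the held-out one forms the training split.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Fine-tune on train_idx and evaluate on val_idx here (training loop not shown).
```

Each fold serves once as the validation set, which is how a "best fold" can be picked afterwards.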

	## Dataset Size (before turn expansion)

	Original rows (after cleaning and balancing): 1536 (unsafe: 308, safe: 1228)
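
The row counts above imply a class ratio of roughly 1:4, which matters when reading PR-AUC: a classifier that scores at random has an expected PR-AUC equal to the positive-class prevalence. The arithmetic below is a sketch at the row level only. Turn expansion may shift the ratio, and the card does not report post-expansion counts.

```python
unsafe, safe = 308, 1228              # counts from this card
total = unsafe + safe                 # 1536, matching the reported row count

unsafe_rate = unsafe / total          # share of rows labeled unsafe
safe_per_unsafe = safe / unsafe       # safe rows for every unsafe row

# A random scorer's precision equals the prevalence at every recall level,
# so the prevalence is the chance-level PR-AUC for this class mix.
chance_pr_auc = unsafe_rate

print(f"unsafe share: {unsafe_rate:.3f}, safe per unsafe: {safe_per_unsafe:.2f}")
```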