yonad2008/predict_gnn_gemma4_e4b

	---
	language: en
	tags:
	- gnn
	- jailbreak-detection
	- text-classification
	model-index:
	- name: predict_gnn_gemma4_e4b
	  results:
	  - task:
	      type: text-classification
	      name: Jailbreak Detection
	    metrics:
	    - name: F1
	      type: f1
	      value: 0.8937
	    - name: PR-AUC
	      type: pr_auc
	      value: 0.9358
	    - name: ROC-AUC
	      type: roc_auc
	      value: 0.9513
	    - name: Precision
	      type: precision
	      value: 0.9589
	    - name: Recall
	      type: recall
	      value: 0.8379
	---
	# GNN Jailbreak Prediction Model (gemma4:e4b)

	A homogeneous GNN classifier that scores each turn of a multi-turn conversation for the likelihood that it is unsafe or a jailbreak attempt.

	## Evaluation Results

	| Metric         | Value  |
	|----------------|--------|
	| F1             | 0.8937 |
	| PR-AUC         | 0.9358 |
	| ROC-AUC        | 0.9513 |
	| Precision      | 0.9589 |
	| Recall         | 0.8379 |
	| Best Threshold | 0.750  |
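
The best threshold in the table turns a predicted probability into a binary label. The sketch below shows one way to apply it and to recompute precision, recall, and F1 from the resulting labels. The function names, the `>=` comparison, and the toy scores are assumptions for illustration; the card does not say how the evaluation was implemented.

```python
def apply_threshold(probabilities, threshold=0.75):
    """Label a turn unsafe (1) when its score reaches the threshold.

    Whether the original evaluation used >= or > is an assumption.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall_f1(y_true, y_pred):
    """Compute binary precision, recall, and F1 for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical scores and labels, not taken from the model's evaluation set.
scores = [0.92, 0.81, 0.74, 0.40, 0.77, 0.10]
labels = [1, 1, 1, 0, 0, 0]

preds = apply_threshold(scores)
precision, recall, f1 = precision_recall_f1(labels, preds)
```

A threshold of 0.75 is fairly strict. Raising it generally trades recall for precision, which matches the reported pattern of precision above recall.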

	## Training Details

	- Target model: `gemma4:e4b`
	- Datasets: `harmbench`
	- Split column: `goal`
	- Seed: `42`
	- Sentence model: `sentence-transformers/all-MiniLM-L6-v2`
	- Hidden channels: `128`
	- Num layers: `2`
	- Dropout: `0.3`
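
The settings above can be gathered into a single configuration object, which makes the layer shapes easy to check. This is a minimal sketch: the `GNNConfig` class and its `layer_dims` helper are hypothetical names, and it assumes the node features are the raw 384-dimensional `all-MiniLM-L6-v2` sentence embeddings with nothing concatenated. The card does not state the GNN layer type or the classification head, so neither is modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GNNConfig:
    """Hyperparameters listed in the model card (class name is hypothetical)."""

    target_model: str = "gemma4:e4b"
    dataset: str = "harmbench"
    split_column: str = "goal"
    seed: int = 42
    sentence_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    # all-MiniLM-L6-v2 produces 384-dimensional embeddings; using them
    # directly as node features is an assumption.
    embedding_dim: int = 384
    hidden_channels: int = 128
    num_layers: int = 2
    dropout: float = 0.3

    def layer_dims(self):
        """Return (in_channels, out_channels) for each message-passing layer."""
        dims = [self.embedding_dim] + [self.hidden_channels] * self.num_layers
        return list(zip(dims[:-1], dims[1:]))


config = GNNConfig()
print(config.layer_dims())
```

Under these assumptions the first layer maps 384 features to 128 and the second keeps 128, with dropout of 0.3 presumably applied between them.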

	## Dataset Size (training samples)

	Prepared turn-level samples: 425
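
Because the samples are turn-level and the split column is `goal`, a plausible reading is that all turns sharing a goal are kept in the same split, so the same attack goal never appears in both training and evaluation. The sketch below illustrates that kind of group-aware split. The function name, the 20% held-out fraction, and the toy records are assumptions; only the `goal` column and seed `42` come from the card.

```python
import random


def split_by_group(samples, group_key="goal", test_fraction=0.2, seed=42):
    """Split samples so that every group lands entirely in one split."""
    # Sort first so the shuffle is reproducible regardless of input order.
    groups = sorted({s[group_key] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)

    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])

    train = [s for s in samples if s[group_key] not in test_groups]
    test = [s for s in samples if s[group_key] in test_groups]
    return train, test


# Hypothetical turn-level records: 20 turns spread over 5 goals.
samples = [{"goal": f"goal_{i % 5}", "turn": i} for i in range(20)]
train, test = split_by_group(samples)
```

With only 425 prepared samples, splitting at the goal level can leave few distinct goals in the held-out set, so metrics may vary noticeably across seeds.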