GNN Jailbreak Prediction Model (gemma4:26b)
Homogeneous GNN classifier for unsafe/jailbreak likelihood in multi-turn conversations.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.8783 |
| PR-AUC | 0.9738 |
| ROC-AUC | 0.9697 |
| Precision | 0.8554 |
| Recall | 0.9344 |
| Best Threshold | 0.310 |
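To show how the reported best threshold might be applied at inference time, here is a minimal sketch. It assumes the classifier emits one unsafe-probability score per turn and that a turn is flagged when its score is at or above the threshold. The function name `flag_unsafe`, the `>=` comparison, and the example scores are all hypothetical and not taken from the model's code.

```python
# "Best Threshold" value reported in the evaluation table.
BEST_THRESHOLD = 0.310


def flag_unsafe(scores, threshold=BEST_THRESHOLD):
    """Return one boolean per turn: True when the score reaches the threshold.

    Assumption: scores are probabilities in [0, 1], and the comparison is
    inclusive (>=). The actual model code may use a strict comparison.
    """
    return [score >= threshold for score in scores]


# Hypothetical per-turn scores for a four-turn conversation.
example_scores = [0.05, 0.28, 0.31, 0.92]
flags = flag_unsafe(example_scores)
print(flags)
```

A threshold below 0.5 flags turns more readily, which fits with the reported recall (0.9344) being higher than precision (0.8554).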
Training Details
- Target model: gemma4:26b
- Datasets: harmbench
- Split column: goal
- Seed: 42
- Sentence model: sentence-transformers/all-MiniLM-L6-v2
- Hidden channels: 128
- Num layers: 2
- Dropout: 0.3
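The split column and seed listed above suggest a grouped split, in which all turns sharing a goal land on the same side of the train/test boundary so that no goal leaks across it. The sketch below illustrates that idea under stated assumptions: the function `split_by_group`, the `test_fraction` of 0.2, and the sample dictionary layout are hypothetical, and the model's actual splitting procedure may differ.

```python
import random
from collections import defaultdict


def split_by_group(samples, group_key="goal", test_fraction=0.2, seed=42):
    """Split samples so that every group_key value appears in only one split.

    Assumptions (not from the model card): samples are dicts, and
    test_fraction is the share of distinct groups held out for testing.
    """
    # Collect samples under their group value (here, the goal text).
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample)

    # Shuffle the distinct groups reproducibly with the given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Hold out at least one whole group for testing.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [s for k in keys if k not in test_keys for s in groups[k]]
    test = [s for k in keys if k in test_keys for s in groups[k]]
    return train, test


# Hypothetical data: 5 goals with 3 turns each.
samples = [{"goal": f"goal-{g}", "turn": t} for g in range(5) for t in range(3)]
train, test = split_by_group(samples)
print(len(train), len(test))
```

Splitting on groups rather than on individual turns matters for multi-turn data, because turns from the same conversation goal are highly correlated and would otherwise inflate the evaluation scores.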
Dataset Size (training samples)
Prepared turn-level samples: 517