GNN Jailbreak Prediction Model (gemma4:26b+gemma4:e4b)
Homogeneous GNN classifier that estimates unsafe/jailbreak likelihood in multi-turn conversations.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.7113 |
| PR-AUC | 0.7991 |
| ROC-AUC | 0.9814 |
| Precision | 0.6911 |
| Recall | 0.7328 |
| Best Threshold | 0.610 |
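
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the table. The sketch below does that and also illustrates one way the best threshold might be applied to a predicted probability. The helper name `is_flagged` and the `>=` comparison are assumptions made for illustration; the card does not document how the threshold is applied.

```python
# Values copied from the metrics table.
PRECISION = 0.6911
RECALL = 0.7328
BEST_THRESHOLD = 0.610


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_flagged(probability: float, threshold: float = BEST_THRESHOLD) -> bool:
    """Hypothetical helper: flag a turn once its predicted probability
    reaches the threshold (the inclusive comparison is an assumption)."""
    return probability >= threshold


f1 = f1_score(PRECISION, RECALL)
print(f"F1 recomputed from precision and recall: {f1:.4f}")
```

The recomputed value agrees with the reported F1 of 0.7113 to four decimal places.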
Training Details
- Target model: gemma4:26b+gemma4:e4b
- Datasets: harmbench, harmful_behaviors_1
- Split column: goal
- Seed: 42
- Sentence model: sentence-transformers/all-MiniLM-L6-v2
- Hidden channels: 128
- Num layers: 2
- Dropout: 0.3
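
A split column of `goal` suggests that samples are partitioned by goal, so that turns belonging to the same goal never appear on both sides of the split. The following is a minimal sketch of such a grouped split using only the standard library. The function name `split_by_goal`, the `test_fraction` value, and the toy sample layout are all assumptions; the card states only the split column and the seed.

```python
import random
from collections import defaultdict

SEED = 42  # seed listed in the training details


def split_by_goal(samples, test_fraction=0.2, seed=SEED):
    """Split samples so that all samples sharing a `goal` land in the same
    partition. `test_fraction` is an assumed value, not taken from the card."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["goal"]].append(sample)

    # Sort before shuffling so the result depends only on the seed.
    goals = sorted(groups)
    random.Random(seed).shuffle(goals)

    n_test = max(1, round(len(goals) * test_fraction))
    test_goals = set(goals[:n_test])

    train = [s for g in goals if g not in test_goals for s in groups[g]]
    test = [s for g in goals if g in test_goals for s in groups[g]]
    return train, test


# Toy data: 20 turn-level samples spread over 5 hypothetical goals.
samples = [{"goal": f"goal-{i % 5}", "turn": i} for i in range(20)]
train, test = split_by_goal(samples)

train_goals = {s["goal"] for s in train}
test_goals = {s["goal"] for s in test}
print(len(train), len(test), train_goals & test_goals)
```

Splitting on the group key rather than on individual turns avoids leakage between near-identical conversations that pursue the same goal.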
Dataset Size (training samples)
Prepared turn-level samples: 12548