GNN Jailbreak Prediction Model (gemma4:e4b)
A homogeneous GNN classifier that predicts the likelihood of an unsafe or jailbreak outcome in multi-turn conversations.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.8937 |
| PR-AUC | 0.9358 |
| ROC-AUC | 0.9513 |
| Precision | 0.9589 |
| Recall | 0.8379 |
| Best Threshold | 0.750 |
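As a rough illustration of how the reported threshold might be applied, the sketch below turns per-turn scores into binary flags and computes precision, recall, and F1 from them. The scores and labels are toy values invented for the example, not model output, and the function names are hypothetical. Treating a score at or above 0.750 as positive is an assumption; the card does not say whether the comparison is inclusive.

```python
from typing import List, Sequence, Tuple

# Threshold taken from the results table; the inclusive comparison is an assumption.
BEST_THRESHOLD = 0.750


def classify(probs: Sequence[float], threshold: float = BEST_THRESHOLD) -> List[int]:
    """Flag a turn as unsafe (1) when its score reaches the threshold."""
    return [int(p >= threshold) for p in probs]


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (unsafe) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data for illustration only.
probs = [0.92, 0.80, 0.74, 0.40, 0.77, 0.10]
labels = [1, 1, 1, 0, 0, 0]

preds = classify(probs)
precision, recall, f1 = precision_recall_f1(labels, preds)
print(preds, precision, recall, f1)
```

Raising the threshold generally trades recall for precision, which is consistent with the table showing higher precision than recall at 0.750.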
Training Details
- Target model: gemma4:e4b
- Datasets: harmbench
- Split column: goal
- Seed: 42
- Sentence model: sentence-transformers/all-MiniLM-L6-v2
- Hidden channels: 128
- Num layers: 2
- Dropout: 0.3
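For reproducibility, the settings listed above could be collected in a single configuration object. The following is a minimal sketch using only the standard library; the class name `TrainConfig` and its field names are hypothetical and are not taken from the original training code. Only the values come from this card.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters reported in this card."""

    target_model: str = "gemma4:e4b"
    datasets: Tuple[str, ...] = ("harmbench",)
    split_column: str = "goal"
    seed: int = 42
    sentence_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    hidden_channels: int = 128
    num_layers: int = 2
    dropout: float = 0.3

    def __post_init__(self) -> None:
        # Basic sanity checks so an invalid configuration fails early.
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
        if self.num_layers < 1 or self.hidden_channels < 1:
            raise ValueError("num_layers and hidden_channels must be positive")


config = TrainConfig()
print(asdict(config))
```

Freezing the dataclass prevents the values from being changed by accident during a run.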
Dataset Size (training samples)
Prepared turn-level samples: 425