GNN Jailbreak Prediction Model (llama3:8b)
Homogeneous GNN classifier for unsafe/jailbreak likelihood in multi-turn conversations.
Evaluation Results
| Metric | Value |
|---|---|
| F1 | 0.9627 |
| PR-AUC | 0.9923 |
| ROC-AUC | 0.9920 |
| Precision | 0.9700 |
| Recall | 0.9580 |
| Best Threshold | 0.430 |
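
The best threshold in the table is presumably the cut-off applied to the classifier's predicted probability for each turn. The sketch below shows one way such a threshold could be applied and how precision, recall, and F1 follow from the resulting labels. The function names, the `>=` comparison, and the toy scores are assumptions for illustration; they are not taken from the model's evaluation code.

```python
from typing import List, Sequence, Tuple

BEST_THRESHOLD = 0.430  # "Best Threshold" reported in the table


def classify(probabilities: Sequence[float],
             threshold: float = BEST_THRESHOLD) -> List[int]:
    """Label a turn 1 (unsafe/jailbreak) when its score reaches the threshold.

    Using >= rather than > is an assumption; the card does not say.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute binary precision, recall, and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical scores and labels, for illustration only.
scores = [0.91, 0.44, 0.40, 0.05]
labels = [1, 1, 1, 0]
preds = classify(scores)
precision, recall, f1 = precision_recall_f1(labels, preds)
```

Lowering the threshold trades precision for recall, which may matter if missed jailbreaks are costlier than false alarms.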
Training Details
- Target model: `llama3:8b`
- Datasets: harmbench
- Split column: `goal`
- Seed: `42`
- Sentence model: `sentence-transformers/all-MiniLM-L6-v2`
- Hidden channels: `128`
- Num layers: `2`
- Dropout: `0.3`
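
A split column of `goal` suggests that samples were partitioned by goal, so that every turn sharing a goal lands on the same side of the train/test split. The following is a minimal sketch of such a grouped split under that assumption, using the seed listed above. The function name `split_by_group`, the `test_fraction` of 0.2, and the sample layout are hypothetical and not taken from the training code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

SEED = 42  # seed listed in the training details

Sample = Dict[str, object]


def split_by_group(samples: List[Sample],
                   group_key: str = "goal",
                   test_fraction: float = 0.2,  # hypothetical value
                   seed: int = SEED) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that no group value appears in both partitions."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_key]].append(sample)

    # Sort before shuffling so the result depends only on the seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [s for k in keys if k not in test_keys for s in groups[k]]
    test = [s for k in keys if k in test_keys for s in groups[k]]
    return train, test


# Hypothetical turn-level samples: 5 goals with 3 turns each.
samples = [{"goal": f"goal-{g}", "turn": t} for g in range(5) for t in range(3)]
train, test = split_by_group(samples)
```

Splitting by group rather than by individual turn avoids leaking near-identical conversations about the same goal into both partitions.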
Dataset Size (training samples)
Prepared turn-level samples: 522