---
language: en
tags:
- gnn
- jailbreak-detection
- text-classification
model-index:
- name: predict_gnn_llama3_8b
  results:
  - task:
      type: text-classification
      name: Jailbreak Detection
    metrics:
    - name: F1
      type: f1
      value: 0.9627
    - name: PR-AUC
      type: pr_auc
      value: 0.9923
    - name: ROC-AUC
      type: roc_auc
      value: 0.9920
    - name: Precision
      type: precision
      value: 0.9700
    - name: Recall
      type: recall
      value: 0.9580
---

# GNN Jailbreak Prediction Model (llama3:8b)

Homogeneous GNN classifier for unsafe/jailbreak likelihood in multi-turn conversations.

## Evaluation Results

| Metric         | Value  |
|----------------|--------|
| F1             | 0.9627 |
| PR-AUC         | 0.9923 |
| ROC-AUC        | 0.9920 |
| Precision      | 0.9700 |
| Recall         | 0.9580 |
| Best Threshold | 0.430  |

## Training Details

- **Target model**: `llama3:8b`
- **Datasets**: harmbench
- **Split column**: `goal`
- **Seed**: `42`
- **Sentence model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Hidden channels**: `128`
- **Num layers**: `2`
- **Dropout**: `0.3`

## Dataset Size (training samples)

Prepared turn-level samples: 522
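
## Applying the Best Threshold (sketch)

The evaluation results report a best threshold of 0.430 but do not say how it is applied. The sketch below shows one plausible reading: the classifier emits a per-turn probability in [0, 1], and a turn is flagged when its score is at or above the threshold. The function names `flag_turns` and `conversation_is_flagged`, the `>=` comparison, and the "any flagged turn" aggregation rule are all assumptions made for illustration, not part of this model's released code.

```python
from typing import List

# "Best Threshold" taken from the evaluation table of this model card.
BEST_THRESHOLD = 0.430


def flag_turns(scores: List[float], threshold: float = BEST_THRESHOLD) -> List[bool]:
    """Flag each turn whose unsafe-likelihood score is >= threshold.

    Hypothetical helper: assumes `scores` are per-turn probabilities in [0, 1].
    """
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score!r} is not a probability in [0, 1]")
    return [score >= threshold for score in scores]


def conversation_is_flagged(scores: List[float], threshold: float = BEST_THRESHOLD) -> bool:
    """Flag a conversation if any single turn is flagged.

    Assumed aggregation rule; the model card does not specify one.
    """
    return any(flag_turns(scores, threshold))
```

For example, `flag_turns([0.10, 0.43, 0.90])` marks the second and third turns, and `conversation_is_flagged([0.10, 0.20])` returns `False`. Lowering the threshold would trade precision for recall, so the reported precision and recall should be expected to hold only near 0.430.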