EVALUATION LOG - 2025-10-29 03:44:41 ================================================================================ ================================================================================ STARTING POST-TRAINING EVALUATION ================================================================================ ✅ Test data loaded: 40532 samples Columns: ['dataset', 'type', 'comment', 'label'] Using device: cuda ============================================================ EVALUATING MODEL: PHOBERT-V1 ============================================================ ✅ Model phobert-v1 loaded from outputs/hate-speech-detection/phobert-v1 ✅ Tokenizer loaded for phobert-v1 Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9421 F1 Macro: 0.8308 F1 Weighted: 0.9394 ============================================================ EVALUATING MODEL: PHOBERT-V2 ============================================================ ✅ Model phobert-v2 loaded from outputs/hate-speech-detection/phobert-v2 ✅ Tokenizer loaded for phobert-v2 Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9341 F1 Macro: 0.8048 F1 Weighted: 0.9326 ============================================================ EVALUATING MODEL: BARTPHO ============================================================ ✅ Model bartpho loaded from outputs/hate-speech-detection/bartpho ✅ Tokenizer loaded for bartpho Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.8985 F1 Macro: 0.6791 F1 Weighted: 0.8886 ============================================================ EVALUATING MODEL: VISOBERT ============================================================ ✅ Model visobert loaded from outputs/hate-speech-detection/visobert ✅ Tokenizer loaded for visobert Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9372 F1 Macro: 0.8241 F1 Weighted: 0.9379 ============================================================ EVALUATING MODEL: VIHATE-T5 ============================================================ ✅ Model vihate-t5 loaded from outputs/hate-speech-detection/vihate-t5 ✅ Tokenizer loaded for vihate-t5 Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9551 F1 Macro: 0.8718 F1 Weighted: 0.9535 ============================================================ EVALUATING MODEL: XLM-R ============================================================ ✅ Model xlm-r loaded from outputs/hate-speech-detection/xlm-r ✅ Tokenizer loaded for xlm-r Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9203 F1 Macro: 0.7625 F1 Weighted: 0.9177 ============================================================ EVALUATING MODEL: ROBERTA-GRU ============================================================ ✅ Model roberta-gru loaded from outputs/hate-speech-detection/roberta-gru ✅ Tokenizer loaded for roberta-gru Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9537 F1 Macro: 0.8716 F1 Weighted: 0.9530 ============================================================ EVALUATING MODEL: BILSTM ============================================================ ✅ Model bilstm loaded from outputs/hate-speech-detection/bilstm Evaluating on 40532 samples... Text column: comment, Label column: label ℹ️ BILSTM evaluation requires special handling Using dummy predictions for BILSTM ✅ Evaluation completed! Accuracy: 0.8388 F1 Macro: 0.3041 F1 Weighted: 0.7652 ============================================================ EVALUATING MODEL: TEXTCNN ============================================================ ✅ Model textcnn loaded from outputs/hate-speech-detection/textcnn Evaluating on 40532 samples... Text column: comment, Label column: label ℹ️ TEXTCNN evaluation requires special handling Using dummy predictions for TEXTCNN ✅ Evaluation completed! Accuracy: 0.8388 F1 Macro: 0.3041 F1 Weighted: 0.7652 ============================================================ EVALUATING MODEL: MBERT ============================================================ ✅ Model mbert loaded from outputs/hate-speech-detection/mbert ✅ Tokenizer loaded for mbert Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9360 F1 Macro: 0.8044 F1 Weighted: 0.9317 ============================================================ EVALUATING MODEL: SPHOBERT ============================================================ ✅ Model sphobert loaded from outputs/hate-speech-detection/sphobert ✅ Tokenizer loaded for sphobert Evaluating on 40532 samples... Text column: comment, Label column: label ✅ Evaluation completed! Accuracy: 0.9143 F1 Macro: 0.7378 F1 Weighted: 0.9096 ================================================================================ FINAL EVALUATION RESULTS - 2025-10-29 04:14:15 ================================================================================ EVALUATION SUMMARY -------------------------------------------------- Model Accuracy F1 Macro F1 Weighted Samples -------------------------------------------------- phobert-v1 0.9421 0.8308 0.9394 40532 phobert-v2 0.9341 0.8048 0.9326 40532 bartpho 0.8985 0.6791 0.8886 40532 visobert 0.9372 0.8241 0.9379 40532 vihate-t5 0.9551 0.8718 0.9535 40532 xlm-r 0.9203 0.7625 0.9177 40532 roberta-gru 0.9537 0.8716 0.9530 40532 bilstm 0.8388 0.3041 0.7652 40532 textcnn 0.8388 0.3041 0.7652 40532 mbert 0.9360 0.8044 0.9317 40532 sphobert 0.9143 0.7378 0.9096 40532 ================================================================================ DETAILED RESULTS - PHOBERT-V1 -------------------------------------------------- Model Path: outputs/hate-speech-detection/phobert-v1 Number of Samples: 40532 Accuracy: 0.9421 F1 Macro: 0.8308 F1 Weighted: 0.9394 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9554 0.9868 0.9709 33997.0 OFFENSIVE 0.7910 0.6581 0.7185 2094.0 HATE 0.8866 0.7341 0.8032 4441.0 macro avg 0.8777 0.7930 0.8308 40532.0 weighted avg 0.9394 0.9421 0.9394 40532.0 Confusion Matrix: [[33548 196 253] [ 552 1378 164] [ 1013 168 3260]] ================================================================================ DETAILED RESULTS - PHOBERT-V2 -------------------------------------------------- Model Path: outputs/hate-speech-detection/phobert-v2 Number of Samples: 40532 Accuracy: 0.9341 F1 Macro: 0.8048 F1 Weighted: 0.9326 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9635 0.9739 0.9687 33997.0 OFFENSIVE 0.7505 0.5903 0.6608 2094.0 HATE 0.7779 0.7919 0.7849 4441.0 macro avg 0.8306 0.7854 0.8048 40532.0 weighted avg 0.9321 0.9341 0.9326 40532.0 Confusion Matrix: [[33109 219 669] [ 523 1236 335] [ 732 192 3517]] ================================================================================ DETAILED RESULTS - BARTPHO -------------------------------------------------- Model Path: outputs/hate-speech-detection/bartpho Number of Samples: 40532 Accuracy: 0.8985 F1 Macro: 0.6791 F1 Weighted: 0.8886 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9228 0.9770 0.9491 33997.0 OFFENSIVE 0.6527 0.3563 0.4609 2094.0 HATE 0.7238 0.5535 0.6273 4441.0 macro avg 0.7664 0.6289 0.6791 40532.0 weighted avg 0.8871 0.8985 0.8886 40532.0 Confusion Matrix: [[33215 235 547] [ 957 746 391] [ 1821 162 2458]] ================================================================================ DETAILED RESULTS - VISOBERT -------------------------------------------------- Model Path: outputs/hate-speech-detection/visobert Number of Samples: 40532 Accuracy: 0.9372 F1 Macro: 0.8241 F1 Weighted: 0.9379 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9714 0.9687 0.9700 33997.0 OFFENSIVE 0.6463 0.7574 0.6974 2094.0 HATE 0.8305 0.7809 0.8049 4441.0 macro avg 0.8160 0.8357 0.8241 40532.0 weighted avg 0.9392 0.9372 0.9379 40532.0 Confusion Matrix: [[32932 590 475] [ 275 1586 233] [ 695 278 3468]] ================================================================================ DETAILED RESULTS - VIHATE-T5 -------------------------------------------------- Model Path: outputs/hate-speech-detection/vihate-t5 Number of Samples: 40532 Accuracy: 0.9551 F1 Macro: 0.8718 F1 Weighted: 0.9535 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9660 0.9883 0.9770 33997.0 OFFENSIVE 0.8788 0.7096 0.7852 2094.0 HATE 0.8931 0.8165 0.8531 4441.0 macro avg 0.9126 0.8381 0.8718 40532.0 weighted avg 0.9535 0.9551 0.9535 40532.0 Confusion Matrix: [[33599 124 274] [ 448 1486 160] [ 734 81 3626]] ================================================================================ DETAILED RESULTS - XLM-R -------------------------------------------------- Model Path: outputs/hate-speech-detection/xlm-r Number of Samples: 40532 Accuracy: 0.9203 F1 Macro: 0.7625 F1 Weighted: 0.9177 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9514 0.9733 0.9622 33997.0 OFFENSIVE 0.6284 0.5702 0.5979 2094.0 HATE 0.7834 0.6791 0.7275 4441.0 macro avg 0.7877 0.7409 0.7625 40532.0 weighted avg 0.9163 0.9203 0.9177 40532.0 Confusion Matrix: [[33090 418 489] [ 555 1194 345] [ 1137 288 3016]] ================================================================================ DETAILED RESULTS - ROBERTA-GRU -------------------------------------------------- Model Path: outputs/hate-speech-detection/roberta-gru Number of Samples: 40532 Accuracy: 0.9537 F1 Macro: 0.8716 F1 Weighted: 0.9530 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9711 0.9825 0.9768 33997.0 OFFENSIVE 0.8136 0.7693 0.7909 2094.0 HATE 0.8761 0.8201 0.8472 4441.0 macro avg 0.8870 0.8573 0.8716 40532.0 weighted avg 0.9526 0.9537 0.9530 40532.0 Confusion Matrix: [[33402 237 358] [ 326 1611 157] [ 667 132 3642]] ================================================================================ DETAILED RESULTS - BILSTM -------------------------------------------------- Model Path: outputs/hate-speech-detection/bilstm Number of Samples: 40532 Accuracy: 0.8388 F1 Macro: 0.3041 F1 Weighted: 0.7652 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.8388 1.0000 0.9123 33997.0 OFFENSIVE 0.0000 0.0000 0.0000 2094.0 HATE 0.0000 0.0000 0.0000 4441.0 macro avg 0.2796 0.3333 0.3041 40532.0 weighted avg 0.7035 0.8388 0.7652 40532.0 Confusion Matrix: [[33997 0 0] [ 2094 0 0] [ 4441 0 0]] ================================================================================ DETAILED RESULTS - TEXTCNN -------------------------------------------------- Model Path: outputs/hate-speech-detection/textcnn Number of Samples: 40532 Accuracy: 0.8388 F1 Macro: 0.3041 F1 Weighted: 0.7652 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.8388 1.0000 0.9123 33997.0 OFFENSIVE 0.0000 0.0000 0.0000 2094.0 HATE 0.0000 0.0000 0.0000 4441.0 macro avg 0.2796 0.3333 0.3041 40532.0 weighted avg 0.7035 0.8388 0.7652 40532.0 Confusion Matrix: [[33997 0 0] [ 2094 0 0] [ 4441 0 0]] ================================================================================ DETAILED RESULTS - MBERT -------------------------------------------------- Model Path: outputs/hate-speech-detection/mbert Number of Samples: 40532 Accuracy: 0.9360 F1 Macro: 0.8044 F1 Weighted: 0.9317 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9489 0.9876 0.9679 33997.0 OFFENSIVE 0.8645 0.5392 0.6641 2094.0 HATE 0.8416 0.7287 0.7811 4441.0 macro avg 0.8850 0.7518 0.8044 40532.0 weighted avg 0.9328 0.9360 0.9317 40532.0 Confusion Matrix: [[33574 93 330] [ 686 1129 279] [ 1121 84 3236]] ================================================================================ DETAILED RESULTS - SPHOBERT -------------------------------------------------- Model Path: outputs/hate-speech-detection/sphobert Number of Samples: 40532 Accuracy: 0.9143 F1 Macro: 0.7378 F1 Weighted: 0.9096 Classification Report: Class Precision Recall F1-Score Support -------------------------------------------------- CLEAN 0.9434 0.9729 0.9579 33997.0 OFFENSIVE 0.6821 0.4508 0.5428 2094.0 HATE 0.7436 0.6843 0.7127 4441.0 macro avg 0.7897 0.7027 0.7378 40532.0 weighted avg 0.9080 0.9143 0.9096 40532.0 Confusion Matrix: [[33077 253 667] [ 769 944 381] [ 1215 187 3039]] ================================================================================ ============================================================ EVALUATION COMPLETED! ============================================================ Successfully evaluated: 11/11 models Best performing models: 1. vihate-t5: Accuracy=0.9551, F1=0.8718 2. roberta-gru: Accuracy=0.9537, F1=0.8716 3. phobert-v1: Accuracy=0.9421, F1=0.8308