bartpho-hsd / evaluation_log_bartpho.txt
EVALUATION LOG - 2025-10-29 03:44:41
================================================================================
================================================================================
STARTING POST-TRAINING EVALUATION
================================================================================
✅ Test data loaded: 40532 samples
Columns: ['dataset', 'type', 'comment', 'label']
Using device: cuda
============================================================
EVALUATING MODEL: PHOBERT-V1
============================================================
✅ Model phobert-v1 loaded from outputs/hate-speech-detection/phobert-v1
✅ Tokenizer loaded for phobert-v1
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9421
F1 Macro: 0.8308
F1 Weighted: 0.9394
============================================================
EVALUATING MODEL: PHOBERT-V2
============================================================
✅ Model phobert-v2 loaded from outputs/hate-speech-detection/phobert-v2
✅ Tokenizer loaded for phobert-v2
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9341
F1 Macro: 0.8048
F1 Weighted: 0.9326
============================================================
EVALUATING MODEL: BARTPHO
============================================================
✅ Model bartpho loaded from outputs/hate-speech-detection/bartpho
✅ Tokenizer loaded for bartpho
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.8985
F1 Macro: 0.6791
F1 Weighted: 0.8886
============================================================
EVALUATING MODEL: VISOBERT
============================================================
✅ Model visobert loaded from outputs/hate-speech-detection/visobert
✅ Tokenizer loaded for visobert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9372
F1 Macro: 0.8241
F1 Weighted: 0.9379
============================================================
EVALUATING MODEL: VIHATE-T5
============================================================
✅ Model vihate-t5 loaded from outputs/hate-speech-detection/vihate-t5
✅ Tokenizer loaded for vihate-t5
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9551
F1 Macro: 0.8718
F1 Weighted: 0.9535
============================================================
EVALUATING MODEL: XLM-R
============================================================
✅ Model xlm-r loaded from outputs/hate-speech-detection/xlm-r
✅ Tokenizer loaded for xlm-r
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9203
F1 Macro: 0.7625
F1 Weighted: 0.9177
============================================================
EVALUATING MODEL: ROBERTA-GRU
============================================================
✅ Model roberta-gru loaded from outputs/hate-speech-detection/roberta-gru
✅ Tokenizer loaded for roberta-gru
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9537
F1 Macro: 0.8716
F1 Weighted: 0.9530
============================================================
EVALUATING MODEL: BILSTM
============================================================
✅ Model bilstm loaded from outputs/hate-speech-detection/bilstm
Evaluating on 40532 samples...
Text column: comment, Label column: label
ℹ️ BILSTM evaluation requires special handling
Using dummy (majority-class) predictions for BILSTM
✅ Evaluation completed!
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
============================================================
EVALUATING MODEL: TEXTCNN
============================================================
✅ Model textcnn loaded from outputs/hate-speech-detection/textcnn
Evaluating on 40532 samples...
Text column: comment, Label column: label
ℹ️ TEXTCNN evaluation requires special handling
Using dummy (majority-class) predictions for TEXTCNN
✅ Evaluation completed!
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
============================================================
EVALUATING MODEL: MBERT
============================================================
✅ Model mbert loaded from outputs/hate-speech-detection/mbert
✅ Tokenizer loaded for mbert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9360
F1 Macro: 0.8044
F1 Weighted: 0.9317
============================================================
EVALUATING MODEL: SPHOBERT
============================================================
✅ Model sphobert loaded from outputs/hate-speech-detection/sphobert
✅ Tokenizer loaded for sphobert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9143
F1 Macro: 0.7378
F1 Weighted: 0.9096
================================================================================
FINAL EVALUATION RESULTS - 2025-10-29 04:14:15
================================================================================
EVALUATION SUMMARY
----------------------------------------------------------
Model           Accuracy   F1 Macro   F1 Weighted   Samples
----------------------------------------------------------
phobert-v1        0.9421     0.8308        0.9394     40532
phobert-v2        0.9341     0.8048        0.9326     40532
bartpho           0.8985     0.6791        0.8886     40532
visobert          0.9372     0.8241        0.9379     40532
vihate-t5         0.9551     0.8718        0.9535     40532
xlm-r             0.9203     0.7625        0.9177     40532
roberta-gru       0.9537     0.8716        0.9530     40532
bilstm            0.8388     0.3041        0.7652     40532
textcnn           0.8388     0.3041        0.7652     40532
mbert             0.9360     0.8044        0.9317     40532
sphobert          0.9143     0.7378        0.9096     40532
================================================================================
DETAILED RESULTS - PHOBERT-V1
--------------------------------------------------
Model Path: outputs/hate-speech-detection/phobert-v1
Number of Samples: 40532
Accuracy: 0.9421
F1 Macro: 0.8308
F1 Weighted: 0.9394
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9554    0.9868    0.9709    33997.0
OFFENSIVE          0.7910    0.6581    0.7185     2094.0
HATE               0.8866    0.7341    0.8032     4441.0
macro avg          0.8777    0.7930    0.8308    40532.0
weighted avg       0.9394    0.9421    0.9394    40532.0
Confusion Matrix:
[[33548   196   253]
 [  552  1378   164]
 [ 1013   168  3260]]
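As a sanity check, the headline numbers above can be recomputed from the printed confusion matrix alone (rows are true labels, columns are predicted labels, in the order CLEAN, OFFENSIVE, HATE — the orientation consistent with the per-class precision/recall reported). A minimal, dependency-free Python sketch:

```python
# Recompute phobert-v1's metrics from its confusion matrix.
# Rows = true labels, columns = predicted labels (CLEAN, OFFENSIVE, HATE).
cm = [
    [33548, 196, 253],
    [552, 1378, 164],
    [1013, 168, 3260],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

f1s, supports = [], []
for i in range(3):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(3))  # column sum: predicted as class i
    actual = sum(cm[i])                          # row sum: support of class i
    precision, recall = tp / predicted, tp / actual
    f1s.append(2 * precision * recall / (precision + recall))
    supports.append(actual)

f1_macro = sum(f1s) / len(f1s)
f1_weighted = sum(f * s for f, s in zip(f1s, supports)) / total

print(f"Accuracy: {accuracy:.4f}")        # 0.9421
print(f"F1 Macro: {f1_macro:.4f}")        # 0.8308
print(f"F1 Weighted: {f1_weighted:.4f}")  # 0.9394
```

The same recomputation applies to every detailed results section below.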
================================================================================
DETAILED RESULTS - PHOBERT-V2
--------------------------------------------------
Model Path: outputs/hate-speech-detection/phobert-v2
Number of Samples: 40532
Accuracy: 0.9341
F1 Macro: 0.8048
F1 Weighted: 0.9326
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9635    0.9739    0.9687    33997.0
OFFENSIVE          0.7505    0.5903    0.6608     2094.0
HATE               0.7779    0.7919    0.7849     4441.0
macro avg          0.8306    0.7854    0.8048    40532.0
weighted avg       0.9321    0.9341    0.9326    40532.0
Confusion Matrix:
[[33109   219   669]
 [  523  1236   335]
 [  732   192  3517]]
================================================================================
DETAILED RESULTS - BARTPHO
--------------------------------------------------
Model Path: outputs/hate-speech-detection/bartpho
Number of Samples: 40532
Accuracy: 0.8985
F1 Macro: 0.6791
F1 Weighted: 0.8886
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9228    0.9770    0.9491    33997.0
OFFENSIVE          0.6527    0.3563    0.4609     2094.0
HATE               0.7238    0.5535    0.6273     4441.0
macro avg          0.7664    0.6289    0.6791    40532.0
weighted avg       0.8871    0.8985    0.8886    40532.0
Confusion Matrix:
[[33215   235   547]
 [  957   746   391]
 [ 1821   162  2458]]
================================================================================
DETAILED RESULTS - VISOBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/visobert
Number of Samples: 40532
Accuracy: 0.9372
F1 Macro: 0.8241
F1 Weighted: 0.9379
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9714    0.9687    0.9700    33997.0
OFFENSIVE          0.6463    0.7574    0.6974     2094.0
HATE               0.8305    0.7809    0.8049     4441.0
macro avg          0.8160    0.8357    0.8241    40532.0
weighted avg       0.9392    0.9372    0.9379    40532.0
Confusion Matrix:
[[32932   590   475]
 [  275  1586   233]
 [  695   278  3468]]
================================================================================
DETAILED RESULTS - VIHATE-T5
--------------------------------------------------
Model Path: outputs/hate-speech-detection/vihate-t5
Number of Samples: 40532
Accuracy: 0.9551
F1 Macro: 0.8718
F1 Weighted: 0.9535
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9660    0.9883    0.9770    33997.0
OFFENSIVE          0.8788    0.7096    0.7852     2094.0
HATE               0.8931    0.8165    0.8531     4441.0
macro avg          0.9126    0.8381    0.8718    40532.0
weighted avg       0.9535    0.9551    0.9535    40532.0
Confusion Matrix:
[[33599   124   274]
 [  448  1486   160]
 [  734    81  3626]]
================================================================================
DETAILED RESULTS - XLM-R
--------------------------------------------------
Model Path: outputs/hate-speech-detection/xlm-r
Number of Samples: 40532
Accuracy: 0.9203
F1 Macro: 0.7625
F1 Weighted: 0.9177
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9514    0.9733    0.9622    33997.0
OFFENSIVE          0.6284    0.5702    0.5979     2094.0
HATE               0.7834    0.6791    0.7275     4441.0
macro avg          0.7877    0.7409    0.7625    40532.0
weighted avg       0.9163    0.9203    0.9177    40532.0
Confusion Matrix:
[[33090   418   489]
 [  555  1194   345]
 [ 1137   288  3016]]
================================================================================
DETAILED RESULTS - ROBERTA-GRU
--------------------------------------------------
Model Path: outputs/hate-speech-detection/roberta-gru
Number of Samples: 40532
Accuracy: 0.9537
F1 Macro: 0.8716
F1 Weighted: 0.9530
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9711    0.9825    0.9768    33997.0
OFFENSIVE          0.8136    0.7693    0.7909     2094.0
HATE               0.8761    0.8201    0.8472     4441.0
macro avg          0.8870    0.8573    0.8716    40532.0
weighted avg       0.9526    0.9537    0.9530    40532.0
Confusion Matrix:
[[33402   237   358]
 [  326  1611   157]
 [  667   132  3642]]
================================================================================
DETAILED RESULTS - BILSTM
--------------------------------------------------
Model Path: outputs/hate-speech-detection/bilstm
Number of Samples: 40532
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.8388    1.0000    0.9123    33997.0
OFFENSIVE          0.0000    0.0000    0.0000     2094.0
HATE               0.0000    0.0000    0.0000     4441.0
macro avg          0.2796    0.3333    0.3041    40532.0
weighted avg       0.7035    0.8388    0.7652    40532.0
Confusion Matrix:
[[33997     0     0]
 [ 2094     0     0]
 [ 4441     0     0]]
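The identical bilstm and textcnn scores follow directly from the degenerate confusion matrix above: every sample is predicted CLEAN, so the reported metrics are exactly those of a majority-class baseline. A quick dependency-free check:

```python
# Metrics for a degenerate classifier that always predicts CLEAN.
supports = {"CLEAN": 33997, "OFFENSIVE": 2094, "HATE": 4441}
total = sum(supports.values())          # 40532

accuracy = supports["CLEAN"] / total    # the majority-class share

# CLEAN: precision = majority share, recall = 1.0; the other classes score 0.
p_clean = supports["CLEAN"] / total
f1_clean = 2 * p_clean * 1.0 / (p_clean + 1.0)

f1_macro = f1_clean / 3                              # two classes contribute 0
f1_weighted = f1_clean * supports["CLEAN"] / total   # zero-F1 classes drop out

print(f"Accuracy: {accuracy:.4f}")        # 0.8388
print(f"F1 Macro: {f1_macro:.4f}")        # 0.3041
print(f"F1 Weighted: {f1_weighted:.4f}")  # 0.7652
```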
================================================================================
DETAILED RESULTS - TEXTCNN
--------------------------------------------------
Model Path: outputs/hate-speech-detection/textcnn
Number of Samples: 40532
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.8388    1.0000    0.9123    33997.0
OFFENSIVE          0.0000    0.0000    0.0000     2094.0
HATE               0.0000    0.0000    0.0000     4441.0
macro avg          0.2796    0.3333    0.3041    40532.0
weighted avg       0.7035    0.8388    0.7652    40532.0
Confusion Matrix:
[[33997     0     0]
 [ 2094     0     0]
 [ 4441     0     0]]
================================================================================
DETAILED RESULTS - MBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/mbert
Number of Samples: 40532
Accuracy: 0.9360
F1 Macro: 0.8044
F1 Weighted: 0.9317
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9489    0.9876    0.9679    33997.0
OFFENSIVE          0.8645    0.5392    0.6641     2094.0
HATE               0.8416    0.7287    0.7811     4441.0
macro avg          0.8850    0.7518    0.8044    40532.0
weighted avg       0.9328    0.9360    0.9317    40532.0
Confusion Matrix:
[[33574    93   330]
 [  686  1129   279]
 [ 1121    84  3236]]
================================================================================
DETAILED RESULTS - SPHOBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/sphobert
Number of Samples: 40532
Accuracy: 0.9143
F1 Macro: 0.7378
F1 Weighted: 0.9096
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9434    0.9729    0.9579    33997.0
OFFENSIVE          0.6821    0.4508    0.5428     2094.0
HATE               0.7436    0.6843    0.7127     4441.0
macro avg          0.7897    0.7027    0.7378    40532.0
weighted avg       0.9080    0.9143    0.9096    40532.0
Confusion Matrix:
[[33077   253   667]
 [  769   944   381]
 [ 1215   187  3039]]
================================================================================
============================================================
EVALUATION COMPLETED!
============================================================
Successfully evaluated: 11/11 models (bilstm and textcnn scored with dummy majority-class predictions)
Best performing models:
1. vihate-t5: Accuracy=0.9551, F1=0.8718
2. roberta-gru: Accuracy=0.9537, F1=0.8716
3. phobert-v1: Accuracy=0.9421, F1=0.8308
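The top-3 list above can be reproduced by sorting the summary table by accuracy. A minimal sketch (the `results` dict simply transcribes the summary table):

```python
# Rank models by accuracy, as in the "Best performing models" list above.
# Values are (accuracy, F1 macro) pairs from the evaluation summary.
results = {
    "phobert-v1": (0.9421, 0.8308),
    "phobert-v2": (0.9341, 0.8048),
    "bartpho": (0.8985, 0.6791),
    "visobert": (0.9372, 0.8241),
    "vihate-t5": (0.9551, 0.8718),
    "xlm-r": (0.9203, 0.7625),
    "roberta-gru": (0.9537, 0.8716),
    "bilstm": (0.8388, 0.3041),
    "textcnn": (0.8388, 0.3041),
    "mbert": (0.9360, 0.8044),
    "sphobert": (0.9143, 0.7378),
}

ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
for rank, (name, (acc, f1)) in enumerate(ranked[:3], start=1):
    print(f"{rank}. {name}: Accuracy={acc:.4f}, F1={f1:.4f}")
# 1. vihate-t5: Accuracy=0.9551, F1=0.8718
# 2. roberta-gru: Accuracy=0.9537, F1=0.8716
# 3. phobert-v1: Accuracy=0.9421, F1=0.8308
```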