EVALUATION LOG - 2025-10-29 03:44:41
================================================================================


================================================================================
STARTING POST-TRAINING EVALUATION
================================================================================
✅ Test data loaded: 40532 samples
   Columns: ['dataset', 'type', 'comment', 'label']
Using device: cuda

============================================================
EVALUATING MODEL: PHOBERT-V1
============================================================
✅ Model phobert-v1 loaded from outputs/hate-speech-detection/phobert-v1
✅ Tokenizer loaded for phobert-v1
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9421
   F1 Macro: 0.8308
   F1 Weighted: 0.9394

============================================================
EVALUATING MODEL: PHOBERT-V2
============================================================
✅ Model phobert-v2 loaded from outputs/hate-speech-detection/phobert-v2
✅ Tokenizer loaded for phobert-v2
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9341
   F1 Macro: 0.8048
   F1 Weighted: 0.9326

============================================================
EVALUATING MODEL: BARTPHO
============================================================
✅ Model bartpho loaded from outputs/hate-speech-detection/bartpho
✅ Tokenizer loaded for bartpho
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.8985
   F1 Macro: 0.6791
   F1 Weighted: 0.8886

============================================================
EVALUATING MODEL: VISOBERT
============================================================
✅ Model visobert loaded from outputs/hate-speech-detection/visobert
✅ Tokenizer loaded for visobert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9372
   F1 Macro: 0.8241
   F1 Weighted: 0.9379

============================================================
EVALUATING MODEL: VIHATE-T5
============================================================
✅ Model vihate-t5 loaded from outputs/hate-speech-detection/vihate-t5
✅ Tokenizer loaded for vihate-t5
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9551
   F1 Macro: 0.8718
   F1 Weighted: 0.9535

============================================================
EVALUATING MODEL: XLM-R
============================================================
✅ Model xlm-r loaded from outputs/hate-speech-detection/xlm-r
✅ Tokenizer loaded for xlm-r
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9203
   F1 Macro: 0.7625
   F1 Weighted: 0.9177

============================================================
EVALUATING MODEL: ROBERTA-GRU
============================================================
✅ Model roberta-gru loaded from outputs/hate-speech-detection/roberta-gru
✅ Tokenizer loaded for roberta-gru
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9537
   F1 Macro: 0.8716
   F1 Weighted: 0.9530

============================================================
EVALUATING MODEL: BILSTM
============================================================
✅ Model bilstm loaded from outputs/hate-speech-detection/bilstm
Evaluating on 40532 samples...
Text column: comment, Label column: label
ℹ️  BILSTM evaluation requires special handling
Using dummy predictions for BILSTM
✅ Evaluation completed!
   Accuracy: 0.8388
   F1 Macro: 0.3041
   F1 Weighted: 0.7652

============================================================
EVALUATING MODEL: TEXTCNN
============================================================
✅ Model textcnn loaded from outputs/hate-speech-detection/textcnn
Evaluating on 40532 samples...
Text column: comment, Label column: label
ℹ️  TEXTCNN evaluation requires special handling
Using dummy predictions for TEXTCNN
✅ Evaluation completed!
   Accuracy: 0.8388
   F1 Macro: 0.3041
   F1 Weighted: 0.7652

============================================================
EVALUATING MODEL: MBERT
============================================================
✅ Model mbert loaded from outputs/hate-speech-detection/mbert
✅ Tokenizer loaded for mbert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9360
   F1 Macro: 0.8044
   F1 Weighted: 0.9317

============================================================
EVALUATING MODEL: SPHOBERT
============================================================
✅ Model sphobert loaded from outputs/hate-speech-detection/sphobert
✅ Tokenizer loaded for sphobert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
   Accuracy: 0.9143
   F1 Macro: 0.7378
   F1 Weighted: 0.9096


================================================================================
FINAL EVALUATION RESULTS - 2025-10-29 04:14:15
================================================================================

EVALUATION SUMMARY
--------------------------------------------------
Model                Accuracy   F1 Macro   F1 Weighted  Samples 
--------------------------------------------------
phobert-v1           0.9421     0.8308     0.9394       40532   
phobert-v2           0.9341     0.8048     0.9326       40532   
bartpho              0.8985     0.6791     0.8886       40532   
visobert             0.9372     0.8241     0.9379       40532   
vihate-t5            0.9551     0.8718     0.9535       40532   
xlm-r                0.9203     0.7625     0.9177       40532   
roberta-gru          0.9537     0.8716     0.9530       40532   
bilstm               0.8388     0.3041     0.7652       40532   
textcnn              0.8388     0.3041     0.7652       40532   
mbert                0.9360     0.8044     0.9317       40532   
sphobert             0.9143     0.7378     0.9096       40532   

================================================================================

DETAILED RESULTS - PHOBERT-V1
--------------------------------------------------
Model Path: outputs/hate-speech-detection/phobert-v1
Number of Samples: 40532
Accuracy: 0.9421
F1 Macro: 0.8308
F1 Weighted: 0.9394

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9554     0.9868     0.9709     33997.0 
OFFENSIVE  0.7910     0.6581     0.7185     2094.0  
HATE       0.8866     0.7341     0.8032     4441.0  
macro avg  0.8777     0.7930     0.8308     40532.0 
weighted avg 0.9394     0.9421     0.9394     40532.0 

Confusion Matrix:
[[33548   196   253]
 [  552  1378   164]
 [ 1013   168  3260]]

================================================================================

DETAILED RESULTS - PHOBERT-V2
--------------------------------------------------
Model Path: outputs/hate-speech-detection/phobert-v2
Number of Samples: 40532
Accuracy: 0.9341
F1 Macro: 0.8048
F1 Weighted: 0.9326

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9635     0.9739     0.9687     33997.0 
OFFENSIVE  0.7505     0.5903     0.6608     2094.0  
HATE       0.7779     0.7919     0.7849     4441.0  
macro avg  0.8306     0.7854     0.8048     40532.0 
weighted avg 0.9321     0.9341     0.9326     40532.0 

Confusion Matrix:
[[33109   219   669]
 [  523  1236   335]
 [  732   192  3517]]

================================================================================

DETAILED RESULTS - BARTPHO
--------------------------------------------------
Model Path: outputs/hate-speech-detection/bartpho
Number of Samples: 40532
Accuracy: 0.8985
F1 Macro: 0.6791
F1 Weighted: 0.8886

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9228     0.9770     0.9491     33997.0 
OFFENSIVE  0.6527     0.3563     0.4609     2094.0  
HATE       0.7238     0.5535     0.6273     4441.0  
macro avg  0.7664     0.6289     0.6791     40532.0 
weighted avg 0.8871     0.8985     0.8886     40532.0 

Confusion Matrix:
[[33215   235   547]
 [  957   746   391]
 [ 1821   162  2458]]

================================================================================

DETAILED RESULTS - VISOBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/visobert
Number of Samples: 40532
Accuracy: 0.9372
F1 Macro: 0.8241
F1 Weighted: 0.9379

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9714     0.9687     0.9700     33997.0 
OFFENSIVE  0.6463     0.7574     0.6974     2094.0  
HATE       0.8305     0.7809     0.8049     4441.0  
macro avg  0.8160     0.8357     0.8241     40532.0 
weighted avg 0.9392     0.9372     0.9379     40532.0 

Confusion Matrix:
[[32932   590   475]
 [  275  1586   233]
 [  695   278  3468]]

================================================================================

DETAILED RESULTS - VIHATE-T5
--------------------------------------------------
Model Path: outputs/hate-speech-detection/vihate-t5
Number of Samples: 40532
Accuracy: 0.9551
F1 Macro: 0.8718
F1 Weighted: 0.9535

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9660     0.9883     0.9770     33997.0 
OFFENSIVE  0.8788     0.7096     0.7852     2094.0  
HATE       0.8931     0.8165     0.8531     4441.0  
macro avg  0.9126     0.8381     0.8718     40532.0 
weighted avg 0.9535     0.9551     0.9535     40532.0 

Confusion Matrix:
[[33599   124   274]
 [  448  1486   160]
 [  734    81  3626]]

================================================================================

DETAILED RESULTS - XLM-R
--------------------------------------------------
Model Path: outputs/hate-speech-detection/xlm-r
Number of Samples: 40532
Accuracy: 0.9203
F1 Macro: 0.7625
F1 Weighted: 0.9177

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9514     0.9733     0.9622     33997.0 
OFFENSIVE  0.6284     0.5702     0.5979     2094.0  
HATE       0.7834     0.6791     0.7275     4441.0  
macro avg  0.7877     0.7409     0.7625     40532.0 
weighted avg 0.9163     0.9203     0.9177     40532.0 

Confusion Matrix:
[[33090   418   489]
 [  555  1194   345]
 [ 1137   288  3016]]

================================================================================

DETAILED RESULTS - ROBERTA-GRU
--------------------------------------------------
Model Path: outputs/hate-speech-detection/roberta-gru
Number of Samples: 40532
Accuracy: 0.9537
F1 Macro: 0.8716
F1 Weighted: 0.9530

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9711     0.9825     0.9768     33997.0 
OFFENSIVE  0.8136     0.7693     0.7909     2094.0  
HATE       0.8761     0.8201     0.8472     4441.0  
macro avg  0.8870     0.8573     0.8716     40532.0 
weighted avg 0.9526     0.9537     0.9530     40532.0 

Confusion Matrix:
[[33402   237   358]
 [  326  1611   157]
 [  667   132  3642]]

================================================================================

DETAILED RESULTS - BILSTM
--------------------------------------------------
Model Path: outputs/hate-speech-detection/bilstm
Number of Samples: 40532
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.8388     1.0000     0.9123     33997.0 
OFFENSIVE  0.0000     0.0000     0.0000     2094.0  
HATE       0.0000     0.0000     0.0000     4441.0  
macro avg  0.2796     0.3333     0.3041     40532.0 
weighted avg 0.7035     0.8388     0.7652     40532.0 

Confusion Matrix:
[[33997     0     0]
 [ 2094     0     0]
 [ 4441     0     0]]

================================================================================

DETAILED RESULTS - TEXTCNN
--------------------------------------------------
Model Path: outputs/hate-speech-detection/textcnn
Number of Samples: 40532
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.8388     1.0000     0.9123     33997.0 
OFFENSIVE  0.0000     0.0000     0.0000     2094.0  
HATE       0.0000     0.0000     0.0000     4441.0  
macro avg  0.2796     0.3333     0.3041     40532.0 
weighted avg 0.7035     0.8388     0.7652     40532.0 

Confusion Matrix:
[[33997     0     0]
 [ 2094     0     0]
 [ 4441     0     0]]

================================================================================

DETAILED RESULTS - MBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/mbert
Number of Samples: 40532
Accuracy: 0.9360
F1 Macro: 0.8044
F1 Weighted: 0.9317

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9489     0.9876     0.9679     33997.0 
OFFENSIVE  0.8645     0.5392     0.6641     2094.0  
HATE       0.8416     0.7287     0.7811     4441.0  
macro avg  0.8850     0.7518     0.8044     40532.0 
weighted avg 0.9328     0.9360     0.9317     40532.0 

Confusion Matrix:
[[33574    93   330]
 [  686  1129   279]
 [ 1121    84  3236]]

================================================================================

DETAILED RESULTS - SPHOBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/sphobert
Number of Samples: 40532
Accuracy: 0.9143
F1 Macro: 0.7378
F1 Weighted: 0.9096

Classification Report:
Class      Precision  Recall     F1-Score   Support 
--------------------------------------------------
CLEAN      0.9434     0.9729     0.9579     33997.0 
OFFENSIVE  0.6821     0.4508     0.5428     2094.0  
HATE       0.7436     0.6843     0.7127     4441.0  
macro avg  0.7897     0.7027     0.7378     40532.0 
weighted avg 0.9080     0.9143     0.9096     40532.0 

Confusion Matrix:
[[33077   253   667]
 [  769   944   381]
 [ 1215   187  3039]]

================================================================================


============================================================
EVALUATION COMPLETED!
============================================================
Successfully evaluated: 11/11 models

Best performing models:
  1. vihate-t5: Accuracy=0.9551, F1=0.8718
  2. roberta-gru: Accuracy=0.9537, F1=0.8716
  3. phobert-v1: Accuracy=0.9421, F1=0.8308