AnnyNguyen committed
Commit 5c3fb6f (verified)
Parent(s): 67f1fc7

Delete evaluation_log_bartpho.txt with huggingface_hub

Files changed (1)
  1. evaluation_log_bartpho.txt +0 -443
evaluation_log_bartpho.txt DELETED
@@ -1,443 +0,0 @@
- EVALUATION LOG - 2025-10-29 03:44:41
- ================================================================================
-
-
-
- ================================================================================
- STARTING POST-TRAINING EVALUATION
- ================================================================================
- ✅ Test data loaded: 40532 samples
- Columns: ['dataset', 'type', 'comment', 'label']
- Using device: cuda
-
- ============================================================
- EVALUATING MODEL: PHOBERT-V1
- ============================================================
- ✅ Model phobert-v1 loaded from outputs/hate-speech-detection/phobert-v1
- ✅ Tokenizer loaded for phobert-v1
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9421
- F1 Macro: 0.8308
- F1 Weighted: 0.9394
-
- ============================================================
- EVALUATING MODEL: PHOBERT-V2
- ============================================================
- ✅ Model phobert-v2 loaded from outputs/hate-speech-detection/phobert-v2
- ✅ Tokenizer loaded for phobert-v2
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9341
- F1 Macro: 0.8048
- F1 Weighted: 0.9326
-
- ============================================================
- EVALUATING MODEL: BARTPHO
- ============================================================
- ✅ Model bartpho loaded from outputs/hate-speech-detection/bartpho
- ✅ Tokenizer loaded for bartpho
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.8985
- F1 Macro: 0.6791
- F1 Weighted: 0.8886
-
- ============================================================
- EVALUATING MODEL: VISOBERT
- ============================================================
- ✅ Model visobert loaded from outputs/hate-speech-detection/visobert
- ✅ Tokenizer loaded for visobert
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9372
- F1 Macro: 0.8241
- F1 Weighted: 0.9379
-
- ============================================================
- EVALUATING MODEL: VIHATE-T5
- ============================================================
- ✅ Model vihate-t5 loaded from outputs/hate-speech-detection/vihate-t5
- ✅ Tokenizer loaded for vihate-t5
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9551
- F1 Macro: 0.8718
- F1 Weighted: 0.9535
-
- ============================================================
- EVALUATING MODEL: XLM-R
- ============================================================
- ✅ Model xlm-r loaded from outputs/hate-speech-detection/xlm-r
- ✅ Tokenizer loaded for xlm-r
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9203
- F1 Macro: 0.7625
- F1 Weighted: 0.9177
-
- ============================================================
- EVALUATING MODEL: ROBERTA-GRU
- ============================================================
- ✅ Model roberta-gru loaded from outputs/hate-speech-detection/roberta-gru
- ✅ Tokenizer loaded for roberta-gru
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9537
- F1 Macro: 0.8716
- F1 Weighted: 0.9530
-
- ============================================================
- EVALUATING MODEL: BILSTM
- ============================================================
- ✅ Model bilstm loaded from outputs/hate-speech-detection/bilstm
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ℹ️ BILSTM evaluation requires special handling
- Using dummy predictions for BILSTM
- ✅ Evaluation completed!
- Accuracy: 0.8388
- F1 Macro: 0.3041
- F1 Weighted: 0.7652
-
- ============================================================
- EVALUATING MODEL: TEXTCNN
- ============================================================
- ✅ Model textcnn loaded from outputs/hate-speech-detection/textcnn
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ℹ️ TEXTCNN evaluation requires special handling
- Using dummy predictions for TEXTCNN
- ✅ Evaluation completed!
- Accuracy: 0.8388
- F1 Macro: 0.3041
- F1 Weighted: 0.7652
-
- ============================================================
- EVALUATING MODEL: MBERT
- ============================================================
- ✅ Model mbert loaded from outputs/hate-speech-detection/mbert
- ✅ Tokenizer loaded for mbert
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9360
- F1 Macro: 0.8044
- F1 Weighted: 0.9317
-
- ============================================================
- EVALUATING MODEL: SPHOBERT
- ============================================================
- ✅ Model sphobert loaded from outputs/hate-speech-detection/sphobert
- ✅ Tokenizer loaded for sphobert
- Evaluating on 40532 samples...
- Text column: comment, Label column: label
- ✅ Evaluation completed!
- Accuracy: 0.9143
- F1 Macro: 0.7378
- F1 Weighted: 0.9096
-
-
- ================================================================================
- FINAL EVALUATION RESULTS - 2025-10-29 04:14:15
- ================================================================================
-
- EVALUATION SUMMARY
- --------------------------------------------------
- Model         Accuracy   F1 Macro   F1 Weighted   Samples
- --------------------------------------------------
- phobert-v1    0.9421     0.8308     0.9394        40532
- phobert-v2    0.9341     0.8048     0.9326        40532
- bartpho       0.8985     0.6791     0.8886        40532
- visobert      0.9372     0.8241     0.9379        40532
- vihate-t5     0.9551     0.8718     0.9535        40532
- xlm-r         0.9203     0.7625     0.9177        40532
- roberta-gru   0.9537     0.8716     0.9530        40532
- bilstm        0.8388     0.3041     0.7652        40532
- textcnn       0.8388     0.3041     0.7652        40532
- mbert         0.9360     0.8044     0.9317        40532
- sphobert      0.9143     0.7378     0.9096        40532
-
- ================================================================================
-
- DETAILED RESULTS - PHOBERT-V1
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/phobert-v1
- Number of Samples: 40532
- Accuracy: 0.9421
- F1 Macro: 0.8308
- F1 Weighted: 0.9394
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9554      0.9868   0.9709     33997.0
- OFFENSIVE      0.7910      0.6581   0.7185      2094.0
- HATE           0.8866      0.7341   0.8032      4441.0
- macro avg      0.8777      0.7930   0.8308     40532.0
- weighted avg   0.9394      0.9421   0.9394     40532.0
-
- Confusion Matrix:
- [[33548   196   253]
-  [  552  1378   164]
-  [ 1013   168  3260]]
-
- ================================================================================
-
- DETAILED RESULTS - PHOBERT-V2
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/phobert-v2
- Number of Samples: 40532
- Accuracy: 0.9341
- F1 Macro: 0.8048
- F1 Weighted: 0.9326
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9635      0.9739   0.9687     33997.0
- OFFENSIVE      0.7505      0.5903   0.6608      2094.0
- HATE           0.7779      0.7919   0.7849      4441.0
- macro avg      0.8306      0.7854   0.8048     40532.0
- weighted avg   0.9321      0.9341   0.9326     40532.0
-
- Confusion Matrix:
- [[33109   219   669]
-  [  523  1236   335]
-  [  732   192  3517]]
-
- ================================================================================
-
- DETAILED RESULTS - BARTPHO
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/bartpho
- Number of Samples: 40532
- Accuracy: 0.8985
- F1 Macro: 0.6791
- F1 Weighted: 0.8886
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9228      0.9770   0.9491     33997.0
- OFFENSIVE      0.6527      0.3563   0.4609      2094.0
- HATE           0.7238      0.5535   0.6273      4441.0
- macro avg      0.7664      0.6289   0.6791     40532.0
- weighted avg   0.8871      0.8985   0.8886     40532.0
-
- Confusion Matrix:
- [[33215   235   547]
-  [  957   746   391]
-  [ 1821   162  2458]]
-
- ================================================================================
-
- DETAILED RESULTS - VISOBERT
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/visobert
- Number of Samples: 40532
- Accuracy: 0.9372
- F1 Macro: 0.8241
- F1 Weighted: 0.9379
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9714      0.9687   0.9700     33997.0
- OFFENSIVE      0.6463      0.7574   0.6974      2094.0
- HATE           0.8305      0.7809   0.8049      4441.0
- macro avg      0.8160      0.8357   0.8241     40532.0
- weighted avg   0.9392      0.9372   0.9379     40532.0
-
- Confusion Matrix:
- [[32932   590   475]
-  [  275  1586   233]
-  [  695   278  3468]]
-
- ================================================================================
-
- DETAILED RESULTS - VIHATE-T5
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/vihate-t5
- Number of Samples: 40532
- Accuracy: 0.9551
- F1 Macro: 0.8718
- F1 Weighted: 0.9535
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9660      0.9883   0.9770     33997.0
- OFFENSIVE      0.8788      0.7096   0.7852      2094.0
- HATE           0.8931      0.8165   0.8531      4441.0
- macro avg      0.9126      0.8381   0.8718     40532.0
- weighted avg   0.9535      0.9551   0.9535     40532.0
-
- Confusion Matrix:
- [[33599   124   274]
-  [  448  1486   160]
-  [  734    81  3626]]
-
- ================================================================================
-
- DETAILED RESULTS - XLM-R
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/xlm-r
- Number of Samples: 40532
- Accuracy: 0.9203
- F1 Macro: 0.7625
- F1 Weighted: 0.9177
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9514      0.9733   0.9622     33997.0
- OFFENSIVE      0.6284      0.5702   0.5979      2094.0
- HATE           0.7834      0.6791   0.7275      4441.0
- macro avg      0.7877      0.7409   0.7625     40532.0
- weighted avg   0.9163      0.9203   0.9177     40532.0
-
- Confusion Matrix:
- [[33090   418   489]
-  [  555  1194   345]
-  [ 1137   288  3016]]
-
- ================================================================================
-
- DETAILED RESULTS - ROBERTA-GRU
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/roberta-gru
- Number of Samples: 40532
- Accuracy: 0.9537
- F1 Macro: 0.8716
- F1 Weighted: 0.9530
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9711      0.9825   0.9768     33997.0
- OFFENSIVE      0.8136      0.7693   0.7909      2094.0
- HATE           0.8761      0.8201   0.8472      4441.0
- macro avg      0.8870      0.8573   0.8716     40532.0
- weighted avg   0.9526      0.9537   0.9530     40532.0
-
- Confusion Matrix:
- [[33402   237   358]
-  [  326  1611   157]
-  [  667   132  3642]]
-
- ================================================================================
-
- DETAILED RESULTS - BILSTM
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/bilstm
- Number of Samples: 40532
- Accuracy: 0.8388
- F1 Macro: 0.3041
- F1 Weighted: 0.7652
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.8388      1.0000   0.9123     33997.0
- OFFENSIVE      0.0000      0.0000   0.0000      2094.0
- HATE           0.0000      0.0000   0.0000      4441.0
- macro avg      0.2796      0.3333   0.3041     40532.0
- weighted avg   0.7035      0.8388   0.7652     40532.0
-
- Confusion Matrix:
- [[33997     0     0]
-  [ 2094     0     0]
-  [ 4441     0     0]]
-
- ================================================================================
-
- DETAILED RESULTS - TEXTCNN
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/textcnn
- Number of Samples: 40532
- Accuracy: 0.8388
- F1 Macro: 0.3041
- F1 Weighted: 0.7652
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.8388      1.0000   0.9123     33997.0
- OFFENSIVE      0.0000      0.0000   0.0000      2094.0
- HATE           0.0000      0.0000   0.0000      4441.0
- macro avg      0.2796      0.3333   0.3041     40532.0
- weighted avg   0.7035      0.8388   0.7652     40532.0
-
- Confusion Matrix:
- [[33997     0     0]
-  [ 2094     0     0]
-  [ 4441     0     0]]
-
- ================================================================================
-
- DETAILED RESULTS - MBERT
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/mbert
- Number of Samples: 40532
- Accuracy: 0.9360
- F1 Macro: 0.8044
- F1 Weighted: 0.9317
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9489      0.9876   0.9679     33997.0
- OFFENSIVE      0.8645      0.5392   0.6641      2094.0
- HATE           0.8416      0.7287   0.7811      4441.0
- macro avg      0.8850      0.7518   0.8044     40532.0
- weighted avg   0.9328      0.9360   0.9317     40532.0
-
- Confusion Matrix:
- [[33574    93   330]
-  [  686  1129   279]
-  [ 1121    84  3236]]
-
- ================================================================================
-
- DETAILED RESULTS - SPHOBERT
- --------------------------------------------------
- Model Path: outputs/hate-speech-detection/sphobert
- Number of Samples: 40532
- Accuracy: 0.9143
- F1 Macro: 0.7378
- F1 Weighted: 0.9096
-
- Classification Report:
- Class          Precision   Recall   F1-Score   Support
- --------------------------------------------------
- CLEAN          0.9434      0.9729   0.9579     33997.0
- OFFENSIVE      0.6821      0.4508   0.5428      2094.0
- HATE           0.7436      0.6843   0.7127      4441.0
- macro avg      0.7897      0.7027   0.7378     40532.0
- weighted avg   0.9080      0.9143   0.9096     40532.0
-
- Confusion Matrix:
- [[33077   253   667]
-  [  769   944   381]
-  [ 1215   187  3039]]
-
- ================================================================================
-
-
- ============================================================
- EVALUATION COMPLETED!
- ============================================================
- Successfully evaluated: 11/11 models
-
- Best performing models:
- 1. vihate-t5: Accuracy=0.9551, F1=0.8718
- 2. roberta-gru: Accuracy=0.9537, F1=0.8716
- 3. phobert-v1: Accuracy=0.9421, F1=0.8308
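The headline metrics in the deleted log are internally consistent with its confusion matrices. As a sanity check, here is a minimal sketch (assuming scikit-learn's standard metric conventions; the log does not say which library produced the numbers) that re-derives the bilstm/textcnn dummy-baseline figures, where every one of the 40532 samples was predicted CLEAN:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Confusion matrix reported for bilstm/textcnn: rows = true class
# (CLEAN, OFFENSIVE, HATE), columns = predicted class. Every sample
# was predicted CLEAN (column 0), i.e. a majority-class baseline.
cm = np.array([[33997, 0, 0],
               [ 2094, 0, 0],
               [ 4441, 0, 0]])

# Expand the matrix back into flat per-sample label arrays.
y_true = np.repeat(np.arange(3), cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(np.arange(3), row) for row in cm])

print(round(accuracy_score(y_true, y_pred), 4))                                 # 0.8388
print(round(f1_score(y_true, y_pred, average="macro", zero_division=0), 4))     # 0.3041
print(round(f1_score(y_true, y_pred, average="weighted", zero_division=0), 4))  # 0.7652
```

This matches the logged Accuracy 0.8388, F1 Macro 0.3041, and F1 Weighted 0.7652, and explains why macro F1 collapses (two of three classes get F1 = 0) while accuracy stays high on the imbalanced test set.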