bartpho-hsd / evaluation_log_bartpho.txt
EVALUATION LOG - 2025-10-29 03:44:41
================================================================================
================================================================================
STARTING POST-TRAINING EVALUATION
================================================================================
✅ Test data loaded: 40532 samples
Columns: ['dataset', 'type', 'comment', 'label']
Using device: cuda
============================================================
EVALUATING MODEL: PHOBERT-V1
============================================================
✅ Model phobert-v1 loaded from outputs/hate-speech-detection/phobert-v1
✅ Tokenizer loaded for phobert-v1
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9421
F1 Macro: 0.8308
F1 Weighted: 0.9394
============================================================
EVALUATING MODEL: PHOBERT-V2
============================================================
✅ Model phobert-v2 loaded from outputs/hate-speech-detection/phobert-v2
✅ Tokenizer loaded for phobert-v2
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9341
F1 Macro: 0.8048
F1 Weighted: 0.9326
============================================================
EVALUATING MODEL: BARTPHO
============================================================
✅ Model bartpho loaded from outputs/hate-speech-detection/bartpho
✅ Tokenizer loaded for bartpho
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.8985
F1 Macro: 0.6791
F1 Weighted: 0.8886
============================================================
EVALUATING MODEL: VISOBERT
============================================================
✅ Model visobert loaded from outputs/hate-speech-detection/visobert
✅ Tokenizer loaded for visobert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9372
F1 Macro: 0.8241
F1 Weighted: 0.9379
============================================================
EVALUATING MODEL: VIHATE-T5
============================================================
✅ Model vihate-t5 loaded from outputs/hate-speech-detection/vihate-t5
✅ Tokenizer loaded for vihate-t5
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9551
F1 Macro: 0.8718
F1 Weighted: 0.9535
============================================================
EVALUATING MODEL: XLM-R
============================================================
✅ Model xlm-r loaded from outputs/hate-speech-detection/xlm-r
✅ Tokenizer loaded for xlm-r
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9203
F1 Macro: 0.7625
F1 Weighted: 0.9177
============================================================
EVALUATING MODEL: ROBERTA-GRU
============================================================
✅ Model roberta-gru loaded from outputs/hate-speech-detection/roberta-gru
✅ Tokenizer loaded for roberta-gru
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9537
F1 Macro: 0.8716
F1 Weighted: 0.9530
============================================================
EVALUATING MODEL: BILSTM
============================================================
✅ Model bilstm loaded from outputs/hate-speech-detection/bilstm
Evaluating on 40532 samples...
Text column: comment, Label column: label
ℹ️ BILSTM evaluation requires special handling
Using dummy (majority-class) predictions for BILSTM
✅ Evaluation completed!
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
============================================================
EVALUATING MODEL: TEXTCNN
============================================================
✅ Model textcnn loaded from outputs/hate-speech-detection/textcnn
Evaluating on 40532 samples...
Text column: comment, Label column: label
ℹ️ TEXTCNN evaluation requires special handling
Using dummy (majority-class) predictions for TEXTCNN
✅ Evaluation completed!
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
============================================================
EVALUATING MODEL: MBERT
============================================================
✅ Model mbert loaded from outputs/hate-speech-detection/mbert
✅ Tokenizer loaded for mbert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9360
F1 Macro: 0.8044
F1 Weighted: 0.9317
============================================================
EVALUATING MODEL: SPHOBERT
============================================================
✅ Model sphobert loaded from outputs/hate-speech-detection/sphobert
✅ Tokenizer loaded for sphobert
Evaluating on 40532 samples...
Text column: comment, Label column: label
✅ Evaluation completed!
Accuracy: 0.9143
F1 Macro: 0.7378
F1 Weighted: 0.9096
================================================================================
FINAL EVALUATION RESULTS - 2025-10-29 04:14:15
================================================================================
EVALUATION SUMMARY
----------------------------------------------------------
Model           Accuracy   F1 Macro   F1 Weighted   Samples
----------------------------------------------------------
phobert-v1        0.9421     0.8308        0.9394     40532
phobert-v2        0.9341     0.8048        0.9326     40532
bartpho           0.8985     0.6791        0.8886     40532
visobert          0.9372     0.8241        0.9379     40532
vihate-t5         0.9551     0.8718        0.9535     40532
xlm-r             0.9203     0.7625        0.9177     40532
roberta-gru       0.9537     0.8716        0.9530     40532
bilstm            0.8388     0.3041        0.7652     40532
textcnn           0.8388     0.3041        0.7652     40532
mbert             0.9360     0.8044        0.9317     40532
sphobert          0.9143     0.7378        0.9096     40532
================================================================================
DETAILED RESULTS - PHOBERT-V1
--------------------------------------------------
Model Path: outputs/hate-speech-detection/phobert-v1
Number of Samples: 40532
Accuracy: 0.9421
F1 Macro: 0.8308
F1 Weighted: 0.9394
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9554    0.9868    0.9709    33997.0
OFFENSIVE          0.7910    0.6581    0.7185     2094.0
HATE               0.8866    0.7341    0.8032     4441.0
macro avg          0.8777    0.7930    0.8308    40532.0
weighted avg       0.9394    0.9421    0.9394    40532.0
Confusion Matrix:
[[33548   196   253]
 [  552  1378   164]
 [ 1013   168  3260]]
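As a sanity check, the headline numbers above can be recomputed from the printed confusion matrix alone (rows are true labels, columns are predicted labels, in the order CLEAN, OFFENSIVE, HATE — the orientation consistent with the per-class precision/recall reported). A minimal, dependency-free Python sketch:

```python
# Recompute phobert-v1's metrics from its confusion matrix.
# Rows = true labels, columns = predicted labels (CLEAN, OFFENSIVE, HATE).
cm = [
    [33548, 196, 253],
    [552, 1378, 164],
    [1013, 168, 3260],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

f1s, supports = [], []
for i in range(3):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(3))  # column sum: predicted as class i
    actual = sum(cm[i])                          # row sum: support of class i
    precision, recall = tp / predicted, tp / actual
    f1s.append(2 * precision * recall / (precision + recall))
    supports.append(actual)

f1_macro = sum(f1s) / len(f1s)
f1_weighted = sum(f * s for f, s in zip(f1s, supports)) / total

print(f"Accuracy: {accuracy:.4f}")        # 0.9421
print(f"F1 Macro: {f1_macro:.4f}")        # 0.8308
print(f"F1 Weighted: {f1_weighted:.4f}")  # 0.9394
```

The same recomputation applies to every detailed results section below.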
================================================================================
DETAILED RESULTS - PHOBERT-V2
--------------------------------------------------
Model Path: outputs/hate-speech-detection/phobert-v2
Number of Samples: 40532
Accuracy: 0.9341
F1 Macro: 0.8048
F1 Weighted: 0.9326
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9635    0.9739    0.9687    33997.0
OFFENSIVE          0.7505    0.5903    0.6608     2094.0
HATE               0.7779    0.7919    0.7849     4441.0
macro avg          0.8306    0.7854    0.8048    40532.0
weighted avg       0.9321    0.9341    0.9326    40532.0
Confusion Matrix:
[[33109   219   669]
 [  523  1236   335]
 [  732   192  3517]]
================================================================================
DETAILED RESULTS - BARTPHO
--------------------------------------------------
Model Path: outputs/hate-speech-detection/bartpho
Number of Samples: 40532
Accuracy: 0.8985
F1 Macro: 0.6791
F1 Weighted: 0.8886
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9228    0.9770    0.9491    33997.0
OFFENSIVE          0.6527    0.3563    0.4609     2094.0
HATE               0.7238    0.5535    0.6273     4441.0
macro avg          0.7664    0.6289    0.6791    40532.0
weighted avg       0.8871    0.8985    0.8886    40532.0
Confusion Matrix:
[[33215   235   547]
 [  957   746   391]
 [ 1821   162  2458]]
================================================================================
DETAILED RESULTS - VISOBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/visobert
Number of Samples: 40532
Accuracy: 0.9372
F1 Macro: 0.8241
F1 Weighted: 0.9379
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9714    0.9687    0.9700    33997.0
OFFENSIVE          0.6463    0.7574    0.6974     2094.0
HATE               0.8305    0.7809    0.8049     4441.0
macro avg          0.8160    0.8357    0.8241    40532.0
weighted avg       0.9392    0.9372    0.9379    40532.0
Confusion Matrix:
[[32932   590   475]
 [  275  1586   233]
 [  695   278  3468]]
================================================================================
DETAILED RESULTS - VIHATE-T5
--------------------------------------------------
Model Path: outputs/hate-speech-detection/vihate-t5
Number of Samples: 40532
Accuracy: 0.9551
F1 Macro: 0.8718
F1 Weighted: 0.9535
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9660    0.9883    0.9770    33997.0
OFFENSIVE          0.8788    0.7096    0.7852     2094.0
HATE               0.8931    0.8165    0.8531     4441.0
macro avg          0.9126    0.8381    0.8718    40532.0
weighted avg       0.9535    0.9551    0.9535    40532.0
Confusion Matrix:
[[33599   124   274]
 [  448  1486   160]
 [  734    81  3626]]
================================================================================
DETAILED RESULTS - XLM-R
--------------------------------------------------
Model Path: outputs/hate-speech-detection/xlm-r
Number of Samples: 40532
Accuracy: 0.9203
F1 Macro: 0.7625
F1 Weighted: 0.9177
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9514    0.9733    0.9622    33997.0
OFFENSIVE          0.6284    0.5702    0.5979     2094.0
HATE               0.7834    0.6791    0.7275     4441.0
macro avg          0.7877    0.7409    0.7625    40532.0
weighted avg       0.9163    0.9203    0.9177    40532.0
Confusion Matrix:
[[33090   418   489]
 [  555  1194   345]
 [ 1137   288  3016]]
================================================================================
DETAILED RESULTS - ROBERTA-GRU
--------------------------------------------------
Model Path: outputs/hate-speech-detection/roberta-gru
Number of Samples: 40532
Accuracy: 0.9537
F1 Macro: 0.8716
F1 Weighted: 0.9530
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9711    0.9825    0.9768    33997.0
OFFENSIVE          0.8136    0.7693    0.7909     2094.0
HATE               0.8761    0.8201    0.8472     4441.0
macro avg          0.8870    0.8573    0.8716    40532.0
weighted avg       0.9526    0.9537    0.9530    40532.0
Confusion Matrix:
[[33402   237   358]
 [  326  1611   157]
 [  667   132  3642]]
================================================================================
DETAILED RESULTS - BILSTM
--------------------------------------------------
Model Path: outputs/hate-speech-detection/bilstm
Number of Samples: 40532
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.8388    1.0000    0.9123    33997.0
OFFENSIVE          0.0000    0.0000    0.0000     2094.0
HATE               0.0000    0.0000    0.0000     4441.0
macro avg          0.2796    0.3333    0.3041    40532.0
weighted avg       0.7035    0.8388    0.7652    40532.0
Confusion Matrix:
[[33997     0     0]
 [ 2094     0     0]
 [ 4441     0     0]]
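The identical bilstm and textcnn scores follow directly from the degenerate confusion matrix above: every sample is predicted CLEAN, so the reported metrics are exactly those of a majority-class baseline. A quick dependency-free check:

```python
# Metrics for a degenerate classifier that always predicts CLEAN.
supports = {"CLEAN": 33997, "OFFENSIVE": 2094, "HATE": 4441}
total = sum(supports.values())          # 40532

accuracy = supports["CLEAN"] / total    # the majority-class share

# CLEAN: precision = majority share, recall = 1.0; the other classes score 0.
p_clean = supports["CLEAN"] / total
f1_clean = 2 * p_clean * 1.0 / (p_clean + 1.0)

f1_macro = f1_clean / 3                              # two classes contribute 0
f1_weighted = f1_clean * supports["CLEAN"] / total   # zero-F1 classes drop out

print(f"Accuracy: {accuracy:.4f}")        # 0.8388
print(f"F1 Macro: {f1_macro:.4f}")        # 0.3041
print(f"F1 Weighted: {f1_weighted:.4f}")  # 0.7652
```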
================================================================================
DETAILED RESULTS - TEXTCNN
--------------------------------------------------
Model Path: outputs/hate-speech-detection/textcnn
Number of Samples: 40532
Accuracy: 0.8388
F1 Macro: 0.3041
F1 Weighted: 0.7652
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.8388    1.0000    0.9123    33997.0
OFFENSIVE          0.0000    0.0000    0.0000     2094.0
HATE               0.0000    0.0000    0.0000     4441.0
macro avg          0.2796    0.3333    0.3041    40532.0
weighted avg       0.7035    0.8388    0.7652    40532.0
Confusion Matrix:
[[33997     0     0]
 [ 2094     0     0]
 [ 4441     0     0]]
================================================================================
DETAILED RESULTS - MBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/mbert
Number of Samples: 40532
Accuracy: 0.9360
F1 Macro: 0.8044
F1 Weighted: 0.9317
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9489    0.9876    0.9679    33997.0
OFFENSIVE          0.8645    0.5392    0.6641     2094.0
HATE               0.8416    0.7287    0.7811     4441.0
macro avg          0.8850    0.7518    0.8044    40532.0
weighted avg       0.9328    0.9360    0.9317    40532.0
Confusion Matrix:
[[33574    93   330]
 [  686  1129   279]
 [ 1121    84  3236]]
================================================================================
DETAILED RESULTS - SPHOBERT
--------------------------------------------------
Model Path: outputs/hate-speech-detection/sphobert
Number of Samples: 40532
Accuracy: 0.9143
F1 Macro: 0.7378
F1 Weighted: 0.9096
Classification Report:
Class           Precision    Recall  F1-Score    Support
--------------------------------------------------------
CLEAN              0.9434    0.9729    0.9579    33997.0
OFFENSIVE          0.6821    0.4508    0.5428     2094.0
HATE               0.7436    0.6843    0.7127     4441.0
macro avg          0.7897    0.7027    0.7378    40532.0
weighted avg       0.9080    0.9143    0.9096    40532.0
Confusion Matrix:
[[33077   253   667]
 [  769   944   381]
 [ 1215   187  3039]]
================================================================================
============================================================
EVALUATION COMPLETED!
============================================================
Successfully evaluated: 11/11 models (bilstm and textcnn scored with dummy majority-class predictions)
Best performing models:
1. vihate-t5: Accuracy=0.9551, F1=0.8718
2. roberta-gru: Accuracy=0.9537, F1=0.8716
3. phobert-v1: Accuracy=0.9421, F1=0.8308
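The top-3 list above can be reproduced by sorting the summary table by accuracy. A minimal sketch (the `results` dict simply transcribes the summary table):

```python
# Rank models by accuracy, as in the "Best performing models" list above.
# Values are (accuracy, F1 macro) pairs from the evaluation summary.
results = {
    "phobert-v1": (0.9421, 0.8308),
    "phobert-v2": (0.9341, 0.8048),
    "bartpho": (0.8985, 0.6791),
    "visobert": (0.9372, 0.8241),
    "vihate-t5": (0.9551, 0.8718),
    "xlm-r": (0.9203, 0.7625),
    "roberta-gru": (0.9537, 0.8716),
    "bilstm": (0.8388, 0.3041),
    "textcnn": (0.8388, 0.3041),
    "mbert": (0.9360, 0.8044),
    "sphobert": (0.9143, 0.7378),
}

ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
for rank, (name, (acc, f1)) in enumerate(ranked[:3], start=1):
    print(f"{rank}. {name}: Accuracy={acc:.4f}, F1={f1:.4f}")
# 1. vihate-t5: Accuracy=0.9551, F1=0.8718
# 2. roberta-gru: Accuracy=0.9537, F1=0.8716
# 3. phobert-v1: Accuracy=0.9421, F1=0.8308
```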