openchat-3.5-0106_eval/llmtf_eval_k1/evaluation_log.txt
INFO: 2024-07-12 12:01:15,409: llmtf.base.evaluator: Starting eval on ['darumeru/multiq', 'darumeru/parus', 'darumeru/rcb', 'darumeru/ruopenbookqa', 'darumeru/rutie', 'darumeru/ruworldtree', 'darumeru/rwsd', 'darumeru/use', 'russiannlp/rucola_custom']
INFO: 2024-07-12 12:01:15,410: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:01:15,410: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:01:16,412: llmtf.base.evaluator: Starting eval on ['darumeru/rummlu']
INFO: 2024-07-12 12:01:16,412: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:01:16,412: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:01:17,736: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu']
INFO: 2024-07-12 12:01:17,737: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000]
INFO: 2024-07-12 12:01:17,737: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-12 12:01:19,755: llmtf.base.darumeru/MultiQ: Loading Dataset: 4.34s
INFO: 2024-07-12 12:01:20,121: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu']
INFO: 2024-07-12 12:01:20,121: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000]
INFO: 2024-07-12 12:01:20,121: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-12 12:01:21,970: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive']
INFO: 2024-07-12 12:01:21,970: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:01:21,970: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:01:23,966: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive']
INFO: 2024-07-12 12:01:23,969: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000]
INFO: 2024-07-12 12:01:23,970: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-12 12:01:25,583: llmtf.base.evaluator: Starting eval on ['darumeru/cp_sent_ru', 'darumeru/cp_sent_en', 'darumeru/cp_para_ru', 'darumeru/cp_para_en']
INFO: 2024-07-12 12:01:25,589: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:01:25,589: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:01:27,070: llmtf.base.daru/treewayabstractive: Loading Dataset: 5.10s
INFO: 2024-07-12 12:01:28,289: llmtf.base.darumeru/cp_sent_ru: Loading Dataset: 2.70s
INFO: 2024-07-12 12:01:28,621: llmtf.base.darumeru/ruMMLU: Loading Dataset: 12.21s
INFO: 2024-07-12 12:01:31,710: llmtf.base.daru/treewayextractive: Loading Dataset: 7.74s
INFO: 2024-07-12 12:03:37,385: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 139.65s
INFO: 2024-07-12 12:03:39,839: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 139.72s
INFO: 2024-07-12 12:07:14,847: llmtf.base.darumeru/MultiQ: Processing Dataset: 355.09s
INFO: 2024-07-12 12:07:14,849: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ:
INFO: 2024-07-12 12:07:14,853: llmtf.base.darumeru/MultiQ: {'f1': 0.5552794495909491, 'em': 0.4751434034416826}
INFO: 2024-07-12 12:07:14,860: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:07:14,860: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:07:17,397: llmtf.base.darumeru/PARus: Loading Dataset: 2.54s
INFO: 2024-07-12 12:07:26,411: llmtf.base.darumeru/PARus: Processing Dataset: 9.01s
INFO: 2024-07-12 12:07:26,412: llmtf.base.darumeru/PARus: Results for darumeru/PARus:
INFO: 2024-07-12 12:07:26,436: llmtf.base.darumeru/PARus: {'acc': 0.84}
INFO: 2024-07-12 12:07:26,437: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:07:26,437: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:07:28,524: llmtf.base.darumeru/RCB: Loading Dataset: 2.09s
INFO: 2024-07-12 12:07:40,423: llmtf.base.darumeru/RCB: Processing Dataset: 11.88s
INFO: 2024-07-12 12:07:40,425: llmtf.base.darumeru/RCB: Results for darumeru/RCB:
INFO: 2024-07-12 12:07:40,431: llmtf.base.darumeru/RCB: {'acc': 0.5181818181818182, 'f1_macro': 0.4444347650097234}
INFO: 2024-07-12 12:07:40,432: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:07:40,432: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:07:44,279: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 3.85s
INFO: 2024-07-12 12:09:26,905: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 102.62s
INFO: 2024-07-12 12:09:26,914: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA:
INFO: 2024-07-12 12:09:26,952: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.7422680412371134, 'f1_macro': 0.742617154065763}
INFO: 2024-07-12 12:09:26,967: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:09:26,967: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:09:31,320: llmtf.base.darumeru/ruTiE: Loading Dataset: 4.35s
INFO: 2024-07-12 12:11:39,095: llmtf.base.darumeru/ruMMLU: Processing Dataset: 610.47s
INFO: 2024-07-12 12:11:39,097: llmtf.base.darumeru/ruMMLU: Results for darumeru/ruMMLU:
INFO: 2024-07-12 12:11:39,121: llmtf.base.darumeru/ruMMLU: {'acc': 0.4805946323456051}
INFO: 2024-07-12 12:11:39,172: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:11:39,182: llmtf.base.evaluator:
mean darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/ruMMLU darumeru/ruOpenBookQA
0.612 0.515 0.840 0.481 0.481 0.742
INFO: 2024-07-12 12:12:36,742: llmtf.base.darumeru/cp_sent_ru: Processing Dataset: 668.45s
INFO: 2024-07-12 12:12:36,745: llmtf.base.darumeru/cp_sent_ru: Results for darumeru/cp_sent_ru:
INFO: 2024-07-12 12:12:36,749: llmtf.base.darumeru/cp_sent_ru: {'symbol_per_token': 2.368351864419177, 'len': 0.9975712237925127, 'lcs': 0.9673485397373391}
INFO: 2024-07-12 12:12:36,750: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:12:36,751: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:12:38,981: llmtf.base.darumeru/cp_sent_en: Loading Dataset: 2.23s
INFO: 2024-07-12 12:13:54,250: llmtf.base.darumeru/ruTiE: Processing Dataset: 262.93s
INFO: 2024-07-12 12:13:54,253: llmtf.base.darumeru/ruTiE: Results for darumeru/ruTiE:
INFO: 2024-07-12 12:13:54,282: llmtf.base.darumeru/ruTiE: {'acc': 0.5395348837209303}
INFO: 2024-07-12 12:13:54,286: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:13:54,286: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:13:56,875: llmtf.base.darumeru/ruWorldTree: Loading Dataset: 2.59s
INFO: 2024-07-12 12:14:01,832: llmtf.base.darumeru/ruWorldTree: Processing Dataset: 4.96s
INFO: 2024-07-12 12:14:01,833: llmtf.base.darumeru/ruWorldTree: Results for darumeru/ruWorldTree:
INFO: 2024-07-12 12:14:01,838: llmtf.base.darumeru/ruWorldTree: {'acc': 0.8666666666666667, 'f1_macro': 0.8655425965568433}
INFO: 2024-07-12 12:14:01,839: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:14:01,839: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:14:04,137: llmtf.base.darumeru/RWSD: Loading Dataset: 2.30s
INFO: 2024-07-12 12:14:14,784: llmtf.base.darumeru/RWSD: Processing Dataset: 10.65s
INFO: 2024-07-12 12:14:14,786: llmtf.base.darumeru/RWSD: Results for darumeru/RWSD:
INFO: 2024-07-12 12:14:14,790: llmtf.base.darumeru/RWSD: {'acc': 0.5833333333333334}
INFO: 2024-07-12 12:14:14,791: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:14:14,791: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:14:17,901: llmtf.base.darumeru/USE: Loading Dataset: 3.11s
INFO: 2024-07-12 12:15:22,773: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 702.93s
INFO: 2024-07-12 12:15:22,777: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU:
INFO: 2024-07-12 12:15:22,782: llmtf.base.daru/treewayextractive: Processing Dataset: 831.07s
INFO: 2024-07-12 12:15:22,787: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive:
INFO: 2024-07-12 12:15:22,819: llmtf.base.nlpcoreteam/enMMLU: metric
subject
abstract_algebra 0.310000
anatomy 0.674074
astronomy 0.651316
business_ethics 0.630000
clinical_knowledge 0.716981
college_biology 0.701389
college_chemistry 0.460000
college_computer_science 0.540000
college_mathematics 0.340000
college_medicine 0.664740
college_physics 0.372549
computer_security 0.760000
conceptual_physics 0.544681
econometrics 0.482456
electrical_engineering 0.558621
elementary_mathematics 0.412698
formal_logic 0.500000
global_facts 0.370000
high_school_biology 0.767742
high_school_chemistry 0.438424
high_school_computer_science 0.690000
high_school_european_history 0.793939
high_school_geography 0.782828
high_school_government_and_politics 0.896373
high_school_macroeconomics 0.623077
high_school_mathematics 0.344444
high_school_microeconomics 0.672269
high_school_physics 0.370861
high_school_psychology 0.822018
high_school_statistics 0.509259
high_school_us_history 0.828431
high_school_world_history 0.801688
human_aging 0.699552
human_sexuality 0.732824
international_law 0.809917
jurisprudence 0.750000
logical_fallacies 0.748466
machine_learning 0.500000
management 0.796117
marketing 0.871795
medical_genetics 0.730000
miscellaneous 0.827586
moral_disputes 0.728324
moral_scenarios 0.174302
nutrition 0.738562
philosophy 0.707395
prehistory 0.743827
professional_accounting 0.489362
professional_law 0.478488
professional_medicine 0.661765
professional_psychology 0.642157
public_relations 0.645455
security_studies 0.751020
sociology 0.845771
us_foreign_policy 0.890000
virology 0.500000
world_religions 0.824561
INFO: 2024-07-12 12:15:22,827: llmtf.base.nlpcoreteam/enMMLU: metric
subject
STEM 0.515110
humanities 0.683795
other (business, health, misc.) 0.669324
social sciences 0.732187
INFO: 2024-07-12 12:15:22,859: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.6501041813275048}
INFO: 2024-07-12 12:15:22,901: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:15:22,909: llmtf.base.evaluator:
mean darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/cp_sent_ru darumeru/ruMMLU darumeru/ruOpenBookQA darumeru/ruTiE darumeru/ruWorldTree nlpcoreteam/enMMLU
0.670 0.515 0.840 0.481 0.583 0.998 0.481 0.742 0.540 0.866 0.650
INFO: 2024-07-12 12:15:23,042: llmtf.base.daru/treewayextractive: {'r-prec': 0.4038567821067821}
INFO: 2024-07-12 12:15:23,640: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:15:23,779: llmtf.base.evaluator:
mean daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/cp_sent_ru darumeru/ruMMLU darumeru/ruOpenBookQA darumeru/ruTiE darumeru/ruWorldTree nlpcoreteam/enMMLU
0.645 0.404 0.515 0.840 0.481 0.583 0.998 0.481 0.742 0.540 0.866 0.650
INFO: 2024-07-12 12:18:08,369: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 870.98s
INFO: 2024-07-12 12:18:08,388: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU:
INFO: 2024-07-12 12:18:08,427: llmtf.base.nlpcoreteam/ruMMLU: metric
subject
abstract_algebra 0.340000
anatomy 0.385185
astronomy 0.565789
business_ethics 0.540000
clinical_knowledge 0.539623
college_biology 0.472222
college_chemistry 0.400000
college_computer_science 0.460000
college_mathematics 0.400000
college_medicine 0.537572
college_physics 0.323529
computer_security 0.610000
conceptual_physics 0.493617
econometrics 0.421053
electrical_engineering 0.503448
elementary_mathematics 0.365079
formal_logic 0.365079
global_facts 0.330000
high_school_biology 0.612903
high_school_chemistry 0.384236
high_school_computer_science 0.580000
high_school_european_history 0.696970
high_school_geography 0.661616
high_school_government_and_politics 0.611399
high_school_macroeconomics 0.464103
high_school_mathematics 0.340741
high_school_microeconomics 0.504202
high_school_physics 0.357616
high_school_psychology 0.605505
high_school_statistics 0.430556
high_school_us_history 0.725490
high_school_world_history 0.704641
human_aging 0.493274
human_sexuality 0.572519
international_law 0.685950
jurisprudence 0.564815
logical_fallacies 0.472393
machine_learning 0.410714
management 0.631068
marketing 0.730769
medical_genetics 0.540000
miscellaneous 0.615581
moral_disputes 0.575145
moral_scenarios 0.158659
nutrition 0.568627
philosophy 0.530547
prehistory 0.537037
professional_accounting 0.354610
professional_law 0.364407
professional_medicine 0.419118
professional_psychology 0.455882
public_relations 0.509091
security_studies 0.644898
sociology 0.701493
us_foreign_policy 0.700000
virology 0.457831
world_religions 0.736842
INFO: 2024-07-12 12:18:08,434: llmtf.base.nlpcoreteam/ruMMLU: metric
subject
STEM 0.447247
humanities 0.547537
other (business, health, misc.) 0.510233
social sciences 0.570980
INFO: 2024-07-12 12:18:08,457: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.5189991335159189}
INFO: 2024-07-12 12:18:08,503: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:18:08,517: llmtf.base.evaluator:
mean daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/cp_sent_ru darumeru/ruMMLU darumeru/ruOpenBookQA darumeru/ruTiE darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU
0.635 0.404 0.515 0.840 0.481 0.583 0.998 0.481 0.742 0.540 0.866 0.650 0.519
INFO: 2024-07-12 12:20:19,190: llmtf.base.darumeru/USE: Processing Dataset: 361.28s
INFO: 2024-07-12 12:20:19,194: llmtf.base.darumeru/USE: Results for darumeru/USE:
INFO: 2024-07-12 12:20:19,199: llmtf.base.darumeru/USE: {'grade_norm': 0.08921568627450979}
INFO: 2024-07-12 12:20:19,203: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000]
INFO: 2024-07-12 12:20:19,203: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-12 12:20:24,193: llmtf.base.russiannlp/rucola_custom: Loading Dataset: 4.99s
INFO: 2024-07-12 12:21:26,040: llmtf.base.darumeru/cp_sent_en: Processing Dataset: 527.06s
INFO: 2024-07-12 12:21:26,042: llmtf.base.darumeru/cp_sent_en: Results for darumeru/cp_sent_en:
INFO: 2024-07-12 12:21:26,048: llmtf.base.darumeru/cp_sent_en: {'symbol_per_token': 3.9001608324310224, 'len': 0.9990183431863008, 'lcs': 0.9938456701457182}
INFO: 2024-07-12 12:21:26,050: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:21:26,050: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:21:28,516: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 2.47s
INFO: 2024-07-12 12:22:31,014: llmtf.base.russiannlp/rucola_custom: Processing Dataset: 126.82s
INFO: 2024-07-12 12:22:31,016: llmtf.base.russiannlp/rucola_custom: Results for russiannlp/rucola_custom:
INFO: 2024-07-12 12:22:31,059: llmtf.base.russiannlp/rucola_custom: {'acc': 0.7312522425547183, 'mcc': 0.32496683222587364}
INFO: 2024-07-12 12:22:31,065: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:22:31,076: llmtf.base.evaluator:
mean daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/USE darumeru/cp_sent_en darumeru/cp_sent_ru darumeru/ruMMLU darumeru/ruOpenBookQA darumeru/ruTiE darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU russiannlp/rucola_custom
0.616 0.404 0.515 0.840 0.481 0.583 0.089 0.999 0.998 0.481 0.742 0.540 0.866 0.650 0.519 0.528
INFO: 2024-07-12 12:36:23,699: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 895.18s
INFO: 2024-07-12 12:36:23,704: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru:
INFO: 2024-07-12 12:36:23,738: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 2.4722468821961323, 'len': 0.996050202820598, 'lcs': 0.900415560077835}
INFO: 2024-07-12 12:36:23,746: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [32000, 13]
INFO: 2024-07-12 12:36:23,746: llmtf.base.hfmodel: Updated generation_config.stop_strings: ['\n', '\n\n']
INFO: 2024-07-12 12:36:25,765: llmtf.base.darumeru/cp_para_en: Loading Dataset: 2.02s
INFO: 2024-07-12 12:41:45,524: llmtf.base.daru/treewayabstractive: Processing Dataset: 2418.44s
INFO: 2024-07-12 12:41:45,528: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive:
INFO: 2024-07-12 12:41:45,533: llmtf.base.daru/treewayabstractive: {'rouge1': 0.3563853188681492, 'rouge2': 0.12951199754927947}
INFO: 2024-07-12 12:41:45,536: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:41:45,564: llmtf.base.evaluator:
mean daru/treewayabstractive daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/USE darumeru/cp_para_ru darumeru/cp_sent_en darumeru/cp_sent_ru darumeru/ruMMLU darumeru/ruOpenBookQA darumeru/ruTiE darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU russiannlp/rucola_custom
0.611 0.243 0.404 0.515 0.840 0.481 0.583 0.089 0.900 0.999 0.998 0.481 0.742 0.540 0.866 0.650 0.519 0.528
INFO: 2024-07-12 12:47:44,505: llmtf.base.darumeru/cp_para_en: Processing Dataset: 678.73s
INFO: 2024-07-12 12:47:44,507: llmtf.base.darumeru/cp_para_en: Results for darumeru/cp_para_en:
INFO: 2024-07-12 12:47:44,528: llmtf.base.darumeru/cp_para_en: {'symbol_per_token': 3.961010453365225, 'len': 0.9994091346932804, 'lcs': 0.9754829484099882}
INFO: 2024-07-12 12:47:44,528: llmtf.base.evaluator: Ended eval
INFO: 2024-07-12 12:47:44,542: llmtf.base.evaluator:
mean daru/treewayabstractive daru/treewayextractive darumeru/MultiQ darumeru/PARus darumeru/RCB darumeru/RWSD darumeru/USE darumeru/cp_para_en darumeru/cp_para_ru darumeru/cp_sent_en darumeru/cp_sent_ru darumeru/ruMMLU darumeru/ruOpenBookQA darumeru/ruTiE darumeru/ruWorldTree nlpcoreteam/enMMLU nlpcoreteam/ruMMLU russiannlp/rucola_custom
0.631 0.243 0.404 0.515 0.840 0.481 0.583 0.089 0.975 0.900 0.999 0.998 0.481 0.742 0.540 0.866 0.650 0.519 0.528