INFO: 2024-07-13 14:29:23,827: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu']
INFO: 2024-07-13 14:29:23,828: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:23,828: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:23,892: llmtf.base.evaluator: Starting eval on ['darumeru/rummlu']
INFO: 2024-07-13 14:29:23,896: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:23,896: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:24,151: llmtf.base.evaluator: Starting eval on ['darumeru/multiq', 'darumeru/parus', 'darumeru/rcb', 'darumeru/ruopenbookqa', 'darumeru/rutie', 'darumeru/ruworldtree', 'darumeru/rwsd', 'darumeru/use', 'russiannlp/rucola_custom']
INFO: 2024-07-13 14:29:24,154: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:24,154: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:24,345: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu']
INFO: 2024-07-13 14:29:24,346: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:24,346: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:25,729: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive']
INFO: 2024-07-13 14:29:25,731: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:25,731: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:27,678: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive']
INFO: 2024-07-13 14:29:27,678: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:27,678: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:29,484: llmtf.base.evaluator: Starting eval on ['darumeru/cp_sent_ru', 'darumeru/cp_sent_en', 'darumeru/cp_para_ru', 'darumeru/cp_para_en']
INFO: 2024-07-13 14:29:29,484: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:29:29,484: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:29:33,887: llmtf.base.darumeru/cp_sent_ru: Loading Dataset: 4.40s
INFO: 2024-07-13 14:29:39,828: llmtf.base.daru/treewayextractive: Loading Dataset: 12.15s
INFO: 2024-07-13 14:29:42,885: llmtf.base.daru/treewayabstractive: Loading Dataset: 17.15s
INFO: 2024-07-13 14:29:45,765: llmtf.base.darumeru/MultiQ: Loading Dataset: 21.61s
INFO: 2024-07-13 14:30:53,478: llmtf.base.darumeru/ruMMLU: Loading Dataset: 89.58s
INFO: 2024-07-13 14:32:57,360: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 213.01s
INFO: 2024-07-13 14:33:24,939: llmtf.base.darumeru/cp_sent_ru: Processing Dataset: 231.05s
INFO: 2024-07-13 14:33:24,943: llmtf.base.darumeru/cp_sent_ru: Results for darumeru/cp_sent_ru:
INFO: 2024-07-13 14:33:24,962: llmtf.base.darumeru/cp_sent_ru: {'symbol_per_token': 2.8278810271761903, 'len': 0.9977030047832767, 'lcs': 0.9847970468194288}
INFO: 2024-07-13 14:33:24,975: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:33:24,975: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:33:28,742: llmtf.base.darumeru/cp_sent_en: Loading Dataset: 3.77s
INFO: 2024-07-13 14:33:45,284: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 261.46s
INFO: 2024-07-13 14:36:13,193: llmtf.base.darumeru/cp_sent_en: Processing Dataset: 164.45s
INFO: 2024-07-13 14:36:13,226: llmtf.base.darumeru/cp_sent_en: Results for darumeru/cp_sent_en:
INFO: 2024-07-13 14:36:13,244: llmtf.base.darumeru/cp_sent_en: {'symbol_per_token': 4.424509793356442, 'len': 0.9995781033988959, 'lcs': 0.994055994028679}
INFO: 2024-07-13 14:36:13,246: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:36:13,246: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:36:17,469: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 4.22s
INFO: 2024-07-13 14:36:19,338: llmtf.base.daru/treewayextractive: Processing Dataset: 399.51s
INFO: 2024-07-13 14:36:19,340: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive:
INFO: 2024-07-13 14:36:19,799: llmtf.base.daru/treewayextractive: {'r-prec': 0.39738621933621937}
INFO: 2024-07-13 14:36:19,844: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:36:19,850: llmtf.base.evaluator:
    mean                      0.798
    daru/treewayextractive    0.397
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
INFO: 2024-07-13 14:36:56,298: llmtf.base.darumeru/MultiQ: Processing Dataset: 430.53s
INFO: 2024-07-13 14:36:56,300: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ:
INFO: 2024-07-13 14:36:56,305: llmtf.base.darumeru/MultiQ: {'f1': 0.48425376524800046, 'em': 0.3795411089866157}
INFO: 2024-07-13 14:36:56,316: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:36:56,317: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:37:00,009: llmtf.base.darumeru/PARus: Loading Dataset: 3.69s
INFO: 2024-07-13 14:37:13,006: llmtf.base.darumeru/PARus: Processing Dataset: 13.00s
INFO: 2024-07-13 14:37:13,009: llmtf.base.darumeru/PARus: Results for darumeru/PARus:
INFO: 2024-07-13 14:37:13,021: llmtf.base.darumeru/PARus: {'acc': 0.85}
INFO: 2024-07-13 14:37:13,023: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:37:13,023: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:37:16,908: llmtf.base.darumeru/RCB: Loading Dataset: 3.88s
INFO: 2024-07-13 14:37:39,047: llmtf.base.darumeru/RCB: Processing Dataset: 22.12s
INFO: 2024-07-13 14:37:39,050: llmtf.base.darumeru/RCB: Results for darumeru/RCB:
INFO: 2024-07-13 14:37:39,056: llmtf.base.darumeru/RCB: {'acc': 0.5272727272727272, 'f1_macro': 0.43555405633327715}
INFO: 2024-07-13 14:37:39,058: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:37:39,059: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:37:53,697: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 14.64s
INFO: 2024-07-13 14:40:08,010: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 134.31s
INFO: 2024-07-13 14:40:08,013: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA:
INFO: 2024-07-13 14:40:08,027: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.7680412371134021, 'f1_macro': 0.7680185950653384}
INFO: 2024-07-13 14:40:08,043: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:40:08,043: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:40:15,245: llmtf.base.darumeru/ruTiE: Loading Dataset: 7.20s
INFO: 2024-07-13 14:41:10,015: llmtf.base.daru/treewayabstractive: Processing Dataset: 687.13s
INFO: 2024-07-13 14:41:10,017: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive:
INFO: 2024-07-13 14:41:10,037: llmtf.base.daru/treewayabstractive: {'rouge1': 0.360975899636531, 'rouge2': 0.1330737491255763}
INFO: 2024-07-13 14:41:10,042: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:41:10,069: llmtf.base.evaluator:
    mean                      0.647
    daru/treewayabstractive   0.247
    daru/treewayextractive    0.397
    darumeru/MultiQ           0.432
    darumeru/PARus            0.850
    darumeru/RCB              0.481
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
    darumeru/ruOpenBookQA     0.768
INFO: 2024-07-13 14:41:58,403: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 340.93s
INFO: 2024-07-13 14:41:58,453: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru:
INFO: 2024-07-13 14:41:58,457: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 2.9697516062295746, 'len': 0.9984044778480231, 'lcs': 0.9773285044731846}
INFO: 2024-07-13 14:41:58,459: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:41:58,459: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:42:02,784: llmtf.base.darumeru/cp_para_en: Loading Dataset: 4.32s
INFO: 2024-07-13 14:44:45,025: llmtf.base.darumeru/ruTiE: Processing Dataset: 269.78s
INFO: 2024-07-13 14:44:45,027: llmtf.base.darumeru/ruTiE: Results for darumeru/ruTiE:
INFO: 2024-07-13 14:44:45,073: llmtf.base.darumeru/ruTiE: {'acc': 0.3511627906976744}
INFO: 2024-07-13 14:44:45,076: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:44:45,077: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:44:47,875: llmtf.base.darumeru/ruWorldTree: Loading Dataset: 2.80s
INFO: 2024-07-13 14:44:55,693: llmtf.base.darumeru/ruWorldTree: Processing Dataset: 7.80s
INFO: 2024-07-13 14:44:55,695: llmtf.base.darumeru/ruWorldTree: Results for darumeru/ruWorldTree:
INFO: 2024-07-13 14:44:55,700: llmtf.base.darumeru/ruWorldTree: {'acc': 0.8761904761904762, 'f1_macro': 0.8733631471423589}
INFO: 2024-07-13 14:44:55,701: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:44:55,701: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:45:00,084: llmtf.base.darumeru/RWSD: Loading Dataset: 4.38s
INFO: 2024-07-13 14:45:19,405: llmtf.base.darumeru/RWSD: Processing Dataset: 19.32s
INFO: 2024-07-13 14:45:19,421: llmtf.base.darumeru/RWSD: Results for darumeru/RWSD:
INFO: 2024-07-13 14:45:19,425: llmtf.base.darumeru/RWSD: {'acc': 0.5441176470588235}
INFO: 2024-07-13 14:45:19,427: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:45:19,427: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:45:34,612: llmtf.base.darumeru/USE: Loading Dataset: 15.18s
INFO: 2024-07-13 14:46:14,635: llmtf.base.darumeru/cp_para_en: Processing Dataset: 251.85s
INFO: 2024-07-13 14:46:14,638: llmtf.base.darumeru/cp_para_en: Results for darumeru/cp_para_en:
INFO: 2024-07-13 14:46:14,657: llmtf.base.darumeru/cp_para_en: {'symbol_per_token': 4.485777628533072, 'len': 0.999455845790753, 'lcs': 0.9727731185644367}
INFO: 2024-07-13 14:46:14,658: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:46:14,684: llmtf.base.evaluator:
    mean                      0.684
    daru/treewayabstractive   0.247
    daru/treewayextractive    0.397
    darumeru/MultiQ           0.432
    darumeru/PARus            0.850
    darumeru/RCB              0.481
    darumeru/RWSD             0.544
    darumeru/cp_para_en       0.973
    darumeru/cp_para_ru       0.977
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
    darumeru/ruOpenBookQA     0.768
    darumeru/ruTiE            0.351
    darumeru/ruWorldTree      0.875
INFO: 2024-07-13 14:48:58,982: llmtf.base.darumeru/USE: Processing Dataset: 204.37s
INFO: 2024-07-13 14:48:58,999: llmtf.base.darumeru/USE: Results for darumeru/USE:
INFO: 2024-07-13 14:48:59,004: llmtf.base.darumeru/USE: {'grade_norm': 0.18725490196078434}
INFO: 2024-07-13 14:48:59,010: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009]
INFO: 2024-07-13 14:48:59,010: llmtf.base.hfmodel: Updated generation_config.stop_strings: []
INFO: 2024-07-13 14:49:19,451: llmtf.base.russiannlp/rucola_custom: Loading Dataset: 20.44s
INFO: 2024-07-13 14:50:14,250: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 1036.87s
INFO: 2024-07-13 14:50:14,255: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU:
INFO: 2024-07-13 14:50:14,302: llmtf.base.nlpcoreteam/enMMLU:
    subject                              metric
    abstract_algebra                     0.350000
    anatomy                              0.696296
    astronomy                            0.730263
    business_ethics                      0.700000
    clinical_knowledge                   0.754717
    college_biology                      0.812500
    college_chemistry                    0.500000
    college_computer_science             0.590000
    college_mathematics                  0.330000
    college_medicine                     0.670520
    college_physics                      0.470588
    computer_security                    0.780000
    conceptual_physics                   0.570213
    econometrics                         0.561404
    electrical_engineering               0.634483
    elementary_mathematics               0.439153
    formal_logic                         0.507937
    global_facts                         0.430000
    high_school_biology                  0.800000
    high_school_chemistry                0.517241
    high_school_computer_science         0.760000
    high_school_european_history         0.787879
    high_school_geography                0.843434
    high_school_government_and_politics  0.922280
    high_school_macroeconomics           0.671795
    high_school_mathematics              0.381481
    high_school_microeconomics           0.764706
    high_school_physics                  0.417219
    high_school_psychology               0.847706
    high_school_statistics               0.537037
    high_school_us_history               0.833333
    high_school_world_history            0.835443
    human_aging                          0.730942
    human_sexuality                      0.801527
    international_law                    0.818182
    jurisprudence                        0.759259
    logical_fallacies                    0.766871
    machine_learning                     0.544643
    management                           0.825243
    marketing                            0.901709
    medical_genetics                     0.830000
    miscellaneous                        0.842912
    moral_disputes                       0.751445
    moral_scenarios                      0.497207
    nutrition                            0.754902
    philosophy                           0.720257
    prehistory                           0.753086
    professional_accounting              0.556738
    professional_law                     0.483051
    professional_medicine                0.742647
    professional_psychology              0.717320
    public_relations                     0.690909
    security_studies                     0.722449
    sociology                            0.840796
    us_foreign_policy                    0.840000
    virology                             0.512048
    world_religions                      0.818713
INFO: 2024-07-13 14:50:14,310: llmtf.base.nlpcoreteam/enMMLU:
    subject                              metric
    STEM                                 0.564712
    humanities                           0.717897
    other (business, health, misc.)      0.710620
    social sciences                      0.768694
INFO: 2024-07-13 14:50:14,318: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.6904807286717012}
INFO: 2024-07-13 14:50:14,385: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:50:14,399: llmtf.base.evaluator:
    mean                      0.651
    daru/treewayabstractive   0.247
    daru/treewayextractive    0.397
    darumeru/MultiQ           0.432
    darumeru/PARus            0.850
    darumeru/RCB              0.481
    darumeru/RWSD             0.544
    darumeru/USE              0.187
    darumeru/cp_para_en       0.973
    darumeru/cp_para_ru       0.977
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
    darumeru/ruOpenBookQA     0.768
    darumeru/ruTiE            0.351
    darumeru/ruWorldTree      0.875
    nlpcoreteam/enMMLU        0.690
INFO: 2024-07-13 14:51:55,784: llmtf.base.darumeru/ruMMLU: Processing Dataset: 1262.30s
INFO: 2024-07-13 14:51:55,788: llmtf.base.darumeru/ruMMLU: Results for darumeru/ruMMLU:
INFO: 2024-07-13 14:51:55,799: llmtf.base.darumeru/ruMMLU: {'acc': 0.5138182180983737}
INFO: 2024-07-13 14:51:55,888: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:51:55,906: llmtf.base.evaluator:
    mean                      0.643
    daru/treewayabstractive   0.247
    daru/treewayextractive    0.397
    darumeru/MultiQ           0.432
    darumeru/PARus            0.850
    darumeru/RCB              0.481
    darumeru/RWSD             0.544
    darumeru/USE              0.187
    darumeru/cp_para_en       0.973
    darumeru/cp_para_ru       0.977
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
    darumeru/ruMMLU           0.514
    darumeru/ruOpenBookQA     0.768
    darumeru/ruTiE            0.351
    darumeru/ruWorldTree      0.875
    nlpcoreteam/enMMLU        0.690
INFO: 2024-07-13 14:52:18,001: llmtf.base.russiannlp/rucola_custom: Processing Dataset: 178.55s
INFO: 2024-07-13 14:52:18,002: llmtf.base.russiannlp/rucola_custom: Results for russiannlp/rucola_custom:
INFO: 2024-07-13 14:52:18,035: llmtf.base.russiannlp/rucola_custom: {'acc': 0.7115177610333692, 'mcc': 0.3362227509262135}
INFO: 2024-07-13 14:52:18,046: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:52:18,077: llmtf.base.evaluator:
    mean                      0.636
    daru/treewayabstractive   0.247
    daru/treewayextractive    0.397
    darumeru/MultiQ           0.432
    darumeru/PARus            0.850
    darumeru/RCB              0.481
    darumeru/RWSD             0.544
    darumeru/USE              0.187
    darumeru/cp_para_en       0.973
    darumeru/cp_para_ru       0.977
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
    darumeru/ruMMLU           0.514
    darumeru/ruOpenBookQA     0.768
    darumeru/ruTiE            0.351
    darumeru/ruWorldTree      0.875
    nlpcoreteam/enMMLU        0.690
    russiannlp/rucola_custom  0.524
INFO: 2024-07-13 14:59:07,852: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 1522.57s
INFO: 2024-07-13 14:59:07,871: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU:
INFO: 2024-07-13 14:59:07,917: llmtf.base.nlpcoreteam/ruMMLU:
    subject                              metric
    abstract_algebra                     0.330000
    anatomy                              0.511111
    astronomy                            0.651316
    business_ethics                      0.680000
    clinical_knowledge                   0.588679
    college_biology                      0.534722
    college_chemistry                    0.480000
    college_computer_science             0.520000
    college_mathematics                  0.350000
    college_medicine                     0.549133
    college_physics                      0.352941
    computer_security                    0.720000
    conceptual_physics                   0.540426
    econometrics                         0.438596
    electrical_engineering               0.572414
    elementary_mathematics               0.417989
    formal_logic                         0.396825
    global_facts                         0.370000
    high_school_biology                  0.664516
    high_school_chemistry                0.394089
    high_school_computer_science         0.690000
    high_school_european_history         0.763636
    high_school_geography                0.666667
    high_school_government_and_politics  0.647668
    high_school_macroeconomics           0.553846
    high_school_mathematics              0.348148
    high_school_microeconomics           0.546218
    high_school_physics                  0.410596
    high_school_psychology               0.682569
    high_school_statistics               0.449074
    high_school_us_history               0.691176
    high_school_world_history            0.734177
    human_aging                          0.538117
    human_sexuality                      0.641221
    international_law                    0.743802
    jurisprudence                        0.657407
    logical_fallacies                    0.558282
    machine_learning                     0.401786
    management                           0.689320
    marketing                            0.730769
    medical_genetics                     0.670000
    miscellaneous                        0.650064
    moral_disputes                       0.630058
    moral_scenarios                      0.382123
    nutrition                            0.604575
    philosophy                           0.614148
    prehistory                           0.574074
    professional_accounting              0.397163
    professional_law                     0.397001
    professional_medicine                0.514706
    professional_psychology              0.514706
    public_relations                     0.609091
    security_studies                     0.657143
    sociology                            0.676617
    us_foreign_policy                    0.740000
    virology                             0.457831
    world_religions                      0.695906
INFO: 2024-07-13 14:59:07,924: llmtf.base.nlpcoreteam/ruMMLU:
    subject                              metric
    STEM                                 0.490445
    humanities                           0.602971
    other (business, health, misc.)      0.567962
    social sciences                      0.614529
INFO: 2024-07-13 14:59:07,947: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.5689766403256171}
INFO: 2024-07-13 14:59:08,029: llmtf.base.evaluator: Ended eval
INFO: 2024-07-13 14:59:08,049: llmtf.base.evaluator:
    mean                      0.632
    daru/treewayabstractive   0.247
    daru/treewayextractive    0.397
    darumeru/MultiQ           0.432
    darumeru/PARus            0.850
    darumeru/RCB              0.481
    darumeru/RWSD             0.544
    darumeru/USE              0.187
    darumeru/cp_para_en       0.973
    darumeru/cp_para_ru       0.977
    darumeru/cp_sent_en       1.000
    darumeru/cp_sent_ru       0.998
    darumeru/ruMMLU           0.514
    darumeru/ruOpenBookQA     0.768
    darumeru/ruTiE            0.351
    darumeru/ruWorldTree      0.875
    nlpcoreteam/enMMLU        0.690
    nlpcoreteam/ruMMLU        0.569
    russiannlp/rucola_custom  0.524