| INFO: 2024-07-13 13:32:05,327: llmtf.base.evaluator: Starting eval on ['darumeru/cp_sent_ru', 'darumeru/cp_sent_en', 'darumeru/cp_para_ru', 'darumeru/cp_para_en'] |
| INFO: 2024-07-13 13:32:05,328: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:05,328: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:07,455: llmtf.base.evaluator: Starting eval on ['darumeru/multiq', 'darumeru/parus', 'darumeru/rcb', 'darumeru/ruopenbookqa', 'darumeru/rutie', 'darumeru/ruworldtree', 'darumeru/rwsd', 'darumeru/use', 'russiannlp/rucola_custom'] |
| INFO: 2024-07-13 13:32:07,455: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:07,455: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:07,493: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/enmmlu'] |
| INFO: 2024-07-13 13:32:07,494: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:07,494: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:07,655: llmtf.base.evaluator: Starting eval on ['daru/treewayextractive'] |
| INFO: 2024-07-13 13:32:07,655: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:07,655: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:07,686: llmtf.base.evaluator: Starting eval on ['darumeru/rummlu'] |
| INFO: 2024-07-13 13:32:07,686: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:07,686: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:07,745: llmtf.base.evaluator: Starting eval on ['nlpcoreteam/rummlu'] |
| INFO: 2024-07-13 13:32:07,746: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:07,746: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:07,865: llmtf.base.evaluator: Starting eval on ['daru/treewayabstractive'] |
| INFO: 2024-07-13 13:32:07,866: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:32:07,866: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:32:08,209: llmtf.base.darumeru/cp_sent_ru: Loading Dataset: 2.88s |
| INFO: 2024-07-13 13:32:11,413: llmtf.base.darumeru/MultiQ: Loading Dataset: 3.96s |
| INFO: 2024-07-13 13:32:12,529: llmtf.base.daru/treewayabstractive: Loading Dataset: 4.66s |
| INFO: 2024-07-13 13:32:16,292: llmtf.base.darumeru/ruMMLU: Loading Dataset: 8.61s |
| INFO: 2024-07-13 13:32:19,927: llmtf.base.daru/treewayextractive: Loading Dataset: 12.27s |
| INFO: 2024-07-13 13:34:12,990: llmtf.base.darumeru/cp_sent_ru: Processing Dataset: 124.78s |
| INFO: 2024-07-13 13:34:12,992: llmtf.base.darumeru/cp_sent_ru: Results for darumeru/cp_sent_ru: |
| INFO: 2024-07-13 13:34:12,996: llmtf.base.darumeru/cp_sent_ru: {'symbol_per_token': 2.82914536342449, 'len': 0.993343330079649, 'lcs': 0.953698746500658} |
| INFO: 2024-07-13 13:34:12,997: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:34:12,998: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:34:15,015: llmtf.base.darumeru/cp_sent_en: Loading Dataset: 2.02s |
| INFO: 2024-07-13 13:34:20,882: llmtf.base.nlpcoreteam/enMMLU: Loading Dataset: 133.39s |
| INFO: 2024-07-13 13:34:23,420: llmtf.base.nlpcoreteam/ruMMLU: Loading Dataset: 135.67s |
| INFO: 2024-07-13 13:35:39,473: llmtf.base.darumeru/cp_sent_en: Processing Dataset: 84.46s |
| INFO: 2024-07-13 13:35:39,476: llmtf.base.darumeru/cp_sent_en: Results for darumeru/cp_sent_en: |
| INFO: 2024-07-13 13:35:39,481: llmtf.base.darumeru/cp_sent_en: {'symbol_per_token': 4.424907714143083, 'len': 0.9996416196590585, 'lcs': 0.995460815828734} |
| INFO: 2024-07-13 13:35:39,482: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:35:39,482: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:35:41,571: llmtf.base.darumeru/cp_para_ru: Loading Dataset: 2.09s |
| INFO: 2024-07-13 13:36:31,701: llmtf.base.darumeru/MultiQ: Processing Dataset: 260.29s |
| INFO: 2024-07-13 13:36:31,704: llmtf.base.darumeru/MultiQ: Results for darumeru/MultiQ: |
| INFO: 2024-07-13 13:36:31,722: llmtf.base.darumeru/MultiQ: {'f1': 0.3347231559740819, 'em': 0.2055449330783939} |
| INFO: 2024-07-13 13:36:31,726: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:36:31,726: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:36:33,871: llmtf.base.darumeru/PARus: Loading Dataset: 2.14s |
| INFO: 2024-07-13 13:36:36,545: llmtf.base.darumeru/PARus: Processing Dataset: 2.67s |
| INFO: 2024-07-13 13:36:36,562: llmtf.base.darumeru/PARus: Results for darumeru/PARus: |
| INFO: 2024-07-13 13:36:36,574: llmtf.base.darumeru/PARus: {'acc': 0.64} |
| INFO: 2024-07-13 13:36:36,575: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:36:36,575: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:36:38,646: llmtf.base.darumeru/RCB: Loading Dataset: 2.07s |
| INFO: 2024-07-13 13:36:44,261: llmtf.base.darumeru/RCB: Processing Dataset: 5.61s |
| INFO: 2024-07-13 13:36:44,263: llmtf.base.darumeru/RCB: Results for darumeru/RCB: |
| INFO: 2024-07-13 13:36:44,269: llmtf.base.darumeru/RCB: {'acc': 0.4954545454545455, 'f1_macro': 0.42697840772670775} |
| INFO: 2024-07-13 13:36:44,270: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:36:44,270: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:36:47,516: llmtf.base.darumeru/ruOpenBookQA: Loading Dataset: 3.25s |
| INFO: 2024-07-13 13:37:20,884: llmtf.base.darumeru/ruOpenBookQA: Processing Dataset: 33.37s |
| INFO: 2024-07-13 13:37:20,885: llmtf.base.darumeru/ruOpenBookQA: Results for darumeru/ruOpenBookQA: |
| INFO: 2024-07-13 13:37:20,912: llmtf.base.darumeru/ruOpenBookQA: {'acc': 0.6911512027491409, 'f1_macro': 0.6914435564575607} |
| INFO: 2024-07-13 13:37:20,918: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:37:20,918: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:37:28,618: llmtf.base.darumeru/ruTiE: Loading Dataset: 7.70s |
| INFO: 2024-07-13 13:38:01,706: llmtf.base.daru/treewayabstractive: Processing Dataset: 349.18s |
| INFO: 2024-07-13 13:38:01,711: llmtf.base.daru/treewayabstractive: Results for daru/treewayabstractive: |
| INFO: 2024-07-13 13:38:01,715: llmtf.base.daru/treewayabstractive: {'rouge1': 0.35425172563213586, 'rouge2': 0.12878361258702994} |
| INFO: 2024-07-13 13:38:01,717: llmtf.base.evaluator: Ended eval |
| INFO: 2024-07-13 13:38:01,742: llmtf.base.evaluator: |
| mean   daru/treewayabstractive  darumeru/MultiQ  darumeru/PARus  darumeru/RCB  darumeru/cp_sent_en  darumeru/cp_sent_ru  darumeru/ruOpenBookQA |
| 0.614  0.242                    0.270            0.640           0.461         1.000                0.993                0.691 |
| INFO: 2024-07-13 13:38:18,320: llmtf.base.darumeru/ruTiE: Processing Dataset: 49.68s |
| ERROR: 2024-07-13 13:38:18,323: llmtf.base.evaluator: CUDA out of memory. Tried to allocate 29.55 GiB. GPU |
| ERROR: 2024-07-13 13:38:18,344: llmtf.base.evaluator: Traceback (most recent call last): |
| File "/scratch/tikhomirov/workdir/projects/llmtf_open/llmtf/evaluator.py", line 42, in evaluate |
| self.evaluate_dataset(task, model, output_dir, prompt_max_len, few_shot_count, generation_config, batch_size, max_sample_per_dataset) |
| File "/scratch/tikhomirov/workdir/projects/llmtf_open/llmtf/evaluator.py", line 65, in evaluate_dataset |
| prompts, y_preds, infos = getattr(model, task.method + '_batch')(**messages_batch) |
| File "/scratch/tikhomirov/workdir/projects/llmtf_open/llmtf/model.py", line 366, in calculate_tokens_proba_batch |
| outputs = self.model(**data) |
| File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl |
| return self._call_impl(*args, **kwargs) |
| File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl |
| return forward_call(*args, **kwargs) |
| File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1196, in forward |
| logits = logits.float() |
| torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 29.55 GiB. GPU |
|
|
| INFO: 2024-07-13 13:38:40,987: llmtf.base.nlpcoreteam/enMMLU: Processing Dataset: 260.10s |
| INFO: 2024-07-13 13:38:40,988: llmtf.base.nlpcoreteam/enMMLU: Results for nlpcoreteam/enMMLU: |
| INFO: 2024-07-13 13:38:41,032: llmtf.base.nlpcoreteam/enMMLU: metric |
| subject |
| abstract_algebra 0.360000 |
| anatomy 0.718519 |
| astronomy 0.736842 |
| business_ethics 0.730000 |
| clinical_knowledge 0.735849 |
| college_biology 0.791667 |
| college_chemistry 0.470000 |
| college_computer_science 0.600000 |
| college_mathematics 0.300000 |
| college_medicine 0.647399 |
| college_physics 0.490196 |
| computer_security 0.760000 |
| conceptual_physics 0.574468 |
| econometrics 0.517544 |
| electrical_engineering 0.606897 |
| elementary_mathematics 0.481481 |
| formal_logic 0.523810 |
| global_facts 0.430000 |
| high_school_biology 0.806452 |
| high_school_chemistry 0.551724 |
| high_school_computer_science 0.730000 |
| high_school_european_history 0.733333 |
| high_school_geography 0.828283 |
| high_school_government_and_politics 0.865285 |
| high_school_macroeconomics 0.630769 |
| high_school_mathematics 0.374074 |
| high_school_microeconomics 0.747899 |
| high_school_physics 0.410596 |
| high_school_psychology 0.856881 |
| high_school_statistics 0.546296 |
| high_school_us_history 0.828431 |
| high_school_world_history 0.839662 |
| human_aging 0.721973 |
| human_sexuality 0.778626 |
| international_law 0.760331 |
| jurisprudence 0.787037 |
| logical_fallacies 0.785276 |
| machine_learning 0.464286 |
| management 0.805825 |
| marketing 0.893162 |
| medical_genetics 0.780000 |
| miscellaneous 0.840358 |
| moral_disputes 0.687861 |
| moral_scenarios 0.293855 |
| nutrition 0.764706 |
| philosophy 0.717042 |
| prehistory 0.700617 |
| professional_accounting 0.539007 |
| professional_law 0.482399 |
| professional_medicine 0.738971 |
| professional_psychology 0.676471 |
| public_relations 0.645455 |
| security_studies 0.714286 |
| sociology 0.825871 |
| us_foreign_policy 0.890000 |
| virology 0.487952 |
| world_religions 0.830409 |
| INFO: 2024-07-13 13:38:41,039: llmtf.base.nlpcoreteam/enMMLU: metric |
| subject |
| STEM 0.558610 |
| humanities 0.690005 |
| other (business, health, misc.) 0.702409 |
| social sciences 0.748114 |
| INFO: 2024-07-13 13:38:41,047: llmtf.base.nlpcoreteam/enMMLU: {'acc': 0.6747843602567992} |
| INFO: 2024-07-13 13:38:41,076: llmtf.base.evaluator: Ended eval |
| INFO: 2024-07-13 13:38:41,085: llmtf.base.evaluator: |
| mean   daru/treewayabstractive  darumeru/MultiQ  darumeru/PARus  darumeru/RCB  darumeru/cp_sent_en  darumeru/cp_sent_ru  darumeru/ruOpenBookQA  nlpcoreteam/enMMLU |
| 0.621  0.242                    0.270            0.640           0.461         1.000                0.993                0.691                  0.675 |
| INFO: 2024-07-13 13:38:44,800: llmtf.base.darumeru/cp_para_ru: Processing Dataset: 183.23s |
| INFO: 2024-07-13 13:38:44,802: llmtf.base.darumeru/cp_para_ru: Results for darumeru/cp_para_ru: |
| INFO: 2024-07-13 13:38:44,806: llmtf.base.darumeru/cp_para_ru: {'symbol_per_token': 2.9681178664729675, 'len': 0.9946813313624652, 'lcs': 0.9149641417867646} |
| INFO: 2024-07-13 13:38:44,806: llmtf.base.hfmodel: Updated generation_config.eos_token_id: [128001, 128009] |
| INFO: 2024-07-13 13:38:44,806: llmtf.base.hfmodel: Updated generation_config.stop_strings: [] |
| INFO: 2024-07-13 13:38:46,864: llmtf.base.darumeru/cp_para_en: Loading Dataset: 2.06s |
| INFO: 2024-07-13 13:39:03,671: llmtf.base.darumeru/ruMMLU: Processing Dataset: 407.38s |
| INFO: 2024-07-13 13:39:03,675: llmtf.base.darumeru/ruMMLU: Results for darumeru/ruMMLU: |
| INFO: 2024-07-13 13:39:03,682: llmtf.base.darumeru/ruMMLU: {'acc': 0.5040407063753367} |
| INFO: 2024-07-13 13:39:03,716: llmtf.base.evaluator: Ended eval |
| INFO: 2024-07-13 13:39:03,724: llmtf.base.evaluator: |
| mean   daru/treewayabstractive  darumeru/MultiQ  darumeru/PARus  darumeru/RCB  darumeru/cp_para_ru  darumeru/cp_sent_en  darumeru/cp_sent_ru  darumeru/ruMMLU  darumeru/ruOpenBookQA  nlpcoreteam/enMMLU |
| 0.639  0.242                    0.270            0.640           0.461         0.915                1.000                0.993                0.504            0.691                  0.675 |
| INFO: 2024-07-13 13:39:21,125: llmtf.base.daru/treewayextractive: Processing Dataset: 421.20s |
| INFO: 2024-07-13 13:39:21,127: llmtf.base.daru/treewayextractive: Results for daru/treewayextractive: |
| INFO: 2024-07-13 13:39:21,343: llmtf.base.daru/treewayextractive: {'r-prec': 0.39497193362193367} |
| INFO: 2024-07-13 13:39:21,388: llmtf.base.evaluator: Ended eval |
| INFO: 2024-07-13 13:39:21,397: llmtf.base.evaluator: |
| mean   daru/treewayabstractive  daru/treewayextractive  darumeru/MultiQ  darumeru/PARus  darumeru/RCB  darumeru/cp_para_ru  darumeru/cp_sent_en  darumeru/cp_sent_ru  darumeru/ruMMLU  darumeru/ruOpenBookQA  nlpcoreteam/enMMLU |
| 0.617  0.242                    0.395                   0.270            0.640           0.461         0.915                1.000                0.993                0.504            0.691                  0.675 |
| INFO: 2024-07-13 13:40:36,169: llmtf.base.nlpcoreteam/ruMMLU: Processing Dataset: 372.74s |
| INFO: 2024-07-13 13:40:36,171: llmtf.base.nlpcoreteam/ruMMLU: Results for nlpcoreteam/ruMMLU: |
| INFO: 2024-07-13 13:40:36,214: llmtf.base.nlpcoreteam/ruMMLU: metric |
| subject |
| abstract_algebra 0.290000 |
| anatomy 0.459259 |
| astronomy 0.657895 |
| business_ethics 0.600000 |
| clinical_knowledge 0.562264 |
| college_biology 0.541667 |
| college_chemistry 0.400000 |
| college_computer_science 0.460000 |
| college_mathematics 0.320000 |
| college_medicine 0.497110 |
| college_physics 0.352941 |
| computer_security 0.570000 |
| conceptual_physics 0.472340 |
| econometrics 0.359649 |
| electrical_engineering 0.544828 |
| elementary_mathematics 0.417989 |
| formal_logic 0.396825 |
| global_facts 0.350000 |
| high_school_biology 0.632258 |
| high_school_chemistry 0.418719 |
| high_school_computer_science 0.610000 |
| high_school_european_history 0.715152 |
| high_school_geography 0.656566 |
| high_school_government_and_politics 0.595855 |
| high_school_macroeconomics 0.512821 |
| high_school_mathematics 0.333333 |
| high_school_microeconomics 0.500000 |
| high_school_physics 0.350993 |
| high_school_psychology 0.667890 |
| high_school_statistics 0.462963 |
| high_school_us_history 0.656863 |
| high_school_world_history 0.713080 |
| human_aging 0.547085 |
| human_sexuality 0.648855 |
| international_law 0.702479 |
| jurisprudence 0.592593 |
| logical_fallacies 0.527607 |
| machine_learning 0.357143 |
| management 0.669903 |
| marketing 0.700855 |
| medical_genetics 0.560000 |
| miscellaneous 0.641124 |
| moral_disputes 0.560694 |
| moral_scenarios 0.251397 |
| nutrition 0.594771 |
| philosophy 0.565916 |
| prehistory 0.561728 |
| professional_accounting 0.386525 |
| professional_law 0.356584 |
| professional_medicine 0.518382 |
| professional_psychology 0.482026 |
| public_relations 0.572727 |
| security_studies 0.620408 |
| sociology 0.696517 |
| us_foreign_policy 0.750000 |
| virology 0.421687 |
| world_religions 0.690058 |
| INFO: 2024-07-13 13:40:36,222: llmtf.base.nlpcoreteam/ruMMLU: metric |
| subject |
| STEM 0.455171 |
| humanities 0.560844 |
| other (business, health, misc.) 0.536355 |
| social sciences 0.588610 |
| INFO: 2024-07-13 13:40:36,259: llmtf.base.nlpcoreteam/ruMMLU: {'acc': 0.5352447672872421} |
| INFO: 2024-07-13 13:40:36,291: llmtf.base.evaluator: Ended eval |
| INFO: 2024-07-13 13:40:36,320: llmtf.base.evaluator: |
| mean   daru/treewayabstractive  daru/treewayextractive  darumeru/MultiQ  darumeru/PARus  darumeru/RCB  darumeru/cp_para_ru  darumeru/cp_sent_en  darumeru/cp_sent_ru  darumeru/ruMMLU  darumeru/ruOpenBookQA  nlpcoreteam/enMMLU  nlpcoreteam/ruMMLU |
| 0.610  0.242                    0.395                   0.270            0.640           0.461         0.915                1.000                0.993                0.504            0.691                  0.675               0.535 |
| INFO: 2024-07-13 13:40:51,661: llmtf.base.darumeru/cp_para_en: Processing Dataset: 124.80s |
| INFO: 2024-07-13 13:40:51,663: llmtf.base.darumeru/cp_para_en: Results for darumeru/cp_para_en: |
| INFO: 2024-07-13 13:40:51,667: llmtf.base.darumeru/cp_para_en: {'symbol_per_token': 4.463061170262149, 'len': 0.9941296296409974, 'lcs': 0.9527227031116661} |
| INFO: 2024-07-13 13:40:51,667: llmtf.base.evaluator: Ended eval |
| INFO: 2024-07-13 13:40:51,674: llmtf.base.evaluator: |
| mean   daru/treewayabstractive  daru/treewayextractive  darumeru/MultiQ  darumeru/PARus  darumeru/RCB  darumeru/cp_para_en  darumeru/cp_para_ru  darumeru/cp_sent_en  darumeru/cp_sent_ru  darumeru/ruMMLU  darumeru/ruOpenBookQA  nlpcoreteam/enMMLU  nlpcoreteam/ruMMLU |
| 0.636  0.242                    0.395                   0.270            0.640           0.461         0.953                0.915                1.000                0.993                0.504            0.691                  0.675               0.535 |