Running inference with lm-evaluation-harness produces strange accuracy results
Hi, I set up transformers as instructed and used lm-evaluation-harness to run some accuracy benchmarks, but the results look quite strange. Any idea what the cause might be?
Running command:
python3 -m lm_eval --model hf --model_args pretrained=microsoft/bitnet-b1.58-2B-4T,dtype=float16 --tasks hellaswag,winogrande,piqa,gsm8k,truthfulqa --device cuda --batch_size 64
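One way to narrow this down is a much smaller run that also saves per-sample outputs for inspection. The sketch below only builds and prints such a command; it does not run anything. Switching `dtype` to `bfloat16` is an untested assumption about what might matter, and the `results` output directory, the batch size, and the limit of 50 documents are arbitrary illustrative choices. The log further down belongs to the original command above, not to this variant.

```python
import shlex

# Model arguments from the original command; dtype is the one knob varied here.
model_args = {
    "pretrained": "microsoft/bitnet-b1.58-2B-4T",
    "dtype": "bfloat16",  # assumption: try bfloat16 instead of float16
}

argv = [
    "python3", "-m", "lm_eval",
    "--model", "hf",
    "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
    "--tasks", "piqa",            # a single, fast multiple-choice task
    "--device", "cuda",
    "--batch_size", "8",          # arbitrary smaller batch size
    "--limit", "50",              # evaluate only the first 50 documents
    "--output_path", "results",   # hypothetical directory name
    "--log_samples",              # save per-sample model outputs
]

# Print the command as a single shell-safe line for copy and paste.
print(shlex.join(argv))
```

If the logged samples show NaN log-likelihoods for every choice, the problem is likely in the forward pass rather than in the task setup.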
Running log:
############################################################################################
2025-05-01:18:19:32,020 INFO [main.py:308] Verbosity set to INFO
2025-05-01:18:19:32,164 INFO [init.py:491] group and group_alias keys in tasks' configs will no longer be used in the next release of lm-eval. tag will be used to allow to call a collection of tasks just like group. group will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-05-01:18:19:36,676 INFO [main.py:414] Selected Tasks: ['gsm8k', 'hellaswag', 'piqa', 'truthfulqa', 'winogrande']
2025-05-01:18:19:36,678 INFO [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-05-01:18:19:36,678 INFO [evaluator.py:198] Initializing hf model, with arguments: {'pretrained': 'microsoft/bitnet-b1.58-2B-4T', 'dtype': 'float16'}
2025-05-01:18:19:36,733 WARNING [other.py:349] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2025-05-01:18:19:36,734 INFO [huggingface.py:130] Using device 'cuda'
2025-05-01:18:19:37,561 INFO [huggingface.py:366] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
Downloading readme: 100%|████████████████████| 7.94k/7.94k [00:00<00:00, 13.6MB/s]
Downloading data: 100%|████████████████████| 2.31M/2.31M [00:00<00:00, 8.27MB/s]
Downloading data: 100%|████████████████████| 419k/419k [00:00<00:00, 2.09MB/s]
Generating train split: 100%|████████████████████| 7473/7473 [00:00<00:00, 188559.36 examples/s]
Generating test split: 100%|████████████████████| 1319/1319 [00:00<00:00, 301783.06 examples/s]
Downloading readme: 100%|████████████████████| 9.59k/9.59k [00:00<00:00, 18.6MB/s]
Downloading data: 100%|████████████████████| 223k/223k [00:00<00:00, 1.01MB/s]
Generating validation split: 100%|████████████████████| 817/817 [00:00<00:00, 87243.40 examples/s]
Map: 100%|████████████████████| 817/817 [00:00<00:00, 9415.25 examples/s]
Downloading data: 100%|████████████████████| 271k/271k [00:00<00:00, 1.08MB/s]
Generating validation split: 100%|████████████████████| 817/817 [00:00<00:00, 84042.44 examples/s]
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,318 INFO [evaluator.py:277] Setting fewshot random generator seed to 1234
2025-05-01:18:20:12,319 WARNING [huggingface.py:469] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2025-05-01:18:20:12,323 INFO [task.py:423] Building contexts for winogrande on rank 0...
100%|████████████████████| 1267/1267 [00:00<00:00, 3394.33it/s]
2025-05-01:18:20:12,730 INFO [task.py:423] Building contexts for truthfulqa_mc2 on rank 0...
100%|████████████████████| 817/817 [00:00<00:00, 1019.90it/s]
2025-05-01:18:20:13,577 INFO [task.py:423] Building contexts for truthfulqa_mc1 on rank 0...
100%|████████████████████| 817/817 [00:00<00:00, 1043.73it/s]
2025-05-01:18:20:14,405 INFO [task.py:423] Building contexts for truthfulqa_gen on rank 0...
100%|████████████████████| 817/817 [00:00<00:00, 1768.20it/s]
2025-05-01:18:20:14,916 INFO [task.py:423] Building contexts for piqa on rank 0...
100%|████████████████████| 1838/1838 [00:01<00:00, 1469.69it/s]
2025-05-01:18:20:16,220 INFO [task.py:423] Building contexts for hellaswag on rank 0...
100%|████████████████████| 10042/10042 [00:03<00:00, 2837.28it/s]
2025-05-01:18:20:20,682 INFO [task.py:423] Building contexts for gsm8k on rank 0...
100%|████████████████████| 1319/1319 [00:04<00:00, 306.52it/s]
2025-05-01:18:20:25,014 INFO [evaluator.py:463] Running loglikelihood requests
Running loglikelihood requests: 100%|████████████████████| 56374/56374 [08:22<00:00, 112.17it/s]
2025-05-01:18:29:05,459 INFO [evaluator.py:463] Running generate_until requests
Running generate_until requests: 100%|████████████████████| 2136/2136 [20:12<00:00, 1.76it/s]
2025-05-01:18:49:18,296 INFO [rouge_scorer.py:83] Using default tokenizer.
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2025-05-01:18:58:16,987 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
hf (pretrained=microsoft/bitnet-b1.58-2B-4T,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.2504 | ± | 0.0043 |
| | | none | 0 | acc_norm | ↑ | 0.2504 | ± | 0.0043 |
| piqa | 1 | none | 0 | acc | ↑ | 0.4951 | ± | 0.0117 |
| | | none | 0 | acc_norm | ↑ | 0.4951 | ± | 0.0117 |
| truthfulqa_gen | 3 | none | 0 | bleu_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | bleu_diff | ↑ | -0.0002 | ± | 0.0002 |
| | | none | 0 | bleu_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge1_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rouge2_max | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_acc | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_diff | ↑ | 0.0000 | ± | 0.0000 |
| | | none | 0 | rougeL_max | ↑ | 0.0000 | ± | 0.0000 |
| truthfulqa_mc1 | 2 | none | 0 | acc | ↑ | 1.0000 | ± | 0.0000 |
| truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | NaN | ± | NaN |
| winogrande | 1 | none | 0 | acc | ↑ | 0.4957 | ± | 0.0141 |
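
To make "strange" concrete, the multiple-choice scores can be compared with what uniform random guessing would give. The sketch below is a rough check, not part of lm-evaluation-harness: the accuracies and standard errors are copied from the table above, the option counts (4 for hellaswag, 2 for piqa and winogrande) come from the task definitions, and the `diagnose` helper and its 2-sigma threshold are illustrative assumptions.

```python
import math

# Scores copied from the results table above: task -> (accuracy, stderr).
reported = {
    "hellaswag": (0.2504, 0.0043),
    "piqa": (0.4951, 0.0117),
    "winogrande": (0.4957, 0.0141),
    "truthfulqa_mc1": (1.0000, 0.0000),
    "truthfulqa_mc2": (float("nan"), float("nan")),
}

# Answer options per item; uniform guessing scores 1 / n_choices.
# truthfulqa is left out because its number of options varies per question.
n_choices = {"hellaswag": 4, "piqa": 2, "winogrande": 2}


def diagnose(task, acc, stderr):
    """Return a short label describing how suspicious a reported score is."""
    if math.isnan(acc):
        return "NaN score"
    if stderr == 0.0:
        # Zero spread means every item received the same score.
        return "degenerate (zero spread across items)"
    if task in n_choices:
        chance = 1.0 / n_choices[task]
        z = (acc - chance) / stderr  # distance from chance in standard errors
        if abs(z) < 2.0:  # assumed threshold, roughly a 95% interval
            return f"at chance (z = {z:+.2f})"
        return f"differs from chance (z = {z:+.2f})"
    return "no chance baseline defined"


for task, (acc, stderr) in reported.items():
    print(f"{task:15s} {diagnose(task, acc, stderr)}")
```

Under these assumptions, hellaswag, piqa, and winogrande all land within one standard error of chance, while truthfulqa_mc1 and truthfulqa_mc2 are degenerate and NaN respectively. That pattern would be consistent with the model producing unusable log-likelihoods rather than with a genuinely weak model.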