frankenstallm/source/eval/outputs/pipeline.log
2026-03-05 03:18:13 [INFO] ========================================================================
2026-03-05 03:18:13 [INFO] FRANKENSTALLM 3B — Full Evaluation Pipeline
2026-03-05 03:18:13 [INFO] ========================================================================
2026-03-05 03:18:13 [INFO] Checkpoint : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/checkpoints/korean_3b_fp8_run1/checkpoint-0057000
2026-03-05 03:18:13 [INFO] Tokenizer : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/tokenizer/korean_sp/tokenizer.json
2026-03-05 03:18:13 [INFO] Data dir : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/data
2026-03-05 03:18:13 [INFO] Output dir : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318
2026-03-05 03:18:13 [INFO] GPUs : [2, 3, 4, 5, 6, 7]
2026-03-05 03:18:13 [INFO] SEQ_LEN : 2048 STRIDE: 512 BATCH_SIZE: 32
2026-03-05 03:18:13 [INFO] Phases : phase0=run phase1=run phase2=run
2026-03-05 03:18:13 [INFO]
2026-03-05 03:18:13 [INFO] ------------------------------------------------------------------------
2026-03-05 03:18:13 [INFO] PHASE 0 — HF Checkpoint Conversion
2026-03-05 03:18:13 [INFO] ------------------------------------------------------------------------
2026-03-05 03:18:13 [INFO] Running: /usr/bin/python /PROJECT/0325120031_A/ghong/taketimes/llm-bang/scripts/convert_to_hf.py --checkpoint /PROJECT/0325120031_A/ghong/taketimes/llm-bang/checkpoints/korean_3b_fp8_run1/checkpoint-0057000 --output /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000 --tokenizer /PROJECT/0325120031_A/ghong/taketimes/llm-bang/tokenizer/korean_sp/tokenizer.json
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
Checkpoint : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/checkpoints/korean_3b_fp8_run1/checkpoint-0057000
Output : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000
Model : d_model=3072, n_layers=28, vocab_size=64000, use_fp8=True
Loading model.pt ...
Source keys: 255
Remapping weight names ...
Destination keys: 171
Saving model.safetensors ...
Saved config.json
Copied tokenizer: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/tokenizer/korean_sp/tokenizer.json -> /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000/tokenizer.json
Done! HF model saved to: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000
Verify: ls -lh /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000
2026-03-05 03:18:30 [INFO] HF checkpoint saved to: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000
2026-03-05 03:18:30 [INFO] Phase 0 complete in 17s.
2026-03-05 03:18:30 [INFO]
2026-03-05 03:18:30 [INFO] ------------------------------------------------------------------------
2026-03-05 03:18:30 [INFO] PHASE 1 — Internal Evaluation — 6 GPU Parallel
2026-03-05 03:18:30 [INFO] ------------------------------------------------------------------------
2026-03-05 03:18:30 [INFO] Submitted: GPU 5 — Calibration + Token NLL
2026-03-05 03:18:30 [INFO] Submitted: GPU 6 — Generation (15 prompts × 4 temps)
2026-03-05 03:18:30 [INFO] Submitted: GPU 7 — Repetition grid (12 × 5)
2026-03-05 03:18:30 [INFO] Submitted: GPU 2 — PPL: 3b_val.bin
2026-03-05 03:18:30 [INFO] Submitted: GPU 3 — PPL: korean_c4 + korean_val
2026-03-05 03:18:30 [INFO] Submitted: GPU 4 — PPL: hplt_ko + cc100_ko + PPL: 7 cosmo files + PPL: 7 remaining files
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
[PPL cuda:2] Loading model for 3b...
[PPL cuda:2] 3b: 75,681,623 tokens, 151.4 MB
[PPL_MULTI cuda:3] Loading model once for 2 files...
[PPL cuda:3] korean_c4: 15,159,838 tokens, 30.3 MB
[CALIB cuda:5] Loading model...
[CALIB cuda:5] Using 50,000 tokens from 3b_val.bin
[PPL_MULTI cuda:4] Loading model once for 16 files...
[PPL cuda:4] hplt_ko: 16,165,735 tokens, 32.3 MB
[DCTN-0301014756:3083641:0:3084057] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
2026-03-05 03:18:48 [ERROR] [FAILED] GPU 5 — Calibration + Token NLL
Traceback (most recent call last):
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2026-03-05 03:18:48 [ERROR] [FAILED] GPU 6 — Generation (15 prompts × 4 temps)
Traceback (most recent call last):
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2026-03-05 03:18:48 [ERROR] [FAILED] GPU 7 — Repetition grid (12 × 5)
Traceback (most recent call last):
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2026-03-05 03:18:48 [ERROR] [FAILED] GPU 2 — PPL: 3b_val.bin
Traceback (most recent call last):
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2026-03-05 03:18:48 [ERROR] [FAILED] GPU 3 — PPL: korean_c4 + korean_val
Traceback (most recent call last):
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2026-03-05 03:18:48 [ERROR] [FAILED] GPU 4 — PPL: hplt_ko + cc100_ko + PPL: 7 cosmo files + PPL: 7 remaining files
Traceback (most recent call last):
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/full_eval_pipeline.py", line 515, in run_phase1
result = fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2026-03-05 03:18:50 [INFO] Phase 1 complete: 0 succeeded, 6 failed
2026-03-05 03:18:50 [INFO] Phase 1 results saved: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/phase1_results.json
2026-03-05 03:18:50 [INFO] Phase 1 complete in 19s.
2026-03-05 03:18:50 [INFO]
2026-03-05 03:18:50 [INFO] ------------------------------------------------------------------------
2026-03-05 03:18:50 [INFO] PHASE 2 — lm-eval Benchmarks — 6 GPU Parallel
2026-03-05 03:18:50 [INFO] ------------------------------------------------------------------------
2026-03-05 03:18:50 [INFO] Running 0-shot benchmarks on 8 GPUs ...
2026-03-05 03:18:50 [INFO] Submitted: [0-shot] GPU 2 — KoBEST: boolq + copa
2026-03-05 03:18:50 [INFO] Submitted: [0-shot] GPU 3 — KoBEST: hellaswag + sentineg
2026-03-05 03:18:50 [INFO] Submitted: [0-shot] GPU 4 — KoBEST: wic
2026-03-05 03:18:50 [INFO] Submitted: [0-shot] GPU 5 — HAE-RAE (all subtasks)
2026-03-05 03:18:50 [INFO] Submitted: [0-shot] GPU 6 — MMLU-KO part 1/2
2026-03-05 03:18:50 [INFO] Submitted: [0-shot] GPU 7 — MMLU-KO part 2/2
[Phase 2 [0-shot] GPU 2 — KoBEST: boolq + copa] Starting on cuda:2 — tasks: ['kobest_boolq', 'kobest_copa'], 0-shot
[Phase 2 [0-shot] GPU 3 — KoBEST: hellaswag + sentineg] Starting on cuda:3 — tasks: ['kobest_hellaswag', 'kobest_sentineg'], 0-shot
[Phase 2 [0-shot] GPU 4 — KoBEST: wic] Starting on cuda:4 — tasks: ['kobest_wic'], 0-shot
[Phase 2 [0-shot] GPU 6 — MMLU-KO part 1/2] Starting on cuda:6 — tasks: ['global_mmlu_ko_abstract_algebra', 'global_mmlu_ko_anatomy', 'global_mmlu_ko_astronomy', 'global_mmlu_ko_business_ethics', 'global_mmlu_ko_clinical_knowledge', 'global_mmlu_ko_college_biology', 'global_mmlu_ko_college_chemistry', 'global_mmlu_ko_college_computer_science', 'global_mmlu_ko_college_mathematics', 'global_mmlu_ko_college_medicine', 'global_mmlu_ko_college_physics', 'global_mmlu_ko_computer_security', 'global_mmlu_ko_conceptual_physics', 'global_mmlu_ko_econometrics', 'global_mmlu_ko_electrical_engineering', 'global_mmlu_ko_elementary_mathematics', 'global_mmlu_ko_formal_logic', 'global_mmlu_ko_global_facts', 'global_mmlu_ko_high_school_biology', 'global_mmlu_ko_high_school_chemistry', 'global_mmlu_ko_high_school_computer_science', 'global_mmlu_ko_high_school_european_history', 'global_mmlu_ko_high_school_geography', 'global_mmlu_ko_high_school_government_and_politics', 'global_mmlu_ko_high_school_macroeconomics', 'global_mmlu_ko_high_school_mathematics', 'global_mmlu_ko_high_school_microeconomics', 'global_mmlu_ko_high_school_physics', 'global_mmlu_ko_high_school_psychology'], 0-shot
[Phase 2 [0-shot] GPU 5 — HAE-RAE (all subtasks)] Starting on cuda:5 — tasks: ['haerae'], 0-shot
[Phase 2 [0-shot] GPU 7 — MMLU-KO part 2/2] Starting on cuda:7 — tasks: ['global_mmlu_ko_high_school_statistics', 'global_mmlu_ko_high_school_us_history', 'global_mmlu_ko_high_school_world_history', 'global_mmlu_ko_human_aging', 'global_mmlu_ko_human_sexuality', 'global_mmlu_ko_international_law', 'global_mmlu_ko_jurisprudence', 'global_mmlu_ko_logical_fallacies', 'global_mmlu_ko_machine_learning', 'global_mmlu_ko_management', 'global_mmlu_ko_marketing', 'global_mmlu_ko_medical_genetics', 'global_mmlu_ko_miscellaneous', 'global_mmlu_ko_moral_disputes', 'global_mmlu_ko_moral_scenarios', 'global_mmlu_ko_nutrition', 'global_mmlu_ko_philosophy', 'global_mmlu_ko_prehistory', 'global_mmlu_ko_professional_accounting', 'global_mmlu_ko_professional_law', 'global_mmlu_ko_professional_medicine', 'global_mmlu_ko_professional_psychology', 'global_mmlu_ko_public_relations', 'global_mmlu_ko_security_studies', 'global_mmlu_ko_sociology', 'global_mmlu_ko_us_foreign_policy', 'global_mmlu_ko_virology', 'global_mmlu_ko_world_religions'], 0-shot
2026-03-05 03:18:50 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:53 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:53 [INFO] Initializing hf model, with arguments: {'pretrained': '/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype': 'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:53 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:53 [INFO] Initializing hf model, with arguments: {'pretrained': '/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype': 'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:53 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_statistics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_us_history' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [INFO] Initializing hf model, with arguments: {'pretrained': '/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype': 'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_world_history' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_human_aging' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_human_sexuality' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_international_law' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_jurisprudence' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_logical_fallacies' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_machine_learning' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_management' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_marketing' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_medical_genetics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_miscellaneous' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_moral_disputes' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_moral_scenarios' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_nutrition' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_philosophy' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_prehistory' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_accounting' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_law' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_medicine' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_psychology' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_public_relations' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_security_studies' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_sociology' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_us_foreign_policy' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_virology' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_world_religions' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_abstract_algebra' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_anatomy' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_astronomy' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_business_ethics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_clinical_knowledge' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_biology' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_chemistry' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_computer_science' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_mathematics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_medicine' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_physics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_computer_security' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_conceptual_physics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_econometrics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_electrical_engineering' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_elementary_mathematics' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_formal_logic' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_global_facts' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_biology' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_chemistry' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_computer_science' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_european_history' not found in lm_eval registry — skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_geography' not found in lm_eval registry — skipping.
[LM_EVAL] Starting on cuda:7 (CUDA_VISIBLE_DEVICES=7), tasks=['global_mmlu_ko_high_school_statistics', 'global_mmlu_ko_high_school_us_history', 'global_mmlu_ko_high_school_world_history', 'global_mmlu_ko_human_aging', 'global_mmlu_ko_human_sexuality', 'global_mmlu_ko_international_law', 'global_mmlu_ko_jurisprudence', 'global_mmlu_ko_logical_fallacies', 'global_mmlu_ko_machine_learning', 'global_mmlu_ko_management', 'global_mmlu_ko_marketing', 'global_mmlu_ko_medical_genetics', 'global_mmlu_ko_miscellaneous', 'global_mmlu_ko_moral_disputes', 'global_mmlu_ko_moral_scenarios', 'global_mmlu_ko_nutrition', 'global_mmlu_ko_philosophy', 'global_mmlu_ko_prehistory', 'global_mmlu_ko_professional_accounting', 'global_mmlu_ko_professional_law', 'global_mmlu_ko_professional_medicine', 'global_mmlu_ko_professional_psychology', 'global_mmlu_ko_public_relations', 'global_mmlu_ko_security_studies', 'global_mmlu_ko_sociology', 'global_mmlu_ko_us_foreign_policy', 'global_mmlu_ko_virology', 'global_mmlu_ko_world_religions'], num_fewshot=0
[LM_EVAL] No valid tasks to evaluate.
[Phase 2 [0-shot] GPU 7 - MMLU-KO part 2/2] Done.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_government_and_politics' not found in lm_eval registry - skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_macroeconomics' not found in lm_eval registry - skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_mathematics' not found in lm_eval registry - skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_microeconomics' not found in lm_eval registry - skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_physics' not found in lm_eval registry - skipping.
2026-03-05 03:18:53 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_psychology' not found in lm_eval registry - skipping.
2026-03-05 03:18:53 [INFO] [DONE] [0-shot] GPU 7 - MMLU-KO part 2/2
[LM_EVAL] Starting on cuda:6 (CUDA_VISIBLE_DEVICES=6), tasks=['global_mmlu_ko_abstract_algebra', 'global_mmlu_ko_anatomy', 'global_mmlu_ko_astronomy', 'global_mmlu_ko_business_ethics', 'global_mmlu_ko_clinical_knowledge', 'global_mmlu_ko_college_biology', 'global_mmlu_ko_college_chemistry', 'global_mmlu_ko_college_computer_science', 'global_mmlu_ko_college_mathematics', 'global_mmlu_ko_college_medicine', 'global_mmlu_ko_college_physics', 'global_mmlu_ko_computer_security', 'global_mmlu_ko_conceptual_physics', 'global_mmlu_ko_econometrics', 'global_mmlu_ko_electrical_engineering', 'global_mmlu_ko_elementary_mathematics', 'global_mmlu_ko_formal_logic', 'global_mmlu_ko_global_facts', 'global_mmlu_ko_high_school_biology', 'global_mmlu_ko_high_school_chemistry', 'global_mmlu_ko_high_school_computer_science', 'global_mmlu_ko_high_school_european_history', 'global_mmlu_ko_high_school_geography', 'global_mmlu_ko_high_school_government_and_politics', 'global_mmlu_ko_high_school_macroeconomics', 'global_mmlu_ko_high_school_mathematics', 'global_mmlu_ko_high_school_microeconomics', 'global_mmlu_ko_high_school_physics', 'global_mmlu_ko_high_school_psychology'], num_fewshot=0
[LM_EVAL] No valid tasks to evaluate.
[Phase 2 [0-shot] GPU 6 - MMLU-KO part 1/2] Done.
2026-03-05 03:18:53 [INFO] [DONE] [0-shot] GPU 6 - MMLU-KO part 1/2
2026-03-05 03:18:53 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:53 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:54 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:18:54 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:54 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:54 [WARNING] [LM_EVAL] Task 'kobest_wic' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:4 (CUDA_VISIBLE_DEVICES=4), tasks=['kobest_wic'], num_fewshot=0
[LM_EVAL] Evaluating 1 task(s) together: ['kobest_wic']
[LM_EVAL] Evaluating task 'kobest_wic' individually...
[LM_EVAL] Evaluation complete in 1.6s
[LM_EVAL] Skipped tasks: ['kobest_wic']
[Phase 2 [0-shot] GPU 4 - KoBEST: wic] Done.
2026-03-05 03:18:54 [INFO] [DONE] [0-shot] GPU 4 - KoBEST: wic
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:18:55 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:55 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Task 'kobest_hellaswag' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
2026-03-05 03:18:55 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:55 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Task 'kobest_sentineg' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:3 (CUDA_VISIBLE_DEVICES=3), tasks=['kobest_hellaswag', 'kobest_sentineg'], num_fewshot=0
[LM_EVAL] Evaluating 2 task(s) together: ['kobest_hellaswag', 'kobest_sentineg']
[LM_EVAL] Evaluating task 'kobest_hellaswag' individually...
[LM_EVAL] Evaluating task 'kobest_sentineg' individually...
[LM_EVAL] Evaluation complete in 1.6s
[LM_EVAL] Skipped tasks: ['kobest_hellaswag', 'kobest_sentineg']
[Phase 2 [0-shot] GPU 3 - KoBEST: hellaswag + sentineg] Done.
2026-03-05 03:18:55 [INFO] [DONE] [0-shot] GPU 3 - KoBEST: hellaswag + sentineg
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:18:55 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:55 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Task 'kobest_boolq' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
2026-03-05 03:18:55 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:55 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Task 'kobest_copa' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:2 (CUDA_VISIBLE_DEVICES=2), tasks=['kobest_boolq', 'kobest_copa'], num_fewshot=0
[LM_EVAL] Evaluating 2 task(s) together: ['kobest_boolq', 'kobest_copa']
[LM_EVAL] Evaluating task 'kobest_boolq' individually...
[LM_EVAL] Evaluating task 'kobest_copa' individually...
[LM_EVAL] Evaluation complete in 1.6s
[LM_EVAL] Skipped tasks: ['kobest_boolq', 'kobest_copa']
[Phase 2 [0-shot] GPU 2 - KoBEST: boolq + copa] Done.
2026-03-05 03:18:55 [INFO] [DONE] [0-shot] GPU 2 - KoBEST: boolq + copa
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:18:55 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:55 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:55 [WARNING] [LM_EVAL] Task 'haerae' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:5 (CUDA_VISIBLE_DEVICES=5), tasks=['haerae'], num_fewshot=0
[LM_EVAL] Evaluating 1 task(s) together: ['haerae']
[LM_EVAL] Evaluating task 'haerae' individually...
[LM_EVAL] Evaluation complete in 1.6s
[LM_EVAL] Skipped tasks: ['haerae']
[Phase 2 [0-shot] GPU 5 - HAE-RAE (all subtasks)] Done.
2026-03-05 03:18:55 [INFO] [DONE] [0-shot] GPU 5 - HAE-RAE (all subtasks)
2026-03-05 03:18:55 [INFO] Phase 2 (0-shot) complete: 6 succeeded, 0 failed
2026-03-05 03:18:55 [INFO] Attempting 5-shot benchmarks ...
2026-03-05 03:18:55 [INFO] Submitted: [5-shot] GPU 2 - KoBEST: boolq + copa
2026-03-05 03:18:55 [INFO] Submitted: [5-shot] GPU 3 - KoBEST: hellaswag + sentineg
2026-03-05 03:18:55 [INFO] Submitted: [5-shot] GPU 4 - KoBEST: wic
2026-03-05 03:18:55 [INFO] Submitted: [5-shot] GPU 5 - HAE-RAE (all subtasks)
2026-03-05 03:18:55 [INFO] Submitted: [5-shot] GPU 6 - MMLU-KO part 1/2
2026-03-05 03:18:55 [INFO] Submitted: [5-shot] GPU 7 - MMLU-KO part 2/2
[Phase 2 [5-shot] GPU 2 - KoBEST: boolq + copa] Starting on cuda:2 - tasks: ['kobest_boolq', 'kobest_copa'], 5-shot
[Phase 2 [5-shot] GPU 3 - KoBEST: hellaswag + sentineg] Starting on cuda:3 - tasks: ['kobest_hellaswag', 'kobest_sentineg'], 5-shot
[Phase 2 [5-shot] GPU 4 - KoBEST: wic] Starting on cuda:4 - tasks: ['kobest_wic'], 5-shot
[Phase 2 [5-shot] GPU 5 - HAE-RAE (all subtasks)] Starting on cuda:5 - tasks: ['haerae'], 5-shot
[Phase 2 [5-shot] GPU 6 - MMLU-KO part 1/2] Starting on cuda:6 - tasks: ['global_mmlu_ko_abstract_algebra', 'global_mmlu_ko_anatomy', 'global_mmlu_ko_astronomy', 'global_mmlu_ko_business_ethics', 'global_mmlu_ko_clinical_knowledge', 'global_mmlu_ko_college_biology', 'global_mmlu_ko_college_chemistry', 'global_mmlu_ko_college_computer_science', 'global_mmlu_ko_college_mathematics', 'global_mmlu_ko_college_medicine', 'global_mmlu_ko_college_physics', 'global_mmlu_ko_computer_security', 'global_mmlu_ko_conceptual_physics', 'global_mmlu_ko_econometrics', 'global_mmlu_ko_electrical_engineering', 'global_mmlu_ko_elementary_mathematics', 'global_mmlu_ko_formal_logic', 'global_mmlu_ko_global_facts', 'global_mmlu_ko_high_school_biology', 'global_mmlu_ko_high_school_chemistry', 'global_mmlu_ko_high_school_computer_science', 'global_mmlu_ko_high_school_european_history', 'global_mmlu_ko_high_school_geography', 'global_mmlu_ko_high_school_government_and_politics', 'global_mmlu_ko_high_school_macroeconomics', 'global_mmlu_ko_high_school_mathematics', 'global_mmlu_ko_high_school_microeconomics', 'global_mmlu_ko_high_school_physics', 'global_mmlu_ko_high_school_psychology'], 5-shot
[Phase 2 [5-shot] GPU 7 - MMLU-KO part 2/2] Starting on cuda:7 - tasks: ['global_mmlu_ko_high_school_statistics', 'global_mmlu_ko_high_school_us_history', 'global_mmlu_ko_high_school_world_history', 'global_mmlu_ko_human_aging', 'global_mmlu_ko_human_sexuality', 'global_mmlu_ko_international_law', 'global_mmlu_ko_jurisprudence', 'global_mmlu_ko_logical_fallacies', 'global_mmlu_ko_machine_learning', 'global_mmlu_ko_management', 'global_mmlu_ko_marketing', 'global_mmlu_ko_medical_genetics', 'global_mmlu_ko_miscellaneous', 'global_mmlu_ko_moral_disputes', 'global_mmlu_ko_moral_scenarios', 'global_mmlu_ko_nutrition', 'global_mmlu_ko_philosophy', 'global_mmlu_ko_prehistory', 'global_mmlu_ko_professional_accounting', 'global_mmlu_ko_professional_law', 'global_mmlu_ko_professional_medicine', 'global_mmlu_ko_professional_psychology', 'global_mmlu_ko_public_relations', 'global_mmlu_ko_security_studies', 'global_mmlu_ko_sociology', 'global_mmlu_ko_us_foreign_policy', 'global_mmlu_ko_virology', 'global_mmlu_ko_world_religions'], 5-shot
2026-03-05 03:18:56 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:56 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:56 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:56 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:56 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:56 [INFO] TensorFlow version 2.20.0 available.
2026-03-05 03:18:58 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:58 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:58 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:58 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:58 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:58 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:58 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:18:58 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_statistics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_us_history' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_world_history' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_human_aging' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_human_sexuality' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_international_law' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_jurisprudence' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_logical_fallacies' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_machine_learning' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_management' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_marketing' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_medical_genetics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_miscellaneous' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_moral_disputes' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_moral_scenarios' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_nutrition' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_philosophy' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_prehistory' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_accounting' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_law' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_medicine' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_professional_psychology' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_public_relations' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_security_studies' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_sociology' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_us_foreign_policy' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_virology' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_world_religions' not found in lm_eval registry - skipping.
[LM_EVAL] Starting on cuda:7 (CUDA_VISIBLE_DEVICES=7), tasks=['global_mmlu_ko_high_school_statistics', 'global_mmlu_ko_high_school_us_history', 'global_mmlu_ko_high_school_world_history', 'global_mmlu_ko_human_aging', 'global_mmlu_ko_human_sexuality', 'global_mmlu_ko_international_law', 'global_mmlu_ko_jurisprudence', 'global_mmlu_ko_logical_fallacies', 'global_mmlu_ko_machine_learning', 'global_mmlu_ko_management', 'global_mmlu_ko_marketing', 'global_mmlu_ko_medical_genetics', 'global_mmlu_ko_miscellaneous', 'global_mmlu_ko_moral_disputes', 'global_mmlu_ko_moral_scenarios', 'global_mmlu_ko_nutrition', 'global_mmlu_ko_philosophy', 'global_mmlu_ko_prehistory', 'global_mmlu_ko_professional_accounting', 'global_mmlu_ko_professional_law', 'global_mmlu_ko_professional_medicine', 'global_mmlu_ko_professional_psychology', 'global_mmlu_ko_public_relations', 'global_mmlu_ko_security_studies', 'global_mmlu_ko_sociology', 'global_mmlu_ko_us_foreign_policy', 'global_mmlu_ko_virology', 'global_mmlu_ko_world_religions'], num_fewshot=5
[LM_EVAL] No valid tasks to evaluate.
[Phase 2 [5-shot] GPU 7 - MMLU-KO part 2/2] Done.
2026-03-05 03:18:58 [INFO] [DONE] [5-shot] GPU 7 - MMLU-KO part 2/2
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_abstract_algebra' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_anatomy' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_astronomy' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_business_ethics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_clinical_knowledge' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_biology' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_chemistry' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_computer_science' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_mathematics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_medicine' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_college_physics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_computer_security' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_conceptual_physics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_econometrics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_electrical_engineering' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_elementary_mathematics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_formal_logic' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_global_facts' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_biology' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_chemistry' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_computer_science' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_european_history' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_geography' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_government_and_politics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_macroeconomics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_mathematics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_microeconomics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_physics' not found in lm_eval registry - skipping.
2026-03-05 03:18:58 [WARNING] [LM_EVAL] Task 'global_mmlu_ko_high_school_psychology' not found in lm_eval registry - skipping.
[LM_EVAL] Starting on cuda:6 (CUDA_VISIBLE_DEVICES=6), tasks=['global_mmlu_ko_abstract_algebra', 'global_mmlu_ko_anatomy', 'global_mmlu_ko_astronomy', 'global_mmlu_ko_business_ethics', 'global_mmlu_ko_clinical_knowledge', 'global_mmlu_ko_college_biology', 'global_mmlu_ko_college_chemistry', 'global_mmlu_ko_college_computer_science', 'global_mmlu_ko_college_mathematics', 'global_mmlu_ko_college_medicine', 'global_mmlu_ko_college_physics', 'global_mmlu_ko_computer_security', 'global_mmlu_ko_conceptual_physics', 'global_mmlu_ko_econometrics', 'global_mmlu_ko_electrical_engineering', 'global_mmlu_ko_elementary_mathematics', 'global_mmlu_ko_formal_logic', 'global_mmlu_ko_global_facts', 'global_mmlu_ko_high_school_biology', 'global_mmlu_ko_high_school_chemistry', 'global_mmlu_ko_high_school_computer_science', 'global_mmlu_ko_high_school_european_history', 'global_mmlu_ko_high_school_geography', 'global_mmlu_ko_high_school_government_and_politics', 'global_mmlu_ko_high_school_macroeconomics', 'global_mmlu_ko_high_school_mathematics', 'global_mmlu_ko_high_school_microeconomics', 'global_mmlu_ko_high_school_physics', 'global_mmlu_ko_high_school_psychology'], num_fewshot=5
[LM_EVAL] No valid tasks to evaluate.
[Phase 2 [5-shot] GPU 6 - MMLU-KO part 1/2] Done.
2026-03-05 03:18:58 [INFO] [DONE] [5-shot] GPU 6 - MMLU-KO part 1/2
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:19:00 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:19:00 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Task 'kobest_hellaswag' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
2026-03-05 03:19:00 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:19:00 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Task 'kobest_sentineg' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:3 (CUDA_VISIBLE_DEVICES=3), tasks=['kobest_hellaswag', 'kobest_sentineg'], num_fewshot=5
[LM_EVAL] Evaluating 2 task(s) together: ['kobest_hellaswag', 'kobest_sentineg']
[LM_EVAL] Evaluating task 'kobest_hellaswag' individually...
[LM_EVAL] Evaluating task 'kobest_sentineg' individually...
[LM_EVAL] Evaluation complete in 1.6s
[LM_EVAL] Skipped tasks: ['kobest_hellaswag', 'kobest_sentineg']
[Phase 2 [5-shot] GPU 3 - KoBEST: hellaswag + sentineg] Done.
2026-03-05 03:19:00 [INFO] [DONE] [5-shot] GPU 3 - KoBEST: hellaswag + sentineg
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:19:00 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:19:00 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Task 'kobest_boolq' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
2026-03-05 03:19:00 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:19:00 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Task 'kobest_copa' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:2 (CUDA_VISIBLE_DEVICES=2), tasks=['kobest_boolq', 'kobest_copa'], num_fewshot=5
[LM_EVAL] Evaluating 2 task(s) together: ['kobest_boolq', 'kobest_copa']
[LM_EVAL] Evaluating task 'kobest_boolq' individually...
[LM_EVAL] Evaluating task 'kobest_copa' individually...
[LM_EVAL] Evaluation complete in 1.6s
[LM_EVAL] Skipped tasks: ['kobest_boolq', 'kobest_copa']
[Phase 2 [5-shot] GPU 2 - KoBEST: boolq + copa] Done.
2026-03-05 03:19:00 [INFO] [DONE] [5-shot] GPU 2 - KoBEST: boolq + copa
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:19:00 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:19:00 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Task 'haerae' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:5 (CUDA_VISIBLE_DEVICES=5), tasks=['haerae'], num_fewshot=5
[LM_EVAL] Evaluating 1 task(s) together: ['haerae']
[LM_EVAL] Evaluating task 'haerae' individually...
[LM_EVAL] Evaluation complete in 1.7s
[LM_EVAL] Skipped tasks: ['haerae']
[Phase 2 [5-shot] GPU 5 - HAE-RAE (all subtasks)] Done.
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Batch evaluation failed (lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'). Falling back to per-task evaluation.
2026-03-05 03:19:00 [INFO] [DONE] [5-shot] GPU 5 - HAE-RAE (all subtasks)
2026-03-05 03:19:00 [INFO] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-05 03:19:00 [INFO] Initializing hf model, with arguments: {'pretrained':
'/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/hf_3b_checkpoint-0057000', 'dtype':
'bfloat16', 'device': 'cuda:0'}
2026-03-05 03:19:00 [WARNING] [LM_EVAL] Task 'kobest_wic' failed: lm_eval.models.huggingface.HFLM() got multiple values for keyword argument 'device'
[LM_EVAL] Starting on cuda:4 (CUDA_VISIBLE_DEVICES=4), tasks=['kobest_wic'], num_fewshot=5
[LM_EVAL] Evaluating 1 task(s) together: ['kobest_wic']
[LM_EVAL] Evaluating task 'kobest_wic' individually...
[LM_EVAL] Evaluation complete in 1.7s
[LM_EVAL] Skipped tasks: ['kobest_wic']
[Phase 2 [5-shot] GPU 4 - KoBEST: wic] Done.
2026-03-05 03:19:00 [INFO] [DONE] [5-shot] GPU 4 - KoBEST: wic
2026-03-05 03:19:01 [INFO] Phase 2 (5-shot) complete: 6 succeeded, 0 failed
2026-03-05 03:19:01 [INFO] Phase 2 results saved: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/phase2_results.json
2026-03-05 03:19:01 [INFO] Phase 2 complete in 11s.
2026-03-05 03:19:01 [INFO]
2026-03-05 03:19:01 [INFO] ------------------------------------------------------------------------
2026-03-05 03:19:01 [INFO] PHASE 3 - Report Generation
2026-03-05 03:19:01 [INFO] ------------------------------------------------------------------------
2026-03-05 03:19:01 [INFO] Report saved: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/full_eval_report.md
2026-03-05 03:19:01 [INFO] Phase 3 complete in 0s.
2026-03-05 03:19:01 [INFO] ========================================================================
2026-03-05 03:19:01 [INFO] PIPELINE COMPLETE
2026-03-05 03:19:01 [INFO] ========================================================================
2026-03-05 03:19:01 [INFO] Total time : 47s
2026-03-05 03:19:01 [INFO] Output dir : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318
2026-03-05 03:19:01 [INFO] Phase 1 results : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/phase1_results.json
2026-03-05 03:19:01 [INFO] Phase 2 results : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/phase2_results.json
2026-03-05 03:19:01 [INFO] Gen samples : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/generation_samples.json
2026-03-05 03:19:01 [INFO] Report : /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/outputs/3b_full_eval_20260305_0318/full_eval_report.md
2026-03-05 03:19:01 [INFO] Phase 1 tasks : 0 OK / 6 failed
2026-03-05 03:19:01 [INFO] Phase 2 tasks : 6 OK / 0 failed
2026-03-05 03:19:01 [INFO] ========================================================================
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 60 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '