## Base model evaluation

timestamp: 2025-11-03 09:13:28

- Model: base_model (step 21400)
- CORE metric: 0.2137
- hellaswag_zeroshot: 0.2687
- jeopardy: 0.1214
- bigbench_qa_wikidata: 0.5278
- arc_easy: 0.5314
- arc_challenge: 0.1251
- copa: 0.3600
- commonsense_qa: 0.1145
- piqa: 0.3917
- openbook_qa: 0.1360
- lambada_openai: 0.3549
- hellaswag: 0.2634
- winograd: 0.2601
- winogrande: 0.1018
- bigbench_dyck_languages: 0.1080
- agi_eval_lsat_ar: 0.1359
- bigbench_cs_algorithms: 0.3720
- bigbench_operators: 0.1429
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.2528
- coqa: 0.1932
- boolq: -0.2369
- bigbench_language_identification: 0.1762
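As a sanity check on the report, the CORE metric here works out to the unweighted mean of the 22 per-task scores (this relationship is inferred from the numbers themselves, not from the evaluation code):

```python
# Per-task scores copied from the evaluation report above.
scores = {
    "hellaswag_zeroshot": 0.2687,
    "jeopardy": 0.1214,
    "bigbench_qa_wikidata": 0.5278,
    "arc_easy": 0.5314,
    "arc_challenge": 0.1251,
    "copa": 0.3600,
    "commonsense_qa": 0.1145,
    "piqa": 0.3917,
    "openbook_qa": 0.1360,
    "lambada_openai": 0.3549,
    "hellaswag": 0.2634,
    "winograd": 0.2601,
    "winogrande": 0.1018,
    "bigbench_dyck_languages": 0.1080,
    "agi_eval_lsat_ar": 0.1359,
    "bigbench_cs_algorithms": 0.3720,
    "bigbench_operators": 0.1429,
    "bigbench_repeat_copy_logic": 0.0000,
    "squad": 0.2528,
    "coqa": 0.1932,
    "boolq": -0.2369,  # negative: below the random-chance baseline
    "bigbench_language_identification": 0.1762,
}

# Unweighted mean across all 22 tasks.
core = sum(scores.values()) / len(scores)
print(f"CORE: {core:.4f}")  # → CORE: 0.2137, matching the reported value
```

Note that individual scores can be negative (boolq: -0.2369), consistent with scores being centered so that random-chance performance maps to 0.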