## Base model evaluation

timestamp: 2025-11-03 09:13:28

- Model: base_model (step 21400)
- CORE metric: 0.2137
- hellaswag_zeroshot: 0.2687
- jeopardy: 0.1214
- bigbench_qa_wikidata: 0.5278
- arc_easy: 0.5314
- arc_challenge: 0.1251
- copa: 0.3600
- commonsense_qa: 0.1145
- piqa: 0.3917
- openbook_qa: 0.1360
- lambada_openai: 0.3549
- hellaswag: 0.2634
- winograd: 0.2601
- winogrande: 0.1018
- bigbench_dyck_languages: 0.1080
- agi_eval_lsat_ar: 0.1359
- bigbench_cs_algorithms: 0.3720
- bigbench_operators: 0.1429
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.2528
- coqa: 0.1932
- boolq: -0.2369
- bigbench_language_identification: 0.1762
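The reported CORE metric is consistent with a simple unweighted mean of the 22 per-task scores above (assuming that is how CORE aggregates; the scores appear to be centered, i.e. 0 is chance level, which is why boolq can be negative). A minimal sketch reproducing the aggregate:

```python
# Per-task scores copied from the report above. Values appear to be
# centered accuracies (0 = chance level), so negatives are possible.
scores = {
    "hellaswag_zeroshot": 0.2687,
    "jeopardy": 0.1214,
    "bigbench_qa_wikidata": 0.5278,
    "arc_easy": 0.5314,
    "arc_challenge": 0.1251,
    "copa": 0.3600,
    "commonsense_qa": 0.1145,
    "piqa": 0.3917,
    "openbook_qa": 0.1360,
    "lambada_openai": 0.3549,
    "hellaswag": 0.2634,
    "winograd": 0.2601,
    "winogrande": 0.1018,
    "bigbench_dyck_languages": 0.1080,
    "agi_eval_lsat_ar": 0.1359,
    "bigbench_cs_algorithms": 0.3720,
    "bigbench_operators": 0.1429,
    "bigbench_repeat_copy_logic": 0.0000,
    "squad": 0.2528,
    "coqa": 0.1932,
    "boolq": -0.2369,
    "bigbench_language_identification": 0.1762,
}

# Unweighted mean over all tasks (assumed aggregation rule).
core = sum(scores.values()) / len(scores)
print(f"CORE = {core:.4f}")  # → CORE = 0.2137
```

Running this recovers 0.2137, matching the reported CORE value, which supports the unweighted-mean reading.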