## Base model evaluation

timestamp: 2025-11-03 09:13:28

- Model: base_model (step 21400)
- CORE metric: 0.2137
- hellaswag_zeroshot: 0.2687
- jeopardy: 0.1214
- bigbench_qa_wikidata: 0.5278
- arc_easy: 0.5314
- arc_challenge: 0.1251
- copa: 0.3600
- commonsense_qa: 0.1145
- piqa: 0.3917
- openbook_qa: 0.1360
- lambada_openai: 0.3549
- hellaswag: 0.2634
- winograd: 0.2601
- winogrande: 0.1018
- bigbench_dyck_languages: 0.1080
- agi_eval_lsat_ar: 0.1359
- bigbench_cs_algorithms: 0.3720
- bigbench_operators: 0.1429
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.2528
- coqa: 0.1932
- boolq: -0.2369
- bigbench_language_identification: 0.1762
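As a sanity check on the report, the CORE metric here works out to the unweighted mean of the 22 per-task scores (this relationship is inferred from the numbers themselves, not from the evaluation code):

```python
# Per-task scores copied from the evaluation report above.
scores = {
    "hellaswag_zeroshot": 0.2687,
    "jeopardy": 0.1214,
    "bigbench_qa_wikidata": 0.5278,
    "arc_easy": 0.5314,
    "arc_challenge": 0.1251,
    "copa": 0.3600,
    "commonsense_qa": 0.1145,
    "piqa": 0.3917,
    "openbook_qa": 0.1360,
    "lambada_openai": 0.3549,
    "hellaswag": 0.2634,
    "winograd": 0.2601,
    "winogrande": 0.1018,
    "bigbench_dyck_languages": 0.1080,
    "agi_eval_lsat_ar": 0.1359,
    "bigbench_cs_algorithms": 0.3720,
    "bigbench_operators": 0.1429,
    "bigbench_repeat_copy_logic": 0.0000,
    "squad": 0.2528,
    "coqa": 0.1932,
    "boolq": -0.2369,  # negative: below the random-chance baseline
    "bigbench_language_identification": 0.1762,
}

# Unweighted mean across all 22 tasks.
core = sum(scores.values()) / len(scores)
print(f"CORE: {core:.4f}")  # → CORE: 0.2137, matching the reported value
```

Note that individual scores can be negative (boolq: -0.2369), consistent with scores being centered so that random-chance performance maps to 0.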