bfuzzy1/Rodan-Base

@@ -166,8 +166,7 @@ v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't
 base. From here the work moves to the capability stages (chat, reasoning).
 What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
-(67.5) beats GPT-2-124M, and PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly
-above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
+(67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
 chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for