bfuzzy1/Rodan-Base

@@ -106,8 +106,8 @@ Intelligence per parameter (board avg vs log-params; the shaded region is above
 ![Intelligence per parameter](intelligence_per_param.png)
 The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
-sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon, and well clear of
-the per-param underachievers lower in the field. It does this on roughly 1/65th the tokens of the leading
 models, which train on about 25B.
 Training loss and data mix, v6 vs v9:
@@ -118,11 +118,10 @@ v9 starts from v6, drops the dead PLE down to 10.41M, and trains on the pure-cur
 tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It
 gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
 showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
-out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. And ~35.8 looks like a
-real ceiling for an 11M model on this board: raw web sinks it, the leaner pure-curated mix holds it, and
-nothing we tried pushed past it. So v6 stays the packaged base, and the next gains have to come from
-capability stages rather than more base pretraining. Unique tokens stay around 0.5B the whole way, about
-1/50th of what the leaders use.
 ## Evaluation
@@ -162,14 +161,14 @@ For context — at 11.46M it's just over the 10M line, but it outscores the sub-
 v6 sits above the size-fit line (~+0.3σ) — above-trend per parameter, ahead of liodon. The v9 challenger
 (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
-v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't move the board, v6 stays the
-base. From here the work moves to the capability stages (chat, reasoning).
 What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
 (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
-chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for
 discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.
 ## Limitations

 ![Intelligence per parameter](intelligence_per_param.png)
 The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
+sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon and the other
+similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
 models, which train on about 25B.
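A rough sketch of how a "σ above the size-fit line" figure like the one above can be computed: fit board average against log10(parameters) by least squares, take the spread of the residuals as σ, then express one model's residual in units of σ. The function names, the `n - 2` degrees-of-freedom convention, and the placeholder board values are all assumptions made for illustration. They are not the real board data or the script that produced the plot.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def residual_sigma(xs, ys, a, b):
    """Std of residuals around the line (assumed n - 2 degrees of freedom)."""
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in res) / (len(res) - 2))

def sigma_above_fit(params, score, a, b, sigma):
    """Residual of one model from the size-fit line, in units of sigma."""
    return (score - (a + b * math.log10(params))) / sigma

# Placeholder (params, board avg) pairs. Hypothetical values, NOT the real board.
board = [(5e6, 30.0), (1e7, 34.0), (3e7, 33.0), (1e8, 41.0), (3e8, 40.0)]
xs = [math.log10(p) for p, _ in board]
ys = [s for _, s in board]

a, b = fit_line(xs, ys)
sigma = residual_sigma(xs, ys, a, b)
z = sigma_above_fit(1e7, 34.0, a, b, sigma)  # positive means above-trend per parameter
print(f"slope={b:.2f} sigma={sigma:.2f} z={z:+.2f}")
```

Working in log-params keeps the fit linear across models that differ in size by orders of magnitude, which is why the plot uses that axis.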
 Training loss and data mix, v6 vs v9:
 tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It
 gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
 showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
+out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. Across the variants we
+ran, the board avg stayed near 35.8 — raw web lowered it, the leaner pure-curated mix matched v6 — so none of
+them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
+small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.
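The v6 vs v9 figures quoted above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers stated in this card (11.46M and 10.41M parameters, 35.80 and 35.70 board avg, residual σ of about 3.07). Dividing the gap by the fit's residual σ is an assumption made here as a rough yardstick; it is not necessarily how the run-to-run noise was estimated.

```python
# Values taken from the text of this card.
v6_params, v9_params = 11.46e6, 10.41e6
v6_avg, v9_avg = 35.80, 35.70
fit_sigma = 3.07  # residual sigma of the size fit

params_cut = v6_params - v9_params        # parameters removed with the PLE
pct_fewer = 100 * params_cut / v6_params  # relative size reduction
gap = v6_avg - v9_avg                     # board-average difference
gap_in_sigma = gap / fit_sigma            # gap relative to the fit's residual spread

print(f"cut={params_cut / 1e6:.2f}M ({pct_fewer:.1f}% fewer), "
      f"gap={gap:.2f} pts = {gap_in_sigma:.3f} sigma")
```

The parameter cut comes out at 1.05M, a little over 9%, and the 0.10-point gap is a few hundredths of the fit's σ, which is consistent with calling the result a tie.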
 ## Evaluation
 v6 sits above the size-fit line (~+0.3σ) — above-trend per parameter, ahead of liodon. The v9 challenger
 (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
+v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
+here the work moved to the capability stages (chat, reasoning).
 What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
 (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
+chance, partly the limited capacity at this size and partly loglikelihood length-bias. It's a solid base for
 discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.
 ## Limitations