bfuzzy1/Rodan-Base

@@ -106,7 +106,7 @@ Intelligence per parameter (board avg vs log-params; the shaded region is above
 ![Intelligence per parameter](intelligence_per_param.png)
 The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
-sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon and the other
 similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
 models, which train on about 25B.
@@ -119,7 +119,7 @@ tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise
 gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
 showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
 out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. Across the variants we
-ran, the board avg stayed near 35.8 — raw web lowered it, the leaner pure-curated mix matched v6 — so none of
 them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
 small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.
@@ -148,7 +148,7 @@ Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
 | CommonsenseQA | acc | 20.7 | 20 |
 | **Board avg (÷4)** | | **35.80** | |
-For context — at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
 1/65th the tokens:
 | Model | Params | Tokens | Board avg (÷4) |
@@ -159,12 +159,12 @@ For context — at 11.46M it's just over the 10M line, but it outscores the sub-
 ![v6 benchmarks](v6_v9_metrics.png)
-v6 sits above the size-fit line (~+0.3σ) — above-trend per parameter, ahead of liodon. The v9 challenger
 (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
 v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
 here the work moved to the capability stages (chat, reasoning).
-What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
 (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near

 ![Intelligence per parameter](intelligence_per_param.png)
 The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
+sits roughly +0.3σ above the size-fit line, above-trend per parameter, ahead of liodon and the other
 similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
 models, which train on about 25B.
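The "+0.3σ above the size-fit line" figure can be read as a residual from a straight-line fit of board avg against log-params, divided by the residual σ of that fit. The following is a minimal sketch of that reading, not the script behind the plot: the function names, the base-10 log, the n - 2 degrees of freedom, and the sample points are all assumptions made for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residual_sigma(xs, ys, a, b):
    """Standard deviation of the residuals around the line (assumes n - 2 dof)."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))

def sigma_offset(params, score, a, b, sigma):
    """How many residual sigmas a model sits above (+) or below (-) the fit."""
    predicted = a + b * math.log10(params)
    return (score - predicted) / sigma

if __name__ == "__main__":
    # Hypothetical (params, board avg) pairs, NOT the real board.
    board = [(5e6, 30.0), (1e7, 34.5), (2e7, 36.0), (5e7, 41.5)]
    xs = [math.log10(p) for p, _ in board]
    ys = [s for _, s in board]
    a, b = fit_line(xs, ys)
    sigma = residual_sigma(xs, ys, a, b)
    print(sigma_offset(1.2e7, 36.0, a, b, sigma))
```

Under this reading, with σ ≈ 3.07, an offset of +0.31σ corresponds to roughly 0.95 board points above the fitted line.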
 gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
 showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
 out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. Across the variants we
+ran, the board avg stayed near 35.8: raw web lowered it, the leaner pure-curated mix matched v6, so none of
 them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
 small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.
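For reference, the board avg used throughout is (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4, so the two ARC scores count as a single slot. A small sketch of that formula follows; the function name and the scores in the demo are placeholders for illustration, not reported results.

```python
def board_avg(hellaswag, arc_easy, arc_challenge, piqa, arithmark):
    """Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4."""
    arc = (arc_easy + arc_challenge) / 2  # the two ARC splits share one slot
    return (hellaswag + arc + piqa + arithmark) / 4

if __name__ == "__main__":
    # Placeholder scores (hypothetical), only to show the call shape.
    print(board_avg(hellaswag=30.0, arc_easy=40.0, arc_challenge=20.0,
                    piqa=50.0, arithmark=20.0))
```

Because ARC is averaged before entering the sum, a one-point change in ARC-E or ARC-C moves the board avg by 0.125, while a one-point change in any of the other three moves it by 0.25.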
 | CommonsenseQA | acc | 20.7 | 20 |
 | **Board avg (÷4)** | | **35.80** | |
+For context, at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
 1/65th the tokens:
 | Model | Params | Tokens | Board avg (÷4) |
 ![v6 benchmarks](v6_v9_metrics.png)
+v6 sits above the size-fit line (~+0.3σ), above-trend per parameter, ahead of liodon. The v9 challenger
 (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
 v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
 here the work moved to the capability stages (chat, reasoning).
+What the model is actually like: it's solid for 11M on commonsense and science multiple-choice. SciQ
 (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near