Commit ec9d71f (verified), committed by bfuzzy1
1 Parent(s): f33ebff

Upload README.md with huggingface_hub

Files changed (1): README.md +5 -5
README.md CHANGED
@@ -106,7 +106,7 @@ Intelligence per parameter (board avg vs log-params; the shaded region is above
106
  ![Intelligence per parameter](intelligence_per_param.png)
107
 
108
  The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
109
- sits roughly +0.3σ above the size-fit line above-trend per parameter, ahead of liodon and the other
110
  similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
111
  models, which train on about 25B.
112
 
@@ -119,7 +119,7 @@ tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise
119
  gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
120
  showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
121
  out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. Across the variants we
122
- ran, the board avg stayed near 35.8 raw web lowered it, the leaner pure-curated mix matched v6 so none of
123
  them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
124
  small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.
125
 
@@ -148,7 +148,7 @@ Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
148
  | CommonsenseQA | acc | 20.7 | 20 |
149
  | **Board avg (÷4)** | | **35.80** | |
150
 
151
- For context at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
152
  1/65th the tokens:
153
 
154
  | Model | Params | Tokens | Board avg (÷4) |
@@ -159,12 +159,12 @@ For context — at 11.46M it's just over the 10M line, but it outscores the sub-
159
 
160
  ![v6 benchmarks](v6_v9_metrics.png)
161
 
162
- v6 sits above the size-fit line (~+0.3σ) above-trend per parameter, ahead of liodon. The v9 challenger
163
  (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
164
  v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
165
  here the work moved to the capability stages (chat, reasoning).
166
 
167
- What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
168
  (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
169
  data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
170
  reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
 
106
  ![Intelligence per parameter](intelligence_per_param.png)
107
 
108
  The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
109
+ sits roughly +0.3σ above the size-fit line, above-trend per parameter, ahead of liodon and the other
110
  similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
111
  models, which train on about 25B.
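
The "+0.3σ above the size-fit line" figure can be read as a residual from a straight-line fit of board avg against log-params, divided by the spread of the board's residuals. The sketch below shows that reading under stated assumptions: the README does not say which log base, which fitting method, or which σ estimator it uses, so ordinary least squares on log10(params) and a population standard deviation are guesses, and the function names and board data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sigma_offset(params, score, board):
    """Offset of (params, score) from the fitted line, in units of residual sigma.

    board is a list of (param_count, board_avg) pairs used for the fit.
    """
    xs = [math.log10(p) for p, _ in board]
    ys = [s for _, s in board]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Population standard deviation of the residuals (an assumption).
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    predicted = slope * math.log10(params) + intercept
    return (score - predicted) / sigma

# Hypothetical board, NOT the real leaderboard data.
example_board = [(1e6, 30.0), (1e7, 34.0), (1e8, 36.0), (1e9, 40.0)]
print(sigma_offset(1e7, 35.0, example_board))  # positive: above the fitted line
```

Under this reading, with the README's σ of about 3.07, an offset of +0.3σ corresponds to roughly 0.9 board-avg points above the fitted line.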
112
 
 
119
  gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
120
  showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
121
  out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. Across the variants we
122
+ ran, the board avg stayed near 35.8: raw web lowered it, the leaner pure-curated mix matched v6, so none of
123
  them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
124
  small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.
125
 
 
148
  | CommonsenseQA | acc | 20.7 | 20 |
149
  | **Board avg (÷4)** | | **35.80** | |
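
The board-average formula quoted in the hunk header, Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4, can be sketched in Python as below. The function and argument names are illustrative and do not come from the README. The ARC-Challenge value in the usage line is a placeholder picked so that the example lands on the reported 35.80; it is not a reported score, since that number does not appear in this excerpt.

```python
def board_avg(hellaswag, arc_easy, arc_challenge, piqa, arithmark):
    """Four-way board average: the two ARC scores are averaged into one slot."""
    arc = (arc_easy + arc_challenge) / 2
    return (hellaswag + arc + piqa + arithmark) / 4

# HellaSwag, ARC-Easy, PIQA, and ArithMark are the v6 scores quoted in the README text;
# arc_challenge=22.4 is a placeholder, not a reported result.
print(round(board_avg(hellaswag=31.8, arc_easy=35.6, arc_challenge=22.4,
                      piqa=56.0, arithmark=26.4), 2))  # 35.8
```

Averaging ARC-Easy and ARC-Challenge first means ARC counts once in the total, which is why the divisor is 4 rather than 5.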
150
 
151
+ For context, at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
152
  1/65th the tokens:
153
 
154
  | Model | Params | Tokens | Board avg (÷4) |
 
159
 
160
  ![v6 benchmarks](v6_v9_metrics.png)
161
 
162
+ v6 sits above the size-fit line (~+0.3σ), above-trend per parameter, ahead of liodon. The v9 challenger
163
  (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
164
  v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
165
  here the work moved to the capability stages (chat, reasoning).
166
 
167
+ What the model is actually like: it's solid for 11M on commonsense and science multiple-choice. SciQ
168
  (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
169
  data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
170
  reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near