bfuzzy1 committed on
Commit
f33ebff
·
verified ·
1 Parent(s): 46e510c

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +9 -10
README.md CHANGED
@@ -106,8 +106,8 @@ Intelligence per parameter (board avg vs log-params; the shaded region is above
106
  ![Intelligence per parameter](intelligence_per_param.png)
107
 
108
  The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
109
- sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon, and well clear of
110
- the per-param underachievers lower in the field. It does this on roughly 1/65th the tokens of the leading
111
  models, which train on about 25B.
112
 
113
  Training loss and data mix, v6 vs v9:
@@ -118,11 +118,10 @@ v9 starts from v6, drops the dead PLE down to 10.41M, and trains on the pure-cur
118
  tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It
119
  gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
120
  showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
121
- out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. And ~35.8 looks like a
122
- real ceiling for an 11M model on this board: raw web sinks it, the leaner pure-curated mix holds it, and
123
- nothing we tried pushed past it. So v6 stays the packaged base, and the next gains have to come from
124
- capability stages rather than more base pretraining. Unique tokens stay around 0.5B the whole way, about
125
- 1/50th of what the leaders use.
126
 
127
  ## Evaluation
128
 
@@ -162,14 +161,14 @@ For context — at 11.46M it's just over the 10M line, but it outscores the sub-
162
 
163
  v6 sits above the size-fit line (~+0.3σ) — above-trend per parameter, ahead of liodon. The v9 challenger
164
  (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
165
- v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't move the board, v6 stays the
166
- base. From here the work moves to the capability stages (chat, reasoning).
167
 
168
  What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
169
  (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
170
  data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
171
  reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
172
- chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for
173
  discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.
174
 
175
  ## Limitations
 
106
  ![Intelligence per parameter](intelligence_per_param.png)
107
 
108
  The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
109
+ sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon and the other
110
+ similar-size models that fall below the line. It does this on roughly 1/65th the tokens of the leading
111
  models, which train on about 25B.
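
A "+0.3σ above the size-fit line" figure of this kind can be reproduced with a plain least-squares fit of board average against log-parameters. The sketch below is only an illustration: the function name `size_fit_residual_sigma`, the base-10 log, the `ddof=2` choice, and the board entries are all assumptions, not the fit or the data behind the plot above.

```python
import numpy as np

def size_fit_residual_sigma(params, scores, target_params, target_score):
    """Fit score ~ intercept + slope * log10(params) and return the target's
    residual in units of the fit's residual sigma, plus that sigma."""
    x = np.log10(np.asarray(params, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)      # degree-1 least-squares fit
    residuals = y - (slope * x + intercept)
    sigma = residuals.std(ddof=2)               # two fitted parameters (assumed convention)
    predicted = slope * np.log10(target_params) + intercept
    return (target_score - predicted) / sigma, sigma

# Hypothetical board entries for illustration only, NOT the real leaderboard.
board_params = [5e6, 8e6, 11.46e6, 20e6, 50e6, 100e6]
board_scores = [31.0, 36.5, 35.8, 34.0, 41.5, 40.0]

z, sigma = size_fit_residual_sigma(board_params, board_scores, 11.46e6, 35.8)
print(f"residual sigma = {sigma:.2f}, target sits {z:+.2f} sigma from the fit line")
```

A positive value means the model scores above what its size alone would predict under the fitted line; the magnitude depends heavily on which models are included in the fit.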
112
 
113
  Training loss and data mix, v6 vs v9:
 
118
  tie: board avg 35.70 against v6's 35.80, a 0.10 gap that's well inside the noise, at 9% fewer parameters. It
119
  gave up about 1.7 points of HellaSwag and picked up 2.0 on ArithMark (28.4, the folded arithmetic finally
120
  showing), and the per-param number came out about even too (~+0.32σ vs v6's +0.31σ). Two conclusions fall
121
+ out of that. PLE really was dead weight, since cutting 1.05M params changed nothing. Across the variants we
122
+ ran, the board avg stayed near 35.8: raw web lowered it and the leaner pure-curated mix matched v6, so none of
123
+ them beat the base, and v6 stays the packaged checkpoint. Unique tokens stay around 0.5B the whole way, a
124
+ small fraction of what the leading models use, so there is likely more to gain from additional curated tokens.
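
The v6 vs v9 figures quoted above can be checked with a few lines of arithmetic. This is a rough sanity check using only numbers stated in this README; treating the size-fit residual σ (3.07) as the yardstick for "noise" is an assumption, since the text does not say which noise estimate it has in mind.

```python
# Figures taken from the README text (parameters in millions).
v6_params_m, v9_params_m = 11.46, 10.41
v6_avg, v9_avg = 35.80, 35.70
residual_sigma = 3.07  # assumed noise yardstick: the size-fit residual sigma

params_cut_m = v6_params_m - v9_params_m
params_cut_pct = 100 * params_cut_m / v6_params_m
gap_in_sigma = (v6_avg - v9_avg) / residual_sigma

print(f"{params_cut_m:.2f}M fewer params ({params_cut_pct:.1f}%)")
print(f"board-avg gap = {gap_in_sigma:.2f} sigma")
```

Under that assumption, the 0.10-point gap is a few hundredths of a σ, which is consistent with calling the two checkpoints a tie.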
 
125
 
126
  ## Evaluation
127
 
 
161
 
162
  v6 sits above the size-fit line (~+0.3σ) — above-trend per parameter, ahead of liodon. The v9 challenger
163
  (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
164
+ v9 confirmed that PLE was dead weight, but since it didn't beat v6's board score, v6 stays the base. From
165
+ here the work moved to the capability stages (chat, reasoning).
166
 
167
  What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
168
  (67.5), PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks to the folded-in computation
169
  data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
170
  reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
171
+ chance, partly the limited capacity at this size and partly loglikelihood length-bias. It's a solid base for
172
  discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.
173
 
174
  ## Limitations