bfuzzy1 committed on
Commit
743f8c2
·
verified ·
1 Parent(s): f6922f1

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +15 -13
README.md CHANGED
@@ -22,10 +22,9 @@ that actually holds up for its size, scored on how much it gets per parameter ra
22
  | **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge |
23
  | Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following |
24
  | Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning |
25
- | Rodan-10M-Latent *(planned)* | latent reasoning | in-head compute, no CoT tokens |
26
 
27
- This card covers the base model only. The chat, reasoning, and latent stages are separate models with their
28
- own repos and cards.
29
 
30
  ## Architecture
31
 
@@ -106,9 +105,9 @@ Intelligence per parameter (board avg vs log-params; the shaded region is above
106
 
107
  ![Intelligence per parameter](intelligence_per_param.png)
108
 
109
- The fit runs over all 54 board models, with a residual σ of 3.07 that matches the board's own. Rodan v6
110
- sits +0.31σ above the size-fit line, ahead of liodon at +0.14 and well clear of the per-param
111
- underachievers like GPT-2 (124M, far below). It does this on roughly 1/50th the tokens of the leading
112
  models, which train on about 25B.
113
 
114
  Training loss and data mix, v6 vs v9:
@@ -130,7 +129,9 @@ capability stages rather than more base pretraining. Unique tokens stay around 0
130
  Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
131
  length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
132
 
133
- Zero-shot, limit 1000 examples per task. Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
 
 
134
 
135
  | Task | Metric | Score | Random |
136
  |---|---|---|---|
@@ -148,7 +149,8 @@ Zero-shot, limit 1000 examples per task. Board avg = (HellaSwag + (ARC-E + ARC-C
148
  | CommonsenseQA | acc | 20.7 | 20 |
149
  | **Board avg (÷4)** | | **35.80** | |
150
 
151
- For context, it beats the <10M leader on about 1/65th the tokens:
 
152
 
153
  | Model | Params | Tokens | Board avg (÷4) |
154
  |---|---|---|---|
@@ -158,10 +160,10 @@ For context, it beats the <10M leader on about 1/65th the tokens:
158
 
159
  ![v6 benchmarks](v6_v9_metrics.png)
160
 
161
- v6 lands around rank 22 of 54 and +0.31σ above the size-fit line, ahead of liodon at +0.14. The v9
162
- challenger (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, and about even on
163
- per-param too (~+0.32σ). v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't
164
- move the board, v6 stays the base. From here the work moves to the capability stages.
165
 
166
  What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
167
  (67.5) beats GPT-2-124M, and PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly
@@ -169,7 +171,7 @@ above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks
169
  data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
170
  reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
171
  chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for
172
- discrimination; the deeper reasoning is the job of the later reasoning and latent stages.
173
 
174
  ## Limitations
175
 
 
22
  | **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge |
23
  | Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following |
24
  | Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning |
 
25
 
26
+ This card covers the base model only. The chat and reasoning stages are separate models with their own
27
+ repos and cards.
28
 
29
  ## Architecture
30
 
 
105
 
106
  ![Intelligence per parameter](intelligence_per_param.png)
107
 
108
+ The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
109
+ sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon, and well clear of
110
+ the per-param underachievers lower in the field. It does this on roughly 1/65th the tokens of the leading
111
  models, which train on about 25B.
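
As a rough illustration of what "σ above the size-fit line" means, the sketch below fits board average against log-params by ordinary least squares and reports one model's residual in units of the residual standard deviation. The function name `size_fit_sigma`, the base-10 log, the population standard deviation, and the toy numbers are all assumptions for illustration; the card does not specify how the fit was actually computed.

```python
import math

def size_fit_sigma(params, scores, target_params, target_score):
    """Residual of a target model from a least-squares fit of
    score = a + b * log10(params), in units of the residual std.

    Assumptions (not from the card): base-10 log, population std.
    """
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Spread of the fitted models around the line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    # Signed distance of the target from the line, in sigma units.
    target_resid = target_score - (intercept + slope * math.log10(target_params))
    return target_resid / sigma

# Toy board (made-up numbers, not leaderboard data).
toy_params = [1e6, 1e7, 1e8, 1e9]
toy_scores = [30.0, 36.0, 38.0, 44.0]
z = size_fit_sigma(toy_params, toy_scores, 1e7, 36.0)
print(f"{z:+.2f} sigma")
```

A positive value means the model scores above what its parameter count predicts under the fit; a negative value means below.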
112
 
113
  Training loss and data mix, v6 vs v9:
 
129
  Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
130
  length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
131
 
132
+ "The board" throughout is the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)
133
+ (AxiomicLabs, sub-150M tier). Zero-shot, limit 1000 examples per task.
134
+ Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
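
The formula above can be sketched in a few lines of Python. The function name `board_avg` and the input values are illustrative assumptions, not scores from this card. The point is that ARC-Easy and ARC-Challenge are first averaged into a single ARC slot, so the final mean is over four slots rather than five tasks.

```python
def board_avg(hellaswag, arc_easy, arc_challenge, piqa, arithmark):
    """Board average: ARC-E and ARC-C share one slot, then four slots are averaged."""
    arc = (arc_easy + arc_challenge) / 2
    return (hellaswag + arc + piqa + arithmark) / 4

# Illustrative inputs only (not scores from the card).
print(board_avg(30.0, 40.0, 20.0, 50.0, 30.0))  # 35.0
```

Because the two ARC tasks share one slot, a one-point gain on ARC-Easy moves the board average by 0.125 points, while a one-point gain on HellaSwag, PIQA, or ArithMark moves it by 0.25.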
135
 
136
  | Task | Metric | Score | Random |
137
  |---|---|---|---|
 
149
  | CommonsenseQA | acc | 20.7 | 20 |
150
  | **Board avg (÷4)** | | **35.80** | |
151
 
152
+ For context — at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
153
+ 1/65th the tokens:
154
 
155
  | Model | Params | Tokens | Board avg (÷4) |
156
  |---|---|---|---|
 
160
 
161
  ![v6 benchmarks](v6_v9_metrics.png)
162
 
163
+ v6 sits above the size-fit line (~+0.3σ), above-trend per parameter and ahead of liodon. The v9 challenger
164
+ (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
165
+ v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't move the board, v6 stays the
166
+ base. From here the work moves to the capability stages (chat, reasoning).
167
 
168
  What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
169
  (67.5) beats GPT-2-124M, and PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly
 
171
  data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
172
  reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
173
  chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for
174
+ discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.
175
 
176
  ## Limitations
177