bfuzzy1/Rodan-Base

@@ -22,10 +22,9 @@ that actually holds up for its size, scored on how much it gets per parameter ra
 | **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge |
 | Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following |
 | Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning |
-| Rodan-10M-Latent *(planned)* | latent reasoning | in-head compute, no CoT tokens |
-This card covers the base model only. The chat, reasoning, and latent stages are separate models with their
-own repos and cards.
 ## Architecture
@@ -106,9 +105,9 @@ Intelligence per parameter (board avg vs log-params; the shaded region is above
 ![Intelligence per parameter](intelligence_per_param.png)
-The fit runs over all 54 board models, with a residual σ of 3.07 that matches the board's own. Rodan v6
-sits +0.31σ above the size-fit line, ahead of liodon at +0.14 and well clear of the per-param
-underachievers like GPT-2 (124M, far below). It does this on roughly 1/50th the tokens of the leading
 models, which train on about 25B.
 Training loss and data mix, v6 vs v9:
@@ -130,7 +129,9 @@ capability stages rather than more base pretraining. Unique tokens stay around 0
 Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
 length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
-Zero-shot, limit 1000 examples per task. Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
 | Task | Metric | Score | Random |
 |---|---|---|---|
@@ -148,7 +149,8 @@ Zero-shot, limit 1000 examples per task. Board avg = (HellaSwag + (ARC-E + ARC-C
 | CommonsenseQA | acc | 20.7 | 20 |
 | **Board avg (÷4)** | | **35.80** | |
-For context, it beats the <10M leader on about 1/65th the tokens:
 | Model | Params | Tokens | Board avg (÷4) |
 |---|---|---|---|
@@ -158,10 +160,10 @@ For context, it beats the <10M leader on about 1/65th the tokens:
 ![v6 benchmarks](v6_v9_metrics.png)
-v6 lands around rank 22 of 54 and +0.31σ above the size-fit line, ahead of liodon at +0.14. The v9
-challenger (PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, and about even on
-per-param too (~+0.32σ). v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't
-move the board, v6 stays the base. From here the work moves to the capability stages.
 What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
 (67.5) beats GPT-2-124M, and PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly
@@ -169,7 +171,7 @@ above random. Arithmetic has crept off the random floor (ArithMark 26.4) thanks
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
 chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for
-discrimination; the deeper reasoning is the job of the later reasoning and latent stages.
 ## Limitations

 | **Rodan-10M-Base** | pretraining | foundation: commonsense + knowledge |
 | Rodan-10M-Chat *(released)* | instruction fold | chat / instruction following |
 | Rodan-10M-Reasoning *(released)* | recursive depth + CoT fold + DPO | verifiable math + reasoning |
+This card covers the base model only. The chat and reasoning stages are separate models with their own
+repos and cards.
 ## Architecture
 ![Intelligence per parameter](intelligence_per_param.png)
+The fit runs over the board models, with a residual σ of about 3.07 that matches the board's own. Rodan v6
+sits roughly +0.3σ above the size-fit line — above-trend per parameter, ahead of liodon, and well clear of
+the per-param underachievers lower in the field. It does this on roughly 1/65th the tokens of the leading
 models, which train on about 25B.
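
The "σ above the size-fit line" figure can be read as a standardized residual: fit board average against log-params over the board models, then divide one model's residual by the spread of all residuals. A minimal sketch of that reading, assuming an ordinary least-squares line on log10(params) and a population standard deviation. The card does not state the exact fit, the function names are hypothetical, and the rows in `example_board` are made-up placeholders rather than leaderboard data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sigma_above_fit(models, name):
    """Standardized residual of `name` against the fit over all models.

    `models` maps a model name to (param_count, board_avg).
    """
    xs = [math.log10(params) for params, _ in models.values()]
    ys = [score for _, score in models.values()]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Population standard deviation; OLS residuals already have mean zero.
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    params, score = models[name]
    return (score - (slope * math.log10(params) + intercept)) / sigma

# Placeholder rows for illustration only; not real leaderboard entries.
example_board = {
    "model_a": (1e6, 30.0),
    "model_b": (1e7, 34.0),
    "model_c": (1e8, 36.0),
    "model_d": (1e9, 40.0),
}
```

Calling `sigma_above_fit(example_board, "model_b")` returns a positive value because that placeholder row sits above the fitted line; a model below the line would come out negative.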
 Training loss and data mix, v6 vs v9:
 Zero-shot through lm-eval-harness, with a custom MLX backend for `loglikelihood`. We use acc_norm for the
 length-sensitive multiple-choice tasks (HellaSwag, ARC, OpenBookQA) and plain acc otherwise.
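
The metric choice matters because a summed log-likelihood tends to favor shorter answer choices. A rough sketch of the two scoring rules, assuming acc takes the argmax of the raw log-likelihood and acc_norm first divides each log-likelihood by the character length of its choice. `pick_choice` and the numbers below are illustrative only and are not part of the harness API.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the best-scoring answer choice.

    With normalize=True each log-likelihood is divided by the character
    length of its choice, which removes the advantage of short answers.
    """
    scores = [
        ll / len(text) if normalize else ll
        for ll, text in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative values: the long answer has a lower total log-likelihood
# but a better per-character score.
choices = ["yes", "it falls to the ground"]
lls = [-4.0, -9.0]

raw_pick = pick_choice(lls, choices)                    # acc-style scoring
norm_pick = pick_choice(lls, choices, normalize=True)   # acc_norm-style scoring
```

Here the raw rule picks the short choice and the normalized rule picks the long one, which is the kind of disagreement that shows up on tasks with uneven answer lengths.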
+"The board" throughout is the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)
+(AxiomicLabs, sub-150M tier). Zero-shot, limit 1000 examples per task.
+Board avg = (HellaSwag + (ARC-E + ARC-C)/2 + PIQA + ArithMark) / 4.
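
The board average formula above can be written as a small helper. This is a sketch of the stated formula only; the function name is hypothetical and the inputs in the example call are placeholder values, not this model's scores.

```python
def board_avg(hellaswag, arc_easy, arc_challenge, piqa, arithmark):
    """Board average: the two ARC splits are averaged into one term first,
    so five task scores contribute four equally weighted terms."""
    arc = (arc_easy + arc_challenge) / 2
    return (hellaswag + arc + piqa + arithmark) / 4

# Placeholder inputs for illustration only.
example = board_avg(40.0, 30.0, 20.0, 50.0, 10.0)  # (40 + 25 + 50 + 10) / 4
```

Because ARC-Easy and ARC-Challenge are averaged before the final division, each carries half the weight of HellaSwag, PIQA, or ArithMark.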
 | Task | Metric | Score | Random |
 |---|---|---|---|
 | CommonsenseQA | acc | 20.7 | 20 |
 | **Board avg (÷4)** | | **35.80** | |
+For context — at 11.46M it's just over the 10M line, but it outscores the sub-10M leader (liodon) on about
+1/65th the tokens:
 | Model | Params | Tokens | Board avg (÷4) |
 |---|---|---|---|
 ![v6 benchmarks](v6_v9_metrics.png)
+v6 sits above the size-fit line (~+0.3σ) — above-trend per parameter, ahead of liodon. The v9 challenger
+(PLE-free, 10.41M, pure-curated) tied it: 35.70 board avg at 9% fewer params, about even on per-param too.
+v9 confirmed the ~11M ceiling and that PLE was dead weight, but since it didn't move the board, v6 stays the
+base. From here the work moves to the capability stages (chat, reasoning).
 What the model is actually like: it holds up well for 11M on commonsense and science multiple-choice. SciQ
 (67.5) beats GPT-2-124M, and PIQA (56.0), ARC-Easy (35.6), HellaSwag (31.8), and COPA (55.0) are all clearly
 data, though it's a modest lift and actually generating arithmetic is still weak. On the harder abstract
 reasoning tasks (Winogrande, CommonsenseQA, ARC-Challenge, OpenBookQA) and on open-ended generation it's near
 chance, partly a capacity ceiling at this size and partly loglikelihood length-bias. It's a solid base for
+discrimination; the deeper reasoning is the job of the separate Chat and Reasoning models.
 ## Limitations