Kotodama 108M Base

A 108M parameter decoder-only transformer trained as a proxy model for validating architectural and optimizer choices before scaling to 3B parameters. This is a research artifact, not a production model.

The model combines three techniques not previously studied together at this scale:

  • Block Attention Residuals (AttnRes) -- learned residual connections across transformer blocks that prevent BOS-sink attention collapse and produce 4x gradient uniformity across depth.
  • NCA pre-pretraining -- bootstrapping attention circuits using Neural Cellular Automata trajectories before language training, which trains attention patterns (not MLPs) and creates an L14 attractor basin in the representation manifold.
  • Muon optimizer -- spectral-norm steepest descent via Newton-Schulz orthogonalization, producing 2-4x higher stable rank than AdamW at matched loss, with Gram-NS optimized coefficients.

Organization: aethera-gp
Training code: github.com/LuxiaSL/kotodama

Architecture

The model uses a Llama-family architecture with QK-norm and Block Attention Residuals.

Parameter Value
Parameters 107.8M (+ 58.4K AttnRes)
Hidden size 512
Layers 28
Query heads 4
KV heads 2 (GQA ratio 2:1)
Head dim 128
Intermediate size (SwiGLU) 1408
Vocabulary 49,152 (SmolLM2 tokenizer)
Max context 4,096 tokens
Positional encoding RoPE (theta=500,000)
Normalization Pre-RMSNorm + QK-norm
Embeddings Tied input/output
Bias None
z-loss 1e-5
AttnRes block boundaries [0, 3, 7, 12, 21, 25] (DD-v1)

Block Attention Residuals (DD-v1)

AttnRes adds per-layer learned pseudo-queries and key norms that create residual connections between block boundaries. The DD-v1 configuration divides the 28-layer network into 6 variable-size blocks at layers [0, 3, 7, 12, 21, 25]. This adds only 58.4K parameters (0.05% overhead) but has substantial effects on training dynamics.

Each transformer block stores:

  • attn_res_query / attn_res_norm: attention sub-block residual
  • mlp_res_query / mlp_res_norm: MLP sub-block residual

A final pair, final_res_query / final_res_norm, aggregates the block outputs before the LM head.
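The parameter count is consistent with one (query, norm) vector pair of width 512 per sub-block: (28 layers x 2 sub-blocks + 1 final) = 57 pairs x 2 x 512 = 58,368 ≈ 58.4K. A speculative sketch of how such a pseudo-query could aggregate cached block-boundary outputs (the actual wiring lives in the kotodama code; function and argument names here are illustrative assumptions):

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm with a learned per-channel scale (the *_res_norm vector).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def attn_res_readout(block_outputs, res_query, res_norm_weight):
    """Hypothetical AttnRes read-out: a learned pseudo-query scores each
    cached block-boundary output and returns their softmax-weighted sum.

    block_outputs: (num_blocks, batch, seq, hidden)
    res_query, res_norm_weight: (hidden,) learned vectors
    """
    normed = rms_norm(block_outputs, res_norm_weight)
    scores = torch.einsum("nbsh,h->nbs", normed, res_query)
    weights = torch.softmax(scores / normed.shape[-1] ** 0.5, dim=0)
    return torch.einsum("nbs,nbsh->bsh", weights, block_outputs)
```

Because the weights are a softmax over blocks, the read-out reduces to the identity when all cached block outputs agree, so it can only add information when blocks diverge.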

Differences from stock Llama

  • QK-norm: RMSNorm on Q and K projections after linear projection, enabling higher learning rates
  • z-loss: LSE-squared regularization preventing logit explosion
  • Smaller vocab (49K vs 128K): reduces the Godey gradient bottleneck (~94% gradient destruction at hidden size 3072 with the 49K vocab vs ~98% with 128K, for the 3B target)
  • Block AttnRes: cross-block residual connections (see above)
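The z-loss above penalizes the squared log of the softmax normalizer, keeping logits from drifting upward. A minimal sketch assuming the PaLM-style formulation (mean over tokens), with the 1e-5 weight from the architecture table:

```python
import numpy as np

def z_loss(logits, weight=1e-5):
    """PaLM-style z-loss: weight * mean(logsumexp(logits)^2) over tokens.
    Keeps the softmax normalizer Z near 1, preventing logit explosion."""
    # logits: (num_tokens, vocab)
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return weight * np.mean(lse ** 2)
```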

Training

Optimizer Configuration

Hybrid Muon + AdamW: Muon handles 2D weight matrices (Q/K/V/O projections, FFN gate/up/down -- ~77% of parameters), AdamW handles everything else (embeddings, norms).
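The routing rule can be sketched as a parameter-group split; how the repo actually identifies tied embeddings is an assumption (here, by name):

```python
def split_param_groups(named_params):
    """Route parameters per the hybrid scheme: Muon for 2-D weight
    matrices (excluding embeddings), AdamW for everything else.
    A sketch; the training repo's actual routing may differ."""
    muon, adamw = [], []
    for name, p in named_params:
        if p.ndim == 2 and "embed" not in name:
            muon.append((name, p))
        else:
            adamw.append((name, p))
    return muon, adamw
```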

Parameter Muon (2D weights) AdamW (embeddings, norms)
Learning rate 0.02 6e-4
Momentum / betas 0.95 (Nesterov) (0.9, 0.95)
Weight decay 0.01 0.1
NS iterations 5 (Gram-NS coefficients) --
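The Newton-Schulz step orthogonalizes each momentum update so every singular direction receives a similarly sized step. A sketch of the 5-step quintic iteration; the coefficients below are the widely published Muon defaults, not the Gram-NS-optimized set this run used:

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G (map its singular values toward 1)
    via the quintic Newton-Schulz iteration used in Muon. Coefficients
    are the standard published Muon values, given here for illustration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix A small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After five iterations, all singular values land near 1, which is the mechanism behind the high stable ranks reported in the analysis section.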

Schedule: WSD (Warmup-Stable-Decay). 5,000 step warmup (~6%), stable plateau to 90% of training, cosine decay over final 10%.
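Using the numbers above (5,000-step linear warmup, stable plateau, cosine decay over the final 10%), the schedule can be sketched as:

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_steps=5000, decay_frac=0.10):
    """Warmup-Stable-Decay: linear warmup, flat plateau, cosine decay
    over the final decay_frac of training."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```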

Gradient clipping: 1.0

Precision: BF16 autocast with FP8 compute (FP32 optimizer states).

NCA Pre-Pretraining

Before language training, attention weights were bootstrapped using NCA (Neural Cellular Automata) pre-pretraining following Han et al. (2026). An NCA checkpoint co-trained with AttnRes DD-v1 (seed-17, 852M tokens) was used as initialization. After NCA, embeddings were reinitialized to the language vocabulary while attention weights, MLPs, and norms from NCA training were preserved (embed-only reinit).
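The embed-only reinit amounts to a state-dict merge: keep NCA-trained attention, MLP, and norm weights, but take the embedding (and, via tying, the LM head) from a fresh language-vocabulary init. A sketch; the key name is an assumption about the checkpoint layout:

```python
def embed_only_reinit(nca_state, fresh_state,
                      embed_key="model.embed_tokens.weight"):
    """Merge two state dicts: everything from the NCA checkpoint except
    the embedding, which comes from a fresh language-vocab init."""
    merged = dict(nca_state)
    merged[embed_key] = fresh_state[embed_key]
    return merged
```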

Data Mix (Fullcorpus)

170.4B tokens from 13 sources, shuffled with seed 42, sequence length 4096.

Source Tokens % Category
peS2o 60.7B 35.6% Academic papers (Semantic Scholar)
OpenCoderReasoning 35.7B 21.0% Code reasoning (R1 + QwQ, Python/C++)
Pile of Law 18.8B 11.0% Legal (court opinions, congressional)
StackExchange 15.7B 9.2% Q&A (22 high-value sites)
OpenWebMath 14.1B 8.2% Math web pages
FineMath 10.8B 6.4% Quality-scored math (4+ score)
PG-19 7.5B 4.4% Books (Project Gutenberg, 71K)
Wikipedia 5.0B 3.0% English Wikipedia
SmolTalk 0.9B 0.6% Synthetic multi-turn dialogue
WildChat 0.5B 0.3% Real user-GPT conversations
SODA 0.3B 0.2% Synthetic social dialogue
Enron 0.3B 0.2% Corporate email
OASST2 0.01B <0.1% Human multi-turn conversations

Category breakdown: Academic/knowledge 38.6%, code reasoning 21.0%, math 14.6%, legal 11.0%, Q&A 9.2%, books 4.4%, conversation 1.1%, email 0.2%.

Hardware and Compute

  • Hardware: 8x NVIDIA B200 (single node, NVLink)
  • Parallelism: DDP (DistributedDataParallel)
  • Throughput: ~1.96M tokens/sec average
  • Micro batch size: 16 per GPU
  • Global batch size: 2,097,152 tokens (16 micro-batch * 4,096 seq len * 8 GPUs * 4 gradient-accumulation steps)
  • torch.compile: enabled (4x throughput vs eager)

Model Variants

This repository contains two checkpoints from the same model lineage:

fc-base (fullcorpus)

File: fc-base.pt.zst

The primary pretraining run. 170.4B tokens over 81,252 steps on the full 13-source data mix described above. Initialized from NCA+AttnRes checkpoint (seed-17, 852M NCA tokens). WSD schedule with cosine decay in the final 10%.

Metric Value
Final loss 2.081
Min loss 1.982 (step 80,200)
Final perplexity 8.01
Tokens seen 170.4B
Tokens/param ratio ~1,581x

bcpt-base (books-CPT)

File: bcpt-base.pt.zst

Continued pretraining of the fullcorpus model on 36.2B tokens of book data from three Common Pile sources not present in the original data mix. Resumed from fullcorpus step 72,000 (pre-decay, 151B tokens seen) with fresh optimizer state and a new WSD schedule (500-step warmup, 90% stable, 10% cosine decay).

Source Tokens %
Pre-1929 Books (Internet Archive/HathiTrust) 19.1B 52.8%
Library of Congress 14.0B 38.7%
DOAB (Open Access Books) 3.1B 8.6%

OCR quality filter applied: documents with >5% garbage characters dropped.

Metric Value
Final loss 2.342
Min loss 2.230 (step 17,260)
Final perplexity 10.40
Additional tokens 36.4B (17,337 steps)
Total tokens seen ~187.4B (resumed from step 72K / 151B tokens)

The higher loss/perplexity relative to fullcorpus reflects the domain shift to OCR book text, not regression. The books-CPT variant trades general benchmark performance for improved performance on literary and long-form text.

Evaluation

LM-Eval Benchmarks

All benchmarks run zero-shot via lm-evaluation-harness.

Benchmark Metric fc-base bcpt-base
ARC-Easy acc 0.455 0.445
ARC-Easy acc_norm 0.387 0.388
BoolQ acc 0.559 0.499
COPA acc 0.590 0.590
HellaSwag acc 0.277 0.280
HellaSwag acc_norm 0.297 0.295
LAMBADA acc 0.281 0.297
LAMBADA ppl 83.3 85.5
PIQA acc 0.577 0.588
PIQA acc_norm 0.569 0.571
SciQ acc 0.783 0.779
SciQ acc_norm 0.700 0.685
WikiText word_ppl 41.76 52.09
WikiText bits/byte 1.007 1.066
Winogrande acc 0.508 0.515

Notes: These are proxy-scale (108M) results, in line with expectations for this size -- the model was not designed to maximize benchmarks. The books-CPT variant shows slight gains on commonsense/physical reasoning (PIQA, Winogrande, LAMBADA accuracy) and slight degradation on knowledge-heavy tasks (BoolQ, WikiText perplexity), consistent with the domain shift toward literary text.

Analysis Highlights

The primary value of this model as a research artifact is the geometric monitoring data collected during training. The analysis packages in fc-analysis/ and bcpt-analysis/ contain activation geometry, concept geometry, and full metric histories.

Geometric Health (Final Checkpoint)

Monitored at layers [0, 7, 14, 21, 27] throughout training.

Metric Value Interpretation
RankMe (embedding) 440.5 High effective dimensionality (out of 512)
RankMe rebound ratio 15.9x Strong recovery from early collapse (min 27.7 at step 150)
WeightWatcher alpha 7.71 Within Muon-healthy range (see notes)
TwoNN intrinsic dim 5.76 Representation manifold dimensionality
Dead units 0.0% No dead neurons at any monitored layer
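RankMe here follows the standard definition: the exponential of the Shannon entropy of the normalized singular-value spectrum of a representation matrix. A minimal sketch, which may differ in detail from the repo's estimator:

```python
import numpy as np

def rankme(Z, eps=1e-12):
    """RankMe: exp of the Shannon entropy of the normalized singular
    values of Z. Ranges from 1 (rank collapse) to min(Z.shape)."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))
```

A value of 440.5 out of a possible 512 means the embedding spectrum is close to flat, i.e. nearly all directions carry comparable energy.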

Stable Rank Profiles Across Depth

Stable rank (effective rank of weight matrices) remains high across all layers throughout training, a signature of Muon's balanced spectral updates. Representative values from the final checkpoint (step 81,225):

Layer Q proj K proj O proj Gate proj Down proj
0 18.7 15.7 46.3 127.0 56.8
7 42.5 40.0 87.9 76.8 140.4
14 49.1 41.5 43.1 70.2 125.0
21 39.4 30.0 67.9 62.9 49.2
27 43.8 32.3 115.3 76.2 127.8

Key observations:

  • No low-rank collapse: All weight matrices maintain high stable rank through 170B tokens. Under AdamW, these values would typically be 2-4x lower.
  • Depth utilization: Non-monotonic stable rank profile indicates all layers are actively contributing (not degenerating into near-identity transformations).
  • Zero dead units: No layer shows any dead neurons, even after extreme overtraining (1,581x tokens/parameter).
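For reference, stable rank is defined as ||W||_F^2 / ||W||_2^2 -- the sum of squared singular values over the largest one -- a smooth surrogate for matrix rank that the table values above measure:

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2: robust effective rank,
    insensitive to many tiny singular values."""
    s = np.linalg.svd(W, compute_uv=False)  # sorted descending
    return float((s ** 2).sum() / (s[0] ** 2))
```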

Attention Entropy Across Depth

Layer Mean Entropy Std Interpretation
0 6.13 0.43 Broad attention (early feature mixing)
7 4.64 0.77 Selective attention with variance
14 5.49 0.41 Moderate selectivity
21 5.68 0.29 Moderate, low variance
27 4.14 0.79 Most selective (prediction heads)

This gradient -- broad at the bottom, selective at the top -- is the healthy pattern. Crucially, the deep layers (L27) maintain diverse attention patterns (std=0.79) rather than collapsing to BOS-sink. In baseline models without AttnRes, layers 21-27 develop 89-90% BOS attention concentration by this training stage.
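For scale: uniform attention over 4,096 keys would give ln 4096 ≈ 8.3 nats, so the 4-6 nat values above reflect genuine selectivity rather than collapse. A minimal estimator for a single attention distribution (per-layer table values would average this over heads and positions):

```python
import numpy as np

def attention_entropy(attn_row, eps=1e-12):
    """Shannon entropy (nats) of one attention distribution.
    0 = all mass on one key (e.g. BOS-sink); ln(n) = uniform."""
    p = attn_row / (attn_row.sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))
```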

Anisotropy Profile

Layer Anisotropy
0 0.066
7 0.452
14 0.413
21 0.148
27 0.090

The inverted-U anisotropy profile (low at edges, peaking at middle layers) indicates structured representational geometry rather than isotropy collapse or extreme anisotropy.

AttnRes Effects (from Proxy Phase Ablations)

These findings come from the 5-run optimizer sweep at 6B tokens and the full 170B run:

  • BOS-sink prevention: Baseline models develop 89-90% BOS attention at deep layers by 6B tokens. DD-v1 AttnRes prevents this entirely, maintaining diverse attention patterns at all depths.
  • 4x gradient uniformity: Gradient norm variance across layers is ~4x lower with AttnRes, enabling more uniform learning across depth.
  • Full depth utilization: Without AttnRes, deep layers tend toward near-identity transformations. With AttnRes, stable rank and attention entropy remain diverse at all depths.
  • DD-v2 fragility: Shifting even one block boundary (L12 to L14) produced 12/16 geometric metrics outside the range of all other configurations. Variable-size blocks cascade nonlinearly.

NCA Pre-Pretraining Effects

  • Trains attention, not MLPs: NCA pre-pretraining primarily structures attention weight matrices. MLP weights show minimal structured change, confirming that MLP reinit after NCA is correct.
  • L14 attractor basin: NCA creates a distinctive geometric signature at layer 14 that persists through full language training. This basin is present regardless of AttnRes configuration.
  • Sub-additive with AttnRes: NCA + AttnRes produces only +0.008 nats over the better of either alone, but preserves geometric properties from both techniques everywhere in the network.

Key Findings (Proxy Phase)

  1. Muon lr=0.02 is the Pareto optimum for 108M: matches AdamW final loss while maintaining 2-4x higher stable rank across all weight matrices.
  2. torch.compile is the dominant throughput optimization, providing a 4x improvement over eager mode. Liger kernels without FusedLinearCE reduced compiled throughput by 13%.
  3. Extreme overtraining (1,581x tokens/param) does not cause geometric collapse with Muon + AttnRes. Stable rank, attention entropy, and dead unit counts all remain healthy at 170B tokens.
  4. WW alpha healthy range is higher for Muon than AdamW. Alpha values of 7-8 are normal for Muon-trained models; do not apply AdamW-calibrated thresholds (which would flag these as unhealthy).

Usage

The checkpoints are stored as compressed PyTorch state dicts (.pt.zst). To load:

import torch
import zstandard as zstd
import io

# Decompress
with open("fc-base.pt.zst", "rb") as f:
    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(f.read())

# Load state dict
state_dict = torch.load(io.BytesIO(decompressed), map_location="cpu", weights_only=True)

# Initialize model (requires the kotodama training code)
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    qk_norm=True,
    tie_word_embeddings=True,
    z_loss_weight=1e-5,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
model.load_state_dict(state_dict)

Tokenizer: HuggingFaceTB/SmolLM2-135M (49,152 vocab, byte-fallback).

Repository Contents

fc-base.pt.zst              # Fullcorpus final checkpoint (81,252 steps, 170.4B tokens)
bcpt-base.pt.zst             # Books-CPT checkpoint (17,337 additional steps, 36.4B tokens)
fc-analysis/                 # Fullcorpus analysis package
  activation_geometry/       # Per-layer activation extractions
  concept_geometry/          # Concept-level geometric analysis
  lm_eval/                   # Full lm-evaluation-harness results
  report.html                # Analysis report
bcpt-analysis/               # Books-CPT analysis package (same structure)
fc-metrics.jsonl             # Fullcorpus training metrics (loss, LR, throughput)
fc-geo_metrics.jsonl         # Fullcorpus geometric monitoring (stable rank, entropy, etc.)
bcpt-metrics.jsonl           # Books-CPT training metrics
bcpt-geo_metrics.jsonl       # Books-CPT geometric monitoring

Limitations

  • 108M proxy scale. This model exists to validate architecture and optimizer choices, not to be useful for downstream tasks. Benchmark performance reflects this.
  • No raw code in training data. The 645GB cleaned stack_v1 JSONL (~126B tokens, 130 languages) was never tokenized and is absent from the data mix. The model sees code only through reasoning traces (OpenCoderReasoning) and Q&A (StackExchange).
  • Conversational data < 1.2%. The original spec targeted 25% conversational data. The actual mix is dominated by academic text (35.6%) and code reasoning (21.0%).
  • OCR noise in books-CPT. Despite filtering documents with >5% garbage characters, the books-CPT data (pre-1929 scans, Library of Congress) contains residual OCR artifacts.
  • No deduplication was applied to the books-CPT data (estimated minimal cross-source overlap between digitization projects, but not verified).
  • Eval methodology: top-p sampling catastrophically degrades generation quality at 108M scale, so all generation-based evaluation uses pure temperature sampling.
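The pure temperature sampling used for generation can be sketched as follows (no nucleus/top-p truncation; rescale logits, softmax, draw one token):

```python
import numpy as np

def temperature_sample(logits, temperature=1.0, rng=None):
    """Pure temperature sampling: divide logits by T, softmax, sample.
    T -> 0 approaches greedy decoding; T = 1 is the raw distribution."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    z -= z.max()  # numerical stability before exponentiation
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))
```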

Citation

@misc{kotodama2026,
  title={Kotodama: Block Attention Residuals and NCA Pre-Pretraining for Transformer Language Models},
  author={Aethera GP},
  year={2026},
  url={https://huggingface.co/aethera-gp/kotodama-108m-base}
}
