Kotodama 108M Base
A 108M parameter decoder-only transformer trained as a proxy model for validating architectural and optimizer choices before scaling to 3B parameters. This is a research artifact, not a production model.
The model combines three techniques not previously studied together at this scale:
- Block Attention Residuals (AttnRes) -- learned residual connections across transformer blocks that prevent BOS-sink attention collapse and produce 4x gradient uniformity across depth.
- NCA pre-pretraining -- bootstrapping attention circuits using Neural Cellular Automata trajectories before language training, which trains attention patterns (not MLPs) and creates an L14 attractor basin in the representation manifold.
- Muon optimizer -- spectral-norm steepest descent via Newton-Schulz orthogonalization, producing 2-4x higher stable rank than AdamW at matched loss, with Gram-NS optimized coefficients.
Organization: aethera-gp
Training code: github.com/LuxiaSL/kotodama
Architecture
The model uses a Llama-family architecture with QK-norm and Block Attention Residuals.
| Parameter | Value |
|---|---|
| Parameters | 107.8M (+ 58.4K AttnRes) |
| Hidden size | 512 |
| Layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA ratio 2:1) |
| Head dim | 128 |
| Intermediate size (SwiGLU) | 1408 |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Max context | 4,096 tokens |
| Positional encoding | RoPE (theta=500,000) |
| Normalization | Pre-RMSNorm + QK-norm |
| Embeddings | Tied input/output |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes block boundaries | [0, 3, 7, 12, 21, 25] (DD-v1) |
Block Attention Residuals (DD-v1)
AttnRes adds per-layer learned pseudo-queries and key norms that create residual connections between block boundaries. The DD-v1 configuration divides the 28-layer network into 6 variable-size blocks at layers [0, 3, 7, 12, 21, 25]. This adds only 58.4K parameters (0.05% overhead) but has substantial effects on training dynamics.
Each transformer block stores:
- attn_res_query / attn_res_norm: attention sub-block residual
- mlp_res_query / mlp_res_norm: MLP sub-block residual
A final pair, final_res_query / final_res_norm, aggregates block outputs before the LM head.
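As a sanity check on the overhead figure, here is a minimal sketch of the per-layer parameter layout, assuming each residual query and norm is a single hidden-size vector (the mechanism itself is described in Attention_Residuals.pdf in the training repo):

```python
import torch
import torch.nn as nn

HIDDEN, NUM_LAYERS = 512, 28

class AttnResParams(nn.Module):
    """Per-layer AttnRes parameters (sketch; vector shapes are an assumption)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.attn_res_query = nn.Parameter(torch.zeros(hidden_size))  # attention sub-block residual query
        self.attn_res_norm = nn.Parameter(torch.ones(hidden_size))    # attention sub-block key norm
        self.mlp_res_query = nn.Parameter(torch.zeros(hidden_size))   # MLP sub-block residual query
        self.mlp_res_norm = nn.Parameter(torch.ones(hidden_size))     # MLP sub-block key norm

per_layer = nn.ModuleList([AttnResParams(HIDDEN) for _ in range(NUM_LAYERS)])
final_res_query = nn.Parameter(torch.zeros(HIDDEN))  # aggregates block outputs before the LM head
final_res_norm = nn.Parameter(torch.ones(HIDDEN))

total = sum(p.numel() for p in per_layer.parameters()) + final_res_query.numel() + final_res_norm.numel()
print(total)  # 28 * 4 * 512 + 2 * 512 = 58,368 ~= 58.4K
```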
Differences from stock Llama
- QK-norm: RMSNorm on Q and K projections after linear projection, enabling higher learning rates
- z-loss: LSE-squared regularization preventing logit explosion (see the sketch after this list)
- Smaller vocab (49K vs 128K): reduces the Godey gradient bottleneck (~94% destruction at 3072/49K vs ~98% at 3072/128K for the 3B target)
- Block AttnRes: cross-block residual connections (see above)
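The z-loss entry above is the standard LSE-squared regularizer; a minimal sketch, assuming it is computed per token and averaged before being added to the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def z_loss(logits: torch.Tensor, weight: float = 1e-5) -> torch.Tensor:
    """LSE-squared regularizer: penalizing the squared log-partition function
    keeps logits from drifting to large magnitudes."""
    lse = torch.logsumexp(logits.float(), dim=-1)  # (batch, seq)
    return weight * (lse ** 2).mean()

logits = torch.randn(2, 16, 49152)          # (batch, seq, vocab)
labels = torch.randint(0, 49152, (2, 16))
loss = F.cross_entropy(logits.view(-1, 49152), labels.view(-1)) + z_loss(logits)
```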
Training
Optimizer Configuration
Hybrid Muon + AdamW: Muon handles 2D weight matrices (Q/K/V/O projections, FFN gate/up/down -- ~77% of parameters), AdamW handles everything else (embeddings, norms).
| Parameter | Muon (2D weights) | AdamW (embeddings, norms) |
|---|---|---|
| Learning rate | 0.02 | 6e-4 |
| Momentum / betas | 0.95 (Nesterov) | (0.9, 0.95) |
| Weight decay | 0.01 | 0.1 |
| NS iterations | 5 (Gram-NS coefficients) | -- |
Schedule: WSD (Warmup-Stable-Decay). 5,000 step warmup (~6%), stable plateau to 90% of training, cosine decay over final 10%.
Gradient clipping: 1.0
Precision: BF16 autocast with FP8 compute (FP32 optimizer states).
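A minimal sketch of the parameter split and the WSD schedule, assuming 2D weight matrices are selected by tensor rank and that the Muon implementation comes from the training repo or MoonshotAI/Muon (only the AdamW side is shown instantiated):

```python
import math
import torch

def split_params(model: torch.nn.Module):
    """Route 2D weight matrices (Q/K/V/O, FFN gate/up/down) to Muon and
    everything else (embeddings, norms, AttnRes vectors) to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params

def wsd_lambda(total_steps: int, warmup: int = 5_000, decay_frac: float = 0.10):
    """Warmup-Stable-Decay multiplier: linear warmup, flat plateau,
    cosine decay over the final 10% of training."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    def fn(step: int) -> float:
        if step < warmup:
            return step / max(1, warmup)
        if step < decay_start:
            return 1.0
        progress = (step - decay_start) / max(1, total_steps - decay_start)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return fn

# muon_params, adamw_params = split_params(model)
# adamw = torch.optim.AdamW(adamw_params, lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)
# sched = torch.optim.lr_scheduler.LambdaLR(adamw, wsd_lambda(total_steps=81_252))
# The Muon group would use lr=0.02, Nesterov momentum 0.95, weight decay 0.01, 5 NS iterations.
```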
NCA Pre-Pretraining
Before language training, attention weights were bootstrapped using NCA (Neural Cellular Automata) pre-pretraining following Han et al. (2026). An NCA checkpoint co-trained with AttnRes DD-v1 (seed-17, 852M tokens) was used as initialization. After NCA, embeddings were reinitialized to the language vocabulary while attention weights, MLPs, and norms from NCA training were preserved (embed-only reinit).
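A minimal sketch of the embed-only reinit, assuming embedding tensors can be identified by name in the NCA state dict (the "embed" substring here is illustrative, not the actual checkpoint key):

```python
import torch

def embed_only_reinit(model: torch.nn.Module, nca_state_dict: dict, embed_substr: str = "embed"):
    """Keep NCA-trained attention, MLP, and norm weights; leave the freshly
    initialized language-vocabulary embeddings of `model` untouched."""
    kept = {k: v for k, v in nca_state_dict.items() if embed_substr not in k}
    result = model.load_state_dict(kept, strict=False)
    return result.missing_keys, result.unexpected_keys
```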
Data Mix (Fullcorpus)
170.4B tokens from 13 sources, shuffled with seed 42, sequence length 4096.
| Source | Tokens | % | Category |
|---|---|---|---|
| peS2o | 60.7B | 35.6% | Academic papers (Semantic Scholar) |
| OpenCoderReasoning | 35.7B | 21.0% | Code reasoning (R1 + QwQ, Python/C++) |
| Pile of Law | 18.8B | 11.0% | Legal (court opinions, congressional) |
| StackExchange | 15.7B | 9.2% | Q&A (22 high-value sites) |
| OpenWebMath | 14.1B | 8.2% | Math web pages |
| FineMath | 10.8B | 6.4% | Quality-scored math (4+ score) |
| PG-19 | 7.5B | 4.4% | Books (Project Gutenberg, 71K) |
| Wikipedia | 5.0B | 3.0% | English Wikipedia |
| SmolTalk | 0.9B | 0.6% | Synthetic multi-turn dialogue |
| WildChat | 0.5B | 0.3% | Real user-GPT conversations |
| SODA | 0.3B | 0.2% | Synthetic social dialogue |
| Enron | 0.3B | 0.2% | Corporate email |
| OASST2 | 0.01B | <0.1% | Human multi-turn conversations |
Category breakdown: Academic/knowledge 38.6%, code reasoning 21.0%, math 14.6%, legal 11.0%, Q&A 9.2%, books 4.4%, conversation 1.1%.
Hardware and Compute
- Hardware: 8x NVIDIA B200 (single node, NVLink)
- Parallelism: DDP (DistributedDataParallel)
- Throughput: ~1.96M tokens/sec average
- Micro batch size: 16 per GPU
- Global batch size: 2,097,152 tokens (16 micro-batch * 4,096 seq len * 8 GPUs * 4 gradient-accumulation steps)
- torch.compile: enabled (4x throughput vs eager)
Model Variants
This repository contains two checkpoints from the same model lineage:
fc-base (fullcorpus)
File: fc-base.pt.zst
The primary pretraining run. 170.4B tokens over 81,252 steps on the full 13-source data mix described above. Initialized from NCA+AttnRes checkpoint (seed-17, 852M NCA tokens). WSD schedule with cosine decay in the final 10%.
| Metric | Value |
|---|---|
| Final loss | 2.081 |
| Min loss | 1.982 (step 80,200) |
| Final perplexity | 8.01 |
| Tokens seen | 170.4B |
| Tokens/param ratio | ~1,581x |
bcpt-base (books-CPT)
File: bcpt-base.pt.zst
Continued pretraining of the fullcorpus model on 36.2B tokens of book data from three Common Pile sources not present in the original data mix. Resumed from fullcorpus step 72,000 (pre-decay, 151B tokens seen) with fresh optimizer state and a new WSD schedule (500-step warmup, 90% stable, 10% cosine decay).
| Source | Tokens | % |
|---|---|---|
| Pre-1929 Books (Internet Archive/HathiTrust) | 19.1B | 52.8% |
| Library of Congress | 14.0B | 38.7% |
| DOAB (Open Access Books) | 3.1B | 8.6% |
OCR quality filter applied: documents with >5% garbage characters dropped.
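A minimal sketch of such a filter, assuming "garbage characters" means Unicode replacement characters plus non-printable controls (the actual definition used in the pipeline may differ):

```python
def garbage_ratio(text: str) -> float:
    """Fraction of characters that are the replacement char or non-printable controls."""
    if not text:
        return 1.0
    bad = sum(1 for ch in text if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\t"))
    return bad / len(text)

def keep_document(text: str, threshold: float = 0.05) -> bool:
    return garbage_ratio(text) <= threshold
```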
| Metric | Value |
|---|---|
| Final loss | 2.342 |
| Min loss | 2.230 (step 17,260) |
| Final perplexity | 10.40 |
| Additional tokens | 36.4B (17,337 steps) |
| Total tokens seen | ~187.4B (resumed from step 72K / 151B tokens) |
The higher loss/perplexity relative to fullcorpus reflects the domain shift to OCR book text, not regression. The books-CPT variant trades general benchmark performance for improved performance on literary and long-form text.
Evaluation
LM-Eval Benchmarks
All benchmarks run zero-shot via lm-evaluation-harness.
| Benchmark | Metric | fc-base | bcpt-base |
|---|---|---|---|
| ARC-Easy | acc | 0.455 | 0.445 |
| ARC-Easy | acc_norm | 0.387 | 0.388 |
| BoolQ | acc | 0.559 | 0.499 |
| COPA | acc | 0.590 | 0.590 |
| HellaSwag | acc | 0.277 | 0.280 |
| HellaSwag | acc_norm | 0.297 | 0.295 |
| LAMBADA | acc | 0.281 | 0.297 |
| LAMBADA | ppl | 83.3 | 85.5 |
| PIQA | acc | 0.577 | 0.588 |
| PIQA | acc_norm | 0.569 | 0.571 |
| SciQ | acc | 0.783 | 0.779 |
| SciQ | acc_norm | 0.700 | 0.685 |
| WikiText | word_ppl | 41.76 | 52.09 |
| WikiText | bits/byte | 1.007 | 1.066 |
| Winogrande | acc | 0.508 | 0.515 |
Notes: These are proxy-scale (108M) results. Performance is in line with expectations at this scale -- the model was not designed to maximize benchmarks. The books-CPT variant shows slight improvements on commonsense/physical reasoning (PIQA, Winogrande, LAMBADA accuracy) and slight degradation on knowledge-heavy tasks (BoolQ, WikiText perplexity), consistent with the domain shift toward literary text.
Analysis Highlights
The primary value of this model as a research artifact is the geometric monitoring data collected during training. The analysis packages in fc-analysis/ and bcpt-analysis/ contain activation geometry, concept geometry, and full metric histories.
Geometric Health (Final Checkpoint)
Monitored at layers [0, 7, 14, 21, 27] throughout training.
| Metric | Value | Interpretation |
|---|---|---|
| RankMe (embedding) | 440.5 | High effective dimensionality (out of 512) |
| RankMe rebound ratio | 15.9x | Strong recovery from early collapse (min 27.7 at step 150) |
| WeightWatcher alpha | 7.71 | Within Muon-healthy range (see notes) |
| TwoNN intrinsic dim | 5.76 | Representation manifold dimensionality |
| Dead units | 0.0% | No dead neurons at any monitored layer |
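For reference, a minimal sketch of RankMe, assuming the standard definition (exponential of the entropy of the normalized singular-value distribution), applied here to the embedding matrix:

```python
import torch

def rankme(x: torch.Tensor, eps: float = 1e-7) -> float:
    """exp(entropy of normalized singular values); at most min(x.shape)."""
    s = torch.linalg.svdvals(x.float())
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))

# e.g. rankme(embedding_weight) for the (49152, 512) tied embedding is at most 512.
```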
Stable Rank Profiles Across Depth
Stable rank (effective rank of weight matrices) remains high across all layers throughout training, a signature of Muon's balanced spectral updates. Representative values from the final checkpoint (step 81,225):
| Layer | Q proj | K proj | O proj | Gate proj | Down proj |
|---|---|---|---|---|---|
| 0 | 18.7 | 15.7 | 46.3 | 127.0 | 56.8 |
| 7 | 42.5 | 40.0 | 87.9 | 76.8 | 140.4 |
| 14 | 49.1 | 41.5 | 43.1 | 70.2 | 125.0 |
| 21 | 39.4 | 30.0 | 67.9 | 62.9 | 49.2 |
| 27 | 43.8 | 32.3 | 115.3 | 76.2 | 127.8 |
Key observations:
- No low-rank collapse: All weight matrices maintain high stable rank through 170B tokens. Under AdamW, these values would typically be 2-4x lower.
- Depth utilization: Non-monotonic stable rank profile indicates all layers are actively contributing (not degenerating into near-identity transformations).
- Zero dead units: No layer shows any dead neurons, even after extreme overtraining (1,581x tokens/parameter).
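Stable rank here is the usual Frobenius-to-spectral-norm ratio; a minimal sketch of how the per-matrix values above could be computed:

```python
import torch

def stable_rank(w: torch.Tensor) -> float:
    """||W||_F^2 / ||W||_2^2 -- large values mean spectral mass is spread
    over many directions rather than concentrated in a few."""
    w = w.float()
    fro_sq = torch.linalg.matrix_norm(w, ord="fro") ** 2
    spec_sq = torch.linalg.matrix_norm(w, ord=2) ** 2
    return float(fro_sq / spec_sq)
```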
Attention Entropy Across Depth
| Layer | Mean Entropy | Std | Interpretation |
|---|---|---|---|
| 0 | 6.13 | 0.43 | Broad attention (early feature mixing) |
| 7 | 4.64 | 0.77 | Selective attention with variance |
| 14 | 5.49 | 0.41 | Moderate selectivity |
| 21 | 5.68 | 0.29 | Moderate, low variance |
| 27 | 4.14 | 0.79 | Most selective (prediction heads) |
This gradient -- broad at the bottom, selective at the top -- is the healthy pattern. Crucially, the deep layers (L27) maintain diverse attention patterns (std=0.79) rather than collapsing to BOS-sink. In baseline models without AttnRes, layers 21-27 develop 89-90% BOS attention concentration by this training stage.
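A minimal sketch of the entropy statistic, assuming it is the Shannon entropy (in nats) of each query position's post-softmax attention row, with mean and std taken over heads and positions:

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9):
    """attn: (batch, heads, query, key) post-softmax attention weights."""
    h = -(attn * torch.log(attn + eps)).sum(dim=-1)  # per-row entropy, (batch, heads, query)
    return h.mean().item(), h.std().item()
```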
Anisotropy Profile
| Layer | Anisotropy |
|---|---|
| 0 | 0.066 |
| 7 | 0.452 |
| 14 | 0.413 |
| 21 | 0.148 |
| 27 | 0.090 |
The inverted-U anisotropy profile (low at edges, peaking at middle layers) indicates structured representational geometry rather than isotropy collapse or extreme anisotropy.
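A minimal sketch of a common anisotropy measure (mean cosine similarity between random pairs of token representations at a layer); whether this is exactly the metric behind the table above is an assumption:

```python
import torch
import torch.nn.functional as F

def anisotropy(hidden: torch.Tensor, n_pairs: int = 10_000) -> float:
    """hidden: (num_tokens, hidden_size) activations from one layer.
    ~0 means isotropic directions; ~1 means representations collapse onto a cone."""
    n = hidden.shape[0]
    i = torch.randint(0, n, (n_pairs,))
    j = torch.randint(0, n, (n_pairs,))
    return F.cosine_similarity(hidden[i], hidden[j], dim=-1).mean().item()
```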
AttnRes Effects (from Proxy Phase Ablations)
These findings come from the 5-run optimizer sweep at 6B tokens and the full 170B run:
- BOS-sink prevention: Baseline models develop 89-90% BOS attention at deep layers by 6B tokens. DD-v1 AttnRes prevents this entirely, maintaining diverse attention patterns at all depths.
- 4x gradient uniformity: Gradient norm variance across layers is ~4x lower with AttnRes, enabling more uniform learning across depth.
- Full depth utilization: Without AttnRes, deep layers tend toward near-identity transformations. With AttnRes, stable rank and attention entropy remain diverse at all depths.
- DD-v2 fragility: Shifting even one block boundary (L12 to L14) produced 12/16 geometric metrics outside the range of all other configurations. Variable-size blocks cascade nonlinearly.
NCA Pre-Pretraining Effects
- Trains attention, not MLPs: NCA pre-pretraining primarily structures attention weight matrices. MLP weights show minimal structured change, confirming that MLP reinit after NCA is correct.
- L14 attractor basin: NCA creates a distinctive geometric signature at layer 14 that persists through full language training. This basin is present regardless of AttnRes configuration.
- Sub-additive with AttnRes: NCA + AttnRes produces only +0.008 nats over the better of either alone, but preserves geometric properties from both techniques everywhere in the network.
Key Findings (Proxy Phase)
- Muon lr=0.02 is the Pareto optimum for 108M: matches AdamW final loss while maintaining 2-4x higher stable rank across all weight matrices.
- torch.compile is the dominant throughput optimization, providing a 4x improvement. Liger kernels without FusedLinearCE reduced compiled throughput by 13%.
- Extreme overtraining (1,581x tokens/param) does not cause geometric collapse with Muon + AttnRes. Stable rank, attention entropy, and dead unit counts all remain healthy at 170B tokens.
- WW alpha healthy range is higher for Muon than AdamW. Alpha values of 7-8 are normal for Muon-trained models; do not apply AdamW-calibrated thresholds (which would flag these as unhealthy).
Usage
The checkpoints are stored as compressed PyTorch state dicts (.pt.zst). To load:
import torch
import zstandard as zstd
import io

# Decompress
with open("fc-base.pt.zst", "rb") as f:
    dctx = zstd.ZstdDecompressor()
    decompressed = dctx.decompress(f.read())

# Load state dict
state_dict = torch.load(io.BytesIO(decompressed), map_location="cpu", weights_only=True)

# Initialize model (requires the kotodama training code)
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    qk_norm=True,
    tie_word_embeddings=True,
    z_loss_weight=1e-5,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)
model = LuxiaBaseModel(config)
model.load_state_dict(state_dict)
Tokenizer: HuggingFaceTB/SmolLM2-135M (49,152 vocab, byte-fallback).
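A minimal end-to-end sketch with the tokenizer, assuming LuxiaBaseModel takes a plain input_ids tensor and returns next-token logits (the actual forward signature is defined in the training repo):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
input_ids = tokenizer("The theory of general relativity", return_tensors="pt").input_ids

model.eval()
with torch.no_grad():
    logits = model(input_ids)  # assumed to return (batch, seq, vocab) logits

next_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_id]))
```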
Repository Contents
fc-base.pt.zst # Fullcorpus final checkpoint (81,252 steps, 170.4B tokens)
bcpt-base.pt.zst # Books-CPT checkpoint (17,337 additional steps, 36.4B tokens)
fc-analysis/ # Fullcorpus analysis package
  activation_geometry/ # Per-layer activation extractions
  concept_geometry/ # Concept-level geometric analysis
  lm_eval/ # Full lm-evaluation-harness results
  report.html # Analysis report
bcpt-analysis/ # Books-CPT analysis package (same structure)
fc-metrics.jsonl # Fullcorpus training metrics (loss, LR, throughput)
fc-geo_metrics.jsonl # Fullcorpus geometric monitoring (stable rank, entropy, etc.)
bcpt-metrics.jsonl # Books-CPT training metrics
bcpt-geo_metrics.jsonl # Books-CPT geometric monitoring
Limitations
- 108M proxy scale. This model exists to validate architecture and optimizer choices, not to be useful for downstream tasks. Benchmark performance reflects this.
- No raw code in training data. The 645GB cleaned stack_v1 JSONL (~126B tokens, 130 languages) was never tokenized and is absent from the data mix. The model sees code only through reasoning traces (OpenCoderReasoning) and Q&A (StackExchange).
- Conversational data < 1.2%. The original spec targeted 25% conversational data. The actual mix is dominated by academic text (35.6%) and code reasoning (21.0%).
- OCR noise in books-CPT. Despite filtering documents with >5% garbage characters, the books-CPT data (pre-1929 scans, Library of Congress) contains residual OCR artifacts.
- No deduplication was applied to the books-CPT data (estimated minimal cross-source overlap between digitization projects, but not verified).
- Eval methodology: Top-p sampling catastrophically degrades generation quality at 108M scale, so all generation evaluation uses pure temperature sampling.
Citation
@misc{kotodama2026,
  title={Kotodama: Block Attention Residuals and NCA Pre-Pretraining for Transformer Language Models},
  author={Aethera GP},
  year={2026},
  url={https://huggingface.co/aethera-gp/kotodama-108m-base}
}
References
- Block Attention Residuals: see Attention_Residuals.pdf in the training repo
- NCA Pre-Pretraining: Han et al., 2026
- Muon Optimizer: MoonshotAI/Muon; Moonlight: Muon is Scalable for LLM Training
- Gram-Newton-Schulz: Dao-AILab/Gram-Newton-Schulz
- WeightWatcher: Martin et al.