Kotodama 108M Instruct

A 108M parameter instruction-tuned transformer trained with full-parameter SFT on freeform dialogue data. This is not an assistant-shaped model -- the SFT objective is learning turn structure (ChatML delimiters, turn-taking cadence) rather than instruction-following or helpfulness optimization.

Two instruct variants are provided, corresponding to SFT on each of the two base model checkpoints:

| Variant | File | Base | Eval Loss | im_end@1 | Overfit Gap |
|---|---|---|---|---|---|
| Fullcorpus Instruct | fc-instruct.pt | fullcorpus-ddv1 step 81252 | 2.401 | 0.446 | 0.070 |
| Books-CPT Instruct | bcpt-instruct.pt | books-cpt step 17336 | 2.517 | 0.397 | 0.102 |

Both were trained with identical SFT data and hyperparameters. The fullcorpus variant wins on absolute metrics; the books-CPT variant exhibits superior gradient properties during training.

Base Model

Both variants build on kotodama-108m-base, a from-scratch Llama-family transformer trained with the Muon optimizer and Block Attention Residuals (AttnRes).

Proxy architecture (108M):

| Parameter | Value |
|---|---|
| d_model | 512 |
| n_layers | 28 |
| Query heads | 4 |
| KV heads | 2 (GQA 2:1) |
| head_dim | 128 |
| FFN intermediate | 1408 (SwiGLU) |
| Vocab size | 49152 (SmolLM2 tokenizer) |
| Max position | 4096 (RoPE, theta=500K) |
| Normalization | RMSNorm + QK-norm |
| Tied embeddings | Yes |
| Bias | None |
| z-loss | 1e-5 |
| AttnRes | DD-v1, boundaries [0, 3, 7, 12, 21, 25] |
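
As a rough sanity check, the 108M headline can be reproduced from this table. A sketch that ignores norm and AttnRes routing parameters, which contribute only a small fraction:

# Rough parameter count from the proxy architecture above.
d_model, n_layers, vocab = 512, 28, 49152
q_heads, kv_heads, head_dim, ffn = 4, 2, 128, 1408

embed = vocab * d_model                           # tied with the LM head, counted once
attn = (d_model * (q_heads * head_dim)            # Q projection
        + 2 * d_model * (kv_heads * head_dim)     # K and V projections
        + (q_heads * head_dim) * d_model)         # O projection
mlp = 3 * d_model * ffn                           # SwiGLU: gate, up, down
total = embed + n_layers * (attn + mlp)
print(f"{total / 1e6:.1f}M")                      # ~107.7M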

The fullcorpus base was pretrained on 170.4B tokens (13 sources, academic/code-reasoning/math/legal/books/conversation). The books-CPT variant continued pretraining on 36.4B tokens of public domain books (Common Pile: Internet Archive, Library of Congress, DOAB).

SFT Data

8.1M total tokens (5.5M trainable, 68.4% trainable ratio), 6,187 conversations split 90/10 train/eval. Pretokenized with ChatML template, chunked at turn boundaries to fit 4096 seq_len, packed with first-fit-decreasing bin packing into 1,976 fixed-length bins.
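
The packing step is plain first-fit-decreasing bin packing. A minimal sketch of the idea, assuming each example is already a list of token IDs and ignoring the attention-mask bookkeeping the real pipeline also does:

def pack_ffd(examples, max_seq_len=4096):
    """First-fit-decreasing: sort examples by length (longest first), then
    place each one into the first bin with enough remaining capacity."""
    bins, remaining = [], []
    for ex in sorted(examples, key=len, reverse=True):
        for i, free in enumerate(remaining):
            if len(ex) <= free:
                bins[i].append(ex)
                remaining[i] -= len(ex)
                break
        else:
            bins.append([ex])
            remaining.append(max_seq_len - len(ex))
    return bins  # each bin holds whole examples totalling <= max_seq_len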

| Source | Conversations | Est. Tokens | Description |
|---|---|---|---|
| Infinite Backrooms | 821 | ~5.7M | Model-to-model freeform dialogue between Claude instances |
| OASST2 top-ranked | 5,366 | ~2.8M | Human multi-turn conversations (rank==0, English only) |

Data philosophy. The SFT corpus is deliberately composed of freeform dialogue rather than instruction-following data. Infinite Backrooms conversations (scraped from dreams-of-an-electric-mind.webflow.io) capture two Claude instances in unstructured, extended conversation across 19 scenario types and multiple model generations (Opus 3, Sonnet 3.5, Opus 4, Sonnet 4, Sonnet 4.5). OASST2 contributes genuine human conversational patterns via the highest-ranked response path through each conversation tree.

Explicitly excluded: Alpaca, SlimOrca, UltraChat (too assistant-shaped), ShareGPT/WildChat (noisy, refusal artifacts), SODA/SmolTalk (already in pretraining data).

Processing details:

  • Backrooms: actor names discovered dynamically per conversation; first speaker mapped to user, second to assistant. OOC preamble stripped, ANSI escape codes stripped, conversations with fewer than 3 turns dropped.
  • OASST2: re-extracted from HuggingFace raw data (not the curation pipeline output, which lost rank metadata). Follows rank==0 path at each tree branch.
  • Chunking: 571 conversations exceeded 4096 tokens and were split at turn boundaries into 1,705 non-overlapping chunks; only 1/6,703 examples was truncated after chunking. A minimal version of this splitting rule is sketched below.
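
A sketch of the turn-boundary splitting rule, assuming a conversation is represented as a list of tokenized turns (the helper name is illustrative, not the repo's):

def chunk_at_turn_boundaries(turns, max_seq_len=4096):
    """Split one tokenized conversation into chunks that fit max_seq_len,
    cutting only between turns, never inside one."""
    chunks, current, length = [], [], 0
    for turn in turns:                  # turn = token IDs for one message
        if current and length + len(turn) > max_seq_len:
            chunks.append(current)
            current, length = [], 0
        current.append(turn)
        length += len(turn)
    if current:
        chunks.append(current)
    return chunks  # a single over-long turn can still exceed the cap, hence the lone truncation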

Training

Hyperparameter Sweep

An 18-config sweep was run across both base models: 12 configs for fullcorpus (4 learning rates x 3 epoch counts) and 6 for books-CPT (3 learning rates x 2 epoch counts). All configs used a flat LR schedule (warmup 5%, wsd_decay_start: 1.0 -- no decay phase).

Winner for both bases: Muon lr=3e-3 (AdamW lr=3e-4), 2 epochs.

Selection criteria: lowest eval loss with overfit ratio below 1.05 and highest im_end@1 among non-overfitting configs.
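
One way to encode that rule (a sketch; result rows are assumed to be dicts keyed like the sweep tables below):

def pick_winner(results, max_overfit_ratio=1.05):
    """Drop overfitting configs, then rank by eval loss (ties broken by im_end@1)."""
    ok = [r for r in results if r["overfit_ratio"] < max_overfit_ratio]
    return min(ok, key=lambda r: (r["eval_loss"], -r["im_end@1"]))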

Winning Config

# Shared across both variants
muon_lr: 0.003
adamw_lr: 0.0003
num_epochs: 2
batch_size: 4           # per GPU
gradient_accumulation: 1
max_seq_len: 4096
bf16: true
max_grad_norm: 1.0
warmup_ratio: 0.05
wsd_decay_start: 1.0    # flat LR, no decay
muon_momentum: 0.95
muon_weight_decay: 0.01
muon_ns_iterations: 5
muon_ns_coefficients: gram_ns
adamw_betas: [0.9, 0.95]
adamw_weight_decay: 0.1
packed: true             # FFD bin-packed with block-diagonal SDPA masks
attn_res: true
attn_res_boundaries: [0, 3, 7, 12, 21, 25]
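
The packed: true setting relies on block-diagonal SDPA masks so that examples sharing a bin cannot attend to each other. A minimal sketch of such a mask, assuming the per-example lengths within a bin are known from packing (an illustration, not the repo's exact masking code):

import torch

def packed_causal_mask(lengths, device="cpu"):
    """Boolean SDPA mask for one packed bin: a token may attend only to
    earlier tokens of the same packed example (block-diagonal AND causal).
    True = attention allowed; pass as attn_mask to F.scaled_dot_product_attention."""
    seg = torch.repeat_interleave(
        torch.arange(len(lengths), device=device),
        torch.tensor(lengths, device=device),
    )                                              # segment id per token position
    same_example = seg[:, None] == seg[None, :]
    causal = torch.tril(torch.ones(len(seg), len(seg), dtype=torch.bool, device=device))
    return same_example & causal

# e.g. a 4096-token bin holding examples of length 1000, 2000, 1096
mask = packed_causal_mask([1000, 2000, 1096])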

AttnRes routing weights were frozen during SFT -- only the base model parameters were updated. The Muon optimizer handles all 2D weight matrices; AdamW handles embeddings, layer norms, and AttnRes parameters.
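
A sketch of that parameter split; the name-matching heuristics ("attn_res", "embed") are assumptions about the parameter naming, not the exact strings the repo uses:

muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if "attn_res" in name:
        p.requires_grad_(False)      # routing weights frozen during SFT
    elif p.ndim == 2 and "embed" not in name:
        muon_params.append(p)        # 2D weight matrices -> Muon
    else:
        adamw_params.append(p)       # embeddings (tied lm_head), norms, 1D params -> AdamW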

Hardware

  • 8x NVIDIA B200 GPUs, single node
  • HuggingFace Trainer (not DDP torchrun)
  • ~90K tokens/sec throughput
  • ~98 GiB GPU memory allocated
  • Async checkpoints with SHM staging

Variants

Comparing the two instruct variants (identical SFT applied to each of the two base checkpoints) reveals a clear tradeoff:

Fullcorpus Instruct (fc-instruct.pt)

  • Lower eval loss (2.401 vs 2.517) -- 4.6% advantage
  • Higher im_end@1 (0.446 vs 0.397) -- better turn boundary prediction
  • Less overfit (gap 0.070 vs 0.102, ratio 1.030 vs 1.042)
  • Recommended as the primary instruct variant

Books-CPT Instruct (bcpt-instruct.pt)

  • Substantially lower gradient norm variance -- more stable and uniform gradient flow across layers throughout training
  • 1.41x faster eval loss descent (mean slope -0.00321 vs -0.00227) -- learns the SFT objective more efficiently per step
  • Higher absolute loss reflects the books-CPT base starting from a different loss surface (books domain shift), not SFT quality
  • The gradient uniformity advantage from books continued pretraining survives SFT intact

Evaluation

Fullcorpus Instruct -- Training Trajectory

| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|---|---|---|---|---|---|---|
| 25 | 2.557 | 12.90 | -0.158 | 0.942 | 0.092 | 0.337 |
| 50 | 2.477 | 11.91 | -0.122 | 0.953 | 0.337 | 0.538 |
| 75 | 2.433 | 11.39 | 0.057 | 1.024 | 0.370 | 0.543 |
| 100 | 2.401 | 11.04 | 0.070 | 1.030 | 0.386 | 0.543 |
| 110 | -- | -- | -- | -- | 0.446 | 0.554 |
| 120 | -- | -- | -- | -- | 0.424 | 0.560 |

Best eval loss: 2.401 at step 100. Best im_end@1: 0.446 at step 110.

Books-CPT Instruct -- Training Trajectory

| Step | Eval Loss | Eval PPL | Overfit Gap | Overfit Ratio | im_end@1 | im_end@5 |
|---|---|---|---|---|---|---|
| 25 | 2.737 | 15.44 | -0.126 | 0.956 | 0.033 | 0.245 |
| 50 | 2.625 | 13.80 | -0.076 | 0.972 | 0.304 | 0.500 |
| 75 | 2.561 | 12.95 | 0.091 | 1.037 | 0.348 | 0.522 |
| 100 | 2.517 | 12.39 | 0.102 | 1.042 | 0.359 | 0.543 |
| 110 | -- | -- | -- | -- | 0.391 | 0.560 |
| 120 | -- | -- | -- | -- | 0.397 | 0.576 |

Best eval loss: 2.517 at step 100. Best im_end@1: 0.397 at step 120.

Metric Definitions

  • im_end@1 / im_end@5: Top-1 / top-5 accuracy of predicting the <|im_end|> token at actual turn boundaries in the eval set. Measures whether the model has learned when to stop generating within a turn.
  • im_start@1: Top-1 accuracy for <|im_start|> prediction. Near-zero for both variants (0.0 in most checkpoints), indicating the model has not learned to predict turn-start tokens -- expected given the small SFT corpus and the fact that turn starts are mostly predictable from context.
  • Overfit gap: train_loss - eval_loss. Positive values indicate overfitting.
  • Overfit ratio: train_loss / eval_loss. Values above 1.0 indicate overfitting.
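
A sketch of how im_end@k can be computed from next-token logits, assuming the eval loop knows which positions precede a true <|im_end|> (token ID 2, per the Chat Template section below):

import torch

IM_END_ID = 2  # <|im_end|> in the SmolLM2 tokenizer

def im_end_at_k(logits, boundary_positions, k=1):
    """Fraction of turn boundaries where <|im_end|> is in the top-k predictions.
    logits: [seq_len, vocab] next-token logits for one eval sequence.
    boundary_positions: indices whose *next* reference token is <|im_end|>."""
    topk = logits[boundary_positions].topk(k, dim=-1).indices   # [n_boundaries, k]
    hits = (topk == IM_END_ID).any(dim=-1)
    return hits.float().mean().item()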

SFT Sweep Results

Fullcorpus Base (12 configs)

| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|---|---|---|---|---|---|---|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.409 | 0.429 | 0.971 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.358 | 0.418 | 1.121 |
| lr1e-02-ep3 | 0.01 | 3 | 186 | 2.373 | 0.391 | 1.292 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.569 | 0.168 | 0.942 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.496 | 0.332 | 0.992 |
| lr1e-03-ep3 | 0.001 | 3 | 186 | 2.440 | 0.397 | 1.023 |
| lr3e-03-ep1 | 0.003 | 1 | 62 | 2.474 | 0.364 | 0.954 |
| lr3e-03-ep2 | 0.003 | 2 | 124 | 2.401 | 0.424 | 1.030 |
| lr3e-03-ep3 | 0.003 | 3 | 186 | 2.352 | 0.413 | 1.094 |
| lr3e-04-ep1 | 0.0003 | 1 | 62 | 2.679 | 0.022 | 0.939 |
| lr3e-04-ep2 | 0.0003 | 2 | 124 | 2.615 | 0.071 | 0.974 |
| lr3e-04-ep3 | 0.0003 | 3 | 186 | 2.556 | 0.163 | 0.992 |

lr=0.01 drives eval loss down quickly (2.358 at 2 epochs) but with severe overfitting (ratio 1.12), and lr=3e-3 at 3 epochs posts the lowest loss in the table (2.352) while also overfitting (ratio 1.09). lr=3e-4 learns too slowly -- im_end@1 barely reaches 0.16 even at 3 epochs. lr=3e-3 at 2 epochs is the Pareto optimum: strong eval loss (2.401) and im_end accuracy (0.424) with controlled overfitting (1.03).

Books-CPT Base (6 configs)

| Config | LR (Muon) | Epochs | Steps | Eval Loss | im_end@1 | Overfit Ratio |
|---|---|---|---|---|---|---|
| lr1e-02-ep1 | 0.01 | 1 | 62 | 2.506 | 0.424 | 0.986 |
| lr1e-02-ep2 | 0.01 | 2 | 124 | 2.426 | 0.424 | 1.124 |
| lr1e-03-ep1 | 0.001 | 1 | 62 | 2.758 | 0.076 | 0.964 |
| lr1e-03-ep2 | 0.001 | 2 | 124 | 2.657 | 0.293 | 1.010 |
| lr3e-03-ep1 | 0.003 | 1 | 62 | 2.620 | 0.353 | 0.972 |
| lr3e-03-ep2 | 0.003 | 2 | 124 | 2.517 | 0.397 | 1.042 |

Same pattern as fullcorpus: lr=3e-3 at 2 epochs is the best balance. lr=0.01 overfits aggressively by epoch 2.

Usage

Loading

import torch
from src.model.llama import LuxiaBaseModel, LuxiaModelConfig

# Build config (proxy architecture)
config = LuxiaModelConfig(
    hidden_size=512,
    num_layers=28,
    num_attention_heads=4,
    num_kv_heads=2,
    head_dim=128,
    intermediate_size=1408,
    vocab_size=49152,
    max_position_embeddings=4096,
    rope_theta=500000.0,
    tie_word_embeddings=True,
    attn_res=True,
    attn_res_boundaries=[0, 3, 7, 12, 21, 25],
)

model = LuxiaBaseModel(config)
state_dict = torch.load("fc-instruct.pt", map_location="cpu")
model.load_state_dict(state_dict["model"])
model.eval()

Chat Template (ChatML)

<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant

The model uses SmolLM2's tokenizer (HuggingFaceTB/SmolLM2-135M) with ChatML special tokens:

  • <|im_start|> = token 1
  • <|im_end|> = token 2
  • <|endoftext|> = token 0 (pad token)
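
These IDs can be double-checked against the tokenizer itself (the same tokenizer loaded in the inference example below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
print(tokenizer.convert_tokens_to_ids(["<|endoftext|>", "<|im_start|>", "<|im_end|>"]))
# expected: [0, 1, 2]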

Inference Example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

messages = [
    {"role": "user", "content": "Tell me about the nature of consciousness."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Autoregressive generation (LuxiaBaseModel has no .generate())
generated = input_ids.to(model.embed_tokens.weight.device)
with torch.no_grad():
    for _ in range(512):
        out = model(input_ids=generated)
        logits = out["logits"][:, -1, :] / 0.8
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == 2:  # <|im_end|>
            break

print(tokenizer.decode(generated[0], skip_special_tokens=False))

Sampling note: At 108M scale, avoid top-p sampling -- it catastrophically degrades output quality. Use pure temperature sampling only.

Limitations

  • 108M scale. This is a proxy-scale research model. It demonstrates that the architecture and training pipeline work, but the model's generation quality is fundamentally limited by parameter count. It is not suitable for production use.
  • Not assistant-shaped. The model has learned turn-taking structure but has not been trained to be helpful, harmless, or honest. It may produce incoherent, offensive, or factually incorrect outputs.
  • HF Trainer throughput. The SFT sweep used HuggingFace Trainer rather than the pretraining DDP pipeline. This was a pragmatic choice for sweep automation but means throughput (~90K tok/s) is below what the custom DDP trainer achieves.
  • No geometric probes in sweep. The pretraining pipeline includes geometric monitoring (intrinsic dimension, stable rank, attention entropy). These were not instrumented in the SFT sweep, so we cannot directly measure whether SFT preserves the geometric properties of the base models.
  • im_start accuracy near zero. The model reliably learns to predict turn-end tokens but not turn-start tokens. This is likely a consequence of the small SFT corpus size and the high predictability of turn-start positions from context.
  • Small SFT corpus. At 8.1M tokens (5.5M trainable), the SFT dataset is deliberately minimal. This is sufficient for learning turn structure but not for deep behavioral fine-tuning.
