FlashLM v3 (13M parameters)

A small, CPU-trained language model using a novel MatMul-free architecture with ternary weights.

Architecture

  • Type: ConvMixer + TernaryGLU with shared recursive blocks
  • Parameters: 13.6M total
  • d_model: 256
  • Recursions: 2 (shared weights)
  • Token mixer: Causal dilated Conv1D (3 layers, dilations 1/4/64)
  • FFN: TernaryGLU (expansion 2.67x, ReLU squared activation)
  • Embeddings: GPT-2 SVD-projected to 256 dimensions
  • Tokenizer: GPT-2 (50,257 vocab)
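The token mixer above replaces attention with depthwise causal convolutions at growing dilations. A minimal NumPy sketch of one such layer follows; the function name, loop-based implementation, and kernel size are illustrative assumptions, not the model's actual code.

```python
import numpy as np

def causal_dilated_depthwise_conv(x, w, dilation):
    """Depthwise causal 1D convolution (illustrative sketch).

    x: (T, d) sequence of d-dimensional token vectors
    w: (k, d) one length-k filter per channel (depthwise: no channel mixing)
    Left-pads with zeros so position t only sees positions <= t (causal).
    Stacking layers with dilations 1, 4, 64 (as in the model card) widens
    the receptive field without any matrix multiplication.
    """
    T, d = x.shape
    k = w.shape[0]
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, d)), x], axis=0)  # causal left-pad
    y = np.zeros_like(x)
    for t in range(T):
        for j in range(k):
            # tap j reads the token (k-1-j)*dilation steps in the past
            y[t] += w[j] * xp[pad + t - (k - 1 - j) * dilation]
    return y
```

Because the filter only reaches backward in time, changing a token can never affect outputs at earlier positions, which is what makes the stack usable as a causal language model.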

Training

  • Dataset: FineWeb-Edu (sample-10BT), 32M tokens from 30k documents
  • Hardware: CPU only (Deepnote, 2 threads)
  • Training time: ~1.2 hours
  • Steps: 4,050 (three phases: sequence length 64, then 128, then 256)
  • Optimizer: NorMuon (2D weights) + AdamW (embeddings, biases)
  • LR schedule: Warmup-Stable-Decay (WSD)
  • Best validation loss: 6.80 (cross-entropy)
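The Warmup-Stable-Decay schedule ramps the learning rate up, holds it flat for most of training, then decays it at the end. A minimal sketch follows; the warmup and decay fractions and the linear decay shape are assumptions, since the card does not specify them.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.2):
    """Warmup-Stable-Decay learning-rate schedule (illustrative sketch).

    Linear warmup for the first warmup_frac of steps, a flat plateau at
    peak_lr, then a linear decay to 0 over the final decay_frac of steps.
    The fractions here are assumed values, not from the model card.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps      # linear warmup
    if step < decay_start:
        return peak_lr                                  # stable plateau
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))
```

The appeal of WSD over cosine decay is that the plateau phase can be extended or checkpointed freely, with only the short decay tail needing to be rerun to get a fully annealed model.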

Key Features

  • Ternary weights (-1, 0, +1) in Conv and GLU layers via Straight-Through Estimator
  • MatMul-free token mixing using depthwise causal convolutions
  • Position subsampling (25%) during training for speed
  • Pretrained GPT-2 embeddings projected via SVD
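Ternary quantization maps each weight to {-1, 0, +1} in the forward pass; since that mapping has zero gradient almost everywhere, the Straight-Through Estimator (STE) backpropagates as if it were the identity. A minimal NumPy sketch, with an assumed fixed threshold (real implementations often scale it per layer):

```python
import numpy as np

def ternarize(w, threshold=0.05):
    """Forward pass: snap weights to {-1, 0, +1}.

    Weights with magnitude below `threshold` become 0 (sparsity);
    the rest keep only their sign. The threshold value here is an
    assumption for illustration.
    """
    return np.sign(w) * (np.abs(w) > threshold)

def ste_grad(grad_out, w, clip=1.0):
    """Backward pass via the Straight-Through Estimator.

    Gradients flow through the quantizer as if it were the identity,
    zeroed where |w| exceeds `clip` so weights cannot drift unboundedly.
    """
    return grad_out * (np.abs(w) <= clip)
```

Training keeps full-precision "shadow" weights that the optimizer updates with `ste_grad`; only the ternarized copies are used in the forward pass, so matmuls against them reduce to additions, subtractions, and skips.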

Limitations

  • Small model (13M params) with limited generation quality
  • Trained for only 4k steps on 32M tokens
  • No attention mechanism; relies entirely on causal convolutions for context
  • Best suited as a research prototype demonstrating CPU-trainable architectures

License

MIT
