# FlashLM v3 (13M parameters)
A small, CPU-trained language model using a novel MatMul-free architecture with ternary weights.
## Architecture
- Type: ConvMixer + TernaryGLU with shared recursive blocks (sketched after this list)
- Parameters: 13.6M total
- d_model: 256
- Recursions: 2 (shared weights)
- Token mixer: Causal dilated Conv1D (3 layers, dilations 1, 4, and 64)
- FFN: TernaryGLU (expansion 2.67x, ReLU squared activation)
- Embeddings: GPT-2 SVD-projected to 256 dimensions
- Tokenizer: GPT-2 (50,257 vocab)
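Below is a minimal PyTorch sketch of one such shared block. It is an illustrative reconstruction from the list above, not the released code: the class names, LayerNorm placement, residual wiring, and initialization are assumptions, and the ternary projections are written as ordinary float matmuls for readability (with {-1, 0, +1} weights they reduce to additions and subtractions in principle). Only d_model, the kernel dilations, the GLU expansion factor, the ReLU² activation, and the ternary/STE quantization follow the card.

```python
# Illustrative reconstruction of one shared FlashLM block (not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Quantize weights to {-1, 0, +1} (times a scale); gradients pass straight through."""
    scale = w.abs().mean().clamp(min=1e-8)
    q = torch.round((w / scale).clamp(-1, 1)) * scale
    return w + (q - w).detach()  # straight-through estimator


class CausalDepthwiseConv(nn.Module):
    """Depthwise dilated Conv1d, left-padded so position t only sees positions <= t."""
    def __init__(self, d_model: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.dilation = dilation
        self.pad = (kernel_size - 1) * dilation
        self.weight = nn.Parameter(torch.randn(d_model, 1, kernel_size) * 0.02)

    def forward(self, x):                                  # x: (batch, seq, d_model)
        x = F.pad(x.transpose(1, 2), (self.pad, 0))        # left-pad only -> causal
        x = F.conv1d(x, ternary_ste(self.weight),
                     dilation=self.dilation, groups=x.size(1))
        return x.transpose(1, 2)


class TernaryGLU(nn.Module):
    """GLU-style FFN: ternarized gate/up/down projections, ReLU-squared gate."""
    def __init__(self, d_model: int, expansion: float = 2.67):
        super().__init__()
        d_ff = int(d_model * expansion)
        self.w_gate = nn.Parameter(torch.randn(d_model, d_ff) * 0.02)
        self.w_up = nn.Parameter(torch.randn(d_model, d_ff) * 0.02)
        self.w_down = nn.Parameter(torch.randn(d_ff, d_model) * 0.02)

    def forward(self, x):
        gate = F.relu(x @ ternary_ste(self.w_gate)) ** 2   # ReLU^2 activation
        up = x @ ternary_ste(self.w_up)
        return (gate * up) @ ternary_ste(self.w_down)


class SharedBlock(nn.Module):
    """Token mixer (three dilated causal convs) + TernaryGLU, reused on every recursion."""
    def __init__(self, d_model: int = 256, dilations=(1, 4, 64)):
        super().__init__()
        self.mixers = nn.ModuleList([CausalDepthwiseConv(d_model, 3, d) for d in dilations])
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = TernaryGLU(d_model)

    def forward(self, x, recursions: int = 2):
        for _ in range(recursions):                        # same weights reused each pass
            h = self.norm1(x)
            for mixer in self.mixers:
                h = mixer(h)
            x = x + h                                      # token-mixing residual
            x = x + self.ffn(self.norm2(x))                # channel-mixing residual
        return x
```

Because the same block weights are reused on each recursion, the model gains effective depth without adding parameters, which is how it stays at roughly 13.6M parameters.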
## Training
- Dataset: FineWeb-Edu (sample-10BT), 32M tokens from 30k documents
- Hardware: CPU only (Deepnote, 2 threads)
- Training time: ~1.2 hours
- Steps: 4,050 (curriculum phases: sequence length 64 → 128 → 256)
- Optimizer: NorMuon (2D weights) + AdamW (embeddings, biases)
- LR schedule: Warmup-Stable-Decay (WSD); see the sketch below
- Best validation loss: 6.80 (cross-entropy)
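A WSD schedule holds the learning rate flat between a short warmup and a final decay. The sketch below only illustrates the shape: the total step count (4,050) comes from this card, while the peak learning rate, warmup length, decay length, and LR floor are assumptions.

```python
# Shape of a Warmup-Stable-Decay (WSD) learning-rate schedule (illustrative values).
def wsd_lr(step: int, max_lr: float = 3e-3, total_steps: int = 4050,
           warmup: int = 200, decay: int = 800, min_lr_ratio: float = 0.1) -> float:
    if step < warmup:                           # linear warmup
        return max_lr * (step + 1) / warmup
    if step < total_steps - decay:              # stable phase: constant LR
        return max_lr
    # final phase: linear decay down to min_lr_ratio * max_lr
    frac = max((total_steps - step) / decay, 0.0)
    return max_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * frac)
```

At each step the returned value would be written into every optimizer parameter group (for both NorMuon and AdamW) before calling `step()`.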
## Key Features
- Ternary weights (-1, 0, +1) in Conv and GLU layers via Straight-Through Estimator
- MatMul-free token mixing using depthwise causal convolutions
- Position subsampling (25%) during training for speed
- Pretrained GPT-2 embeddings projected via SVD (sketched below)
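GPT-2's token embeddings are 768-dimensional, so they must be reduced to the model's 256-dimensional space. The sketch below shows one plausible way to do that with a truncated SVD; the exact projection, rescaling, and whether the result is frozen or fine-tuned are not specified in this card, and the use of the `transformers` package to fetch the embeddings is also an assumption.

```python
# One plausible SVD projection of GPT-2's 768-dim token embeddings down to 256 dims.
import torch
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")
wte = gpt2.wte.weight.detach()                  # (50257, 768) token embedding table

U, S, V = torch.svd_lowrank(wte, q=256)         # rank-256 truncated SVD
projected = wte @ V                             # (50257, 256), equals U * diag(S)
projected = projected / projected.std()         # rescale for stable training (assumption)

# The projected table can then initialize the model's 256-dim embedding layer:
embedding = torch.nn.Embedding.from_pretrained(projected, freeze=False)
```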
## Limitations
- Small model (13M params) with limited generation quality
- Trained for only 4k steps on 32M tokens
- No attention mechanism; relies on causal convolutions for context
- Best suited as a research prototype demonstrating CPU-trainable architectures
## License
MIT