FlashLM v3 (13M parameters)

A small, CPU-trained language model using a novel MatMul-free architecture with ternary weights.

Architecture

  • Type: ConvMixer + TernaryGLU with shared recursive blocks
  • Parameters: 13.6M total
  • d_model: 256
  • Recursions: 2 (shared weights)
  • Token mixer: Causal dilated Conv1D (3 layers, dilations 1/4/64)
  • FFN: TernaryGLU (expansion 2.67x, ReLU squared activation)
  • Embeddings: GPT-2 SVD-projected to 256 dimensions
  • Tokenizer: GPT-2 (50,257 vocab)
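The token mixer above replaces attention with depthwise causal convolutions at growing dilations. A minimal NumPy sketch of one such layer follows; the function name, loop-based implementation, and kernel size are illustrative assumptions, not the model's actual code.

```python
import numpy as np

def causal_dilated_depthwise_conv(x, w, dilation):
    """Depthwise causal 1D convolution (illustrative sketch).

    x: (T, d) sequence of d-dimensional token vectors
    w: (k, d) one length-k filter per channel (depthwise: no channel mixing)
    Left-pads with zeros so position t only sees positions <= t (causal).
    Stacking layers with dilations 1, 4, 64 (as in the model card) widens
    the receptive field without any matrix multiplication.
    """
    T, d = x.shape
    k = w.shape[0]
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, d)), x], axis=0)  # causal left-pad
    y = np.zeros_like(x)
    for t in range(T):
        for j in range(k):
            # tap j reads the token (k-1-j)*dilation steps in the past
            y[t] += w[j] * xp[pad + t - (k - 1 - j) * dilation]
    return y
```

Because the filter only reaches backward in time, changing a token can never affect outputs at earlier positions, which is what makes the stack usable as a causal language model.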

Training

  • Dataset: FineWeb-Edu (sample-10BT), 32M tokens from 30k documents
  • Hardware: CPU only (Deepnote, 2 threads)
  • Training time: ~1.2 hours
  • Steps: 4,050 (three phases: sequence length 64, then 128, then 256)
  • Optimizer: NorMuon (2D weights) + AdamW (embeddings, biases)
  • LR schedule: Warmup-Stable-Decay (WSD)
  • Best validation loss: 6.80 (cross-entropy)
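The Warmup-Stable-Decay schedule ramps the learning rate up, holds it flat for most of training, then decays it at the end. A minimal sketch follows; the warmup and decay fractions and the linear decay shape are assumptions, since the card does not specify them.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.2):
    """Warmup-Stable-Decay learning-rate schedule (illustrative sketch).

    Linear warmup for the first warmup_frac of steps, a flat plateau at
    peak_lr, then a linear decay to 0 over the final decay_frac of steps.
    The fractions here are assumed values, not from the model card.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps      # linear warmup
    if step < decay_start:
        return peak_lr                                  # stable plateau
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))
```

The appeal of WSD over cosine decay is that the plateau phase can be extended or checkpointed freely, with only the short decay tail needing to be rerun to get a fully annealed model.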

Key Features

  • Ternary weights (-1, 0, +1) in Conv and GLU layers via Straight-Through Estimator
  • MatMul-free token mixing using depthwise causal convolutions
  • Position subsampling (25%) during training for speed
  • Pretrained GPT-2 embeddings projected via SVD
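Ternary quantization maps each weight to {-1, 0, +1} in the forward pass; since that mapping has zero gradient almost everywhere, the Straight-Through Estimator (STE) backpropagates as if it were the identity. A minimal NumPy sketch, with an assumed fixed threshold (real implementations often scale it per layer):

```python
import numpy as np

def ternarize(w, threshold=0.05):
    """Forward pass: snap weights to {-1, 0, +1}.

    Weights with magnitude below `threshold` become 0 (sparsity);
    the rest keep only their sign. The threshold value here is an
    assumption for illustration.
    """
    return np.sign(w) * (np.abs(w) > threshold)

def ste_grad(grad_out, w, clip=1.0):
    """Backward pass via the Straight-Through Estimator.

    Gradients flow through the quantizer as if it were the identity,
    zeroed where |w| exceeds `clip` so weights cannot drift unboundedly.
    """
    return grad_out * (np.abs(w) <= clip)
```

Training keeps full-precision "shadow" weights that the optimizer updates with `ste_grad`; only the ternarized copies are used in the forward pass, so matmuls against them reduce to additions, subtractions, and skips.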

Limitations

  • Small model (13M params) with limited generation quality
  • Trained for only 4k steps on 32M tokens
  • No attention mechanism; relies entirely on causal convolutions for context
  • Best suited as a research prototype demonstrating CPU-trainable architectures

License

MIT
