Paper: [Scalable MatMul-free Language Modeling](https://arxiv.org/abs/2406.02528)
A ternary (1.58-bit) language model with weights constrained to {-1, 0, +1}, trained on 2x NVIDIA H200 GPUs.
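The card does not include the quantizer itself, so here is a minimal sketch of the absmean ternary quantization popularized by BitNet b1.58 (which this model cites as inspiration); the function name `ternary_quantize` is mine:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization in the BitNet b1.58 style.

    Scales the weight tensor by its mean absolute value, then rounds and
    clips every entry to {-1, 0, +1}. Returns the ternary tensor plus the
    scale needed to dequantize (w is approximated by scale * w_ternary).
    """
    scale = w.abs().mean().clamp(min=eps)         # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)  # entries in {-1, 0, +1}
    return w_ternary, scale

# Example: quantize a random weight matrix and check its value set.
w = torch.randn(384, 1024)
w_q, scale = ternary_quantize(w)
print(sorted(w_q.unique().tolist()))  # [-1.0, 0.0, 1.0]
```

With ternary weights, the matrix multiplications in the linear layers reduce to additions and subtractions, which is what makes the MatMul-free formulation possible.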
| Metric | Value |
|---|---|
| Architecture | FlashLM v4 "Bolt" |
| Parameters | 16.8M (ternary) |
| Hidden dim | 384 |
| Blocks | 8 |
| GLU hidden | 1024 |
| Seq length | 512 |
| Vocab size | 10,000 |
| Dataset | TinyStories (~474M tokens) |
| Tokens seen | 2.16B (~4.5 epochs) |
| Best val loss | 1.675 |
| Training time | ~1.5 hours total (2x H200) |
| Speed | ~600K tok/s |
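As a sanity check, these numbers are consistent with a back-of-envelope parameter count. The block structure below is an assumption on my part (tied input/output embeddings, three d×d token-mixing matrices per block as in the MatMul-free LM's recurrent mixer, and a three-matrix GLU), not something the card states:

```python
# Rough parameter count from the table above (structure assumed, see text).
d, h, n_blocks, vocab = 384, 1024, 8, 10_000

embed = vocab * d             # 3,840,000 (assumed shared with the output head)
mixer = 3 * d * d * n_blocks  # 3,538,944 (three d x d matrices per block)
glu   = 3 * d * h * n_blocks  # 9,437,184 (gate, up, and down projections)

print(f"{(embed + mixer + glu) / 1e6:.1f}M")  # 16.8M, matching the table
```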
For comparison with the smaller CPU-trained variant:

| Model | Val Loss |
|---|---|
| FlashLM v4-small (4.3M, CPU) | 2.10 |
| FlashLM v4-large (16.8M, GPU) | 1.675 |
Files in this repository:

- `flashlm_v4_best.pt` - Best model weights (state_dict only)
- `flashlm_v4_final.pt` - Final checkpoint (weights + optimizer + metadata)
- `flashlm_v4_step_10000.pt` - Mid-training checkpoint
- `tokenizer.json` - Vocabulary mapping
- `train.py` - Training script (modified for multi-GPU)

Based on FlashLM by Cheng Chang. Inspired by BitNet b1.58 and MatMul-free LM.
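A minimal sketch for inspecting the two main checkpoints above; the key names queried from the final checkpoint are assumptions, so print the real ones before relying on them (see `train.py` for the actual structure):

```python
import torch

# flashlm_v4_best.pt holds only a state_dict: inspect the tensors, then
# load them into an instance of the model class defined in train.py.
state = torch.load("flashlm_v4_best.pt", map_location="cpu")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))

# flashlm_v4_final.pt bundles weights + optimizer state + metadata.
ckpt = torch.load("flashlm_v4_final.pt", map_location="cpu")
print(ckpt.keys())  # reveals the actual key names stored by train.py
model_state = ckpt.get("model", ckpt.get("model_state_dict"))  # assumed keys
```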