FlashLM v4-Large (Trained Longer)

Ternary (1.58-bit) language model with weights constrained to {-1, 0, +1}. Trained on 2x NVIDIA H200 GPUs.
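
For intuition, a ternary weight carries log2(3) ≈ 1.58 bits of information, hence the "1.58-bit" label. A BitNet b1.58-style absmean quantizer is one common way to project full-precision weights onto {-1, 0, +1}; the sketch below is a generic illustration, not FlashLM's actual quantization code:

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Absmean ternary quantization (BitNet b1.58-style sketch):
    scale by the mean |w|, then round each entry into {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1)
```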

Training Details

Metric           Value
---------------  ---------------------------
Architecture     FlashLM v4 "Bolt"
Parameters       16.8M (ternary)
Hidden dim       384
Blocks           8
GLU hidden dim   1024
Sequence length  512
Vocab size       10,000
Dataset          TinyStories (~474M tokens)
Tokens seen      2.16B (~4.5 epochs)
Best val loss    1.675
Training time    ~1.5 hours total (2x H200)
Throughput       ~600K tokens/s
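
As a rough consistency check on these figures: 474M tokens × ~4.5 epochs ≈ 2.1B, matching the 2.16B tokens seen, and 2.16B tokens at ~600K tokens/s works out to about 3,600 s (~1 hour) of pure throughput, consistent with the reported ~1.5-hour wall clock (the remainder presumably spent on evaluation and checkpointing).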

Comparison

Model                          Val loss
-----------------------------  --------
FlashLM v4-small (4.3M, CPU)   2.10
FlashLM v4-large (16.8M, GPU)  1.675
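
Assuming these are per-token cross-entropy losses in nats, they correspond to perplexities of roughly e^2.10 ≈ 8.2 and e^1.675 ≈ 5.3, i.e. the GPU-trained model's perplexity is about a third lower.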

Files

  • flashlm_v4_best.pt - Best model weights (state_dict only)
  • flashlm_v4_final.pt - Final checkpoint (weights + optimizer + metadata); see the loading sketch after this list
  • flashlm_v4_step_10000.pt - Mid-training checkpoint
  • tokenizer.json - Vocabulary mapping
  • train.py - Training script (modified for multi-GPU)
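
The two checkpoint formats load differently: the best-model file is a bare state_dict, while the final checkpoint wraps weights, optimizer state, and metadata in a single dict. A minimal sketch (the checkpoint's key layout is undocumented here, so inspect it before relying on any names):

```python
import torch

# Bare state_dict: load directly into a matching model instance.
state_dict = torch.load("flashlm_v4_best.pt", map_location="cpu")

# Bundled checkpoint: weights + optimizer + metadata in one dict.
ckpt = torch.load("flashlm_v4_final.pt", map_location="cpu")
print(ckpt.keys())  # check the actual key names before unpacking
```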

Usage
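
No inference snippet ships with this card, so the following is a minimal hedged sketch. It assumes train.py exposes a model class matching the checkpoint (the name FlashLM and the constructor keywords below are placeholders; check train.py for the real API) and that tokenizer.json is a plain token-to-id mapping:

```python
import json
import torch

# Placeholder import: the actual class name and module layout live in train.py.
from train import FlashLM

# Hyperparameters from the training table above; keyword names are assumed.
model = FlashLM(vocab_size=10_000, dim=384, n_blocks=8,
                glu_hidden=1024, seq_len=512)
model.load_state_dict(torch.load("flashlm_v4_best.pt", map_location="cpu"))
model.eval()

# tokenizer.json is described only as a "vocabulary mapping"; the
# token -> id schema assumed here may differ from the actual file.
with open("tokenizer.json") as f:
    token_to_id = json.load(f)
id_to_token = {i: t for t, i in token_to_id.items()}
```

Generation from here depends on the model's forward signature; a greedy decode loop over the 512-token context is the obvious starting point.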

Credits

Based on FlashLM by Cheng Chang. Inspired by BitNet b1.58 and MatMul-free LM.
