Paper: [Scalable MatMul-free Language Modeling](https://arxiv.org/abs/2406.02528)
A ternary (1.58-bit) language model with weights constrained to {-1, 0, +1}, trained on 2x NVIDIA H200 GPUs.
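The card does not include the quantizer itself, so here is a minimal sketch of the absmean ternary quantization popularized by BitNet b1.58 (which this model cites as inspiration); the function name `ternary_quantize` is mine:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization in the BitNet b1.58 style.

    Scales the weight tensor by its mean absolute value, then rounds and
    clips every entry to {-1, 0, +1}. Returns the ternary tensor plus the
    scale needed to dequantize (w is approximated by scale * w_ternary).
    """
    scale = w.abs().mean().clamp(min=eps)         # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)  # entries in {-1, 0, +1}
    return w_ternary, scale

# Example: quantize a random weight matrix and check its value set.
w = torch.randn(384, 1024)
w_q, scale = ternary_quantize(w)
print(sorted(w_q.unique().tolist()))  # [-1.0, 0.0, 1.0]
```

With ternary weights, the matrix multiplications in the linear layers reduce to additions and subtractions, which is what makes the MatMul-free formulation possible.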
| Metric | Value |
|---|---|
| Architecture | FlashLM v4 "Bolt" |
| Parameters | 16.8M (ternary) |
| Hidden dim | 384 |
| Blocks | 8 |
| GLU hidden | 1024 |
| Seq length | 512 |
| Vocab size | 10,000 |
| Dataset | TinyStories (~474M tokens) |
| Tokens seen | 2.16B (~4.5 epochs) |
| Best val loss | 1.675 |
| Training time | ~1.5 hours total (2x H200) |
| Speed | ~600K tok/s |
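As a sanity check, these numbers are consistent with a back-of-envelope parameter count. The block structure below is an assumption on my part (tied input/output embeddings, three d×d token-mixing matrices per block as in the MatMul-free LM's recurrent mixer, and a three-matrix GLU), not something the card states:

```python
# Rough parameter count from the table above (structure assumed, see text).
d, h, n_blocks, vocab = 384, 1024, 8, 10_000

embed = vocab * d             # 3,840,000 (assumed shared with the output head)
mixer = 3 * d * d * n_blocks  # 3,538,944 (three d x d matrices per block)
glu   = 3 * d * h * n_blocks  # 9,437,184 (gate, up, and down projections)

print(f"{(embed + mixer + glu) / 1e6:.1f}M")  # 16.8M, matching the table
```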
For comparison with the smaller CPU-trained variant:

| Model | Val Loss |
|---|---|
| FlashLM v4-small (4.3M, CPU) | 2.10 |
| FlashLM v4-large (16.8M, GPU) | 1.675 |
Files in this repository:

- `flashlm_v4_best.pt` - Best model weights (state_dict only)
- `flashlm_v4_final.pt` - Final checkpoint (weights + optimizer + metadata)
- `flashlm_v4_step_10000.pt` - Mid-training checkpoint
- `tokenizer.json` - Vocabulary mapping
- `train.py` - Training script (modified for multi-GPU)

Based on FlashLM by Cheng Chang. Inspired by BitNet b1.58 and MatMul-free LM.
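A minimal sketch for inspecting the two main checkpoints above; the key names queried from the final checkpoint are assumptions, so print the real ones before relying on them (see `train.py` for the actual structure):

```python
import torch

# flashlm_v4_best.pt holds only a state_dict: inspect the tensors, then
# load them into an instance of the model class defined in train.py.
state = torch.load("flashlm_v4_best.pt", map_location="cpu")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))

# flashlm_v4_final.pt bundles weights + optimizer state + metadata.
ckpt = torch.load("flashlm_v4_final.pt", map_location="cpu")
print(ckpt.keys())  # reveals the actual key names stored by train.py
model_state = ckpt.get("model", ckpt.get("model_state_dict"))  # assumed keys
```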