FlashLM v4 "Bolt" β€” 4.3M Ternary Language Model Trained on CPU in 2 Hours

A 4.3M parameter language model with ternary weights (-1, 0, +1) trained from scratch on a free-tier 2-thread CPU in 2 hours. No GPU used at any point. The model generates coherent children's stories with dialogue and narrative structure.

BPC Evaluation

Bits-per-character (BPC) is a tokenizer-independent metric that fairly compares models with different vocabularies. It measures how well a model compresses raw text: BPC = total_cross_entropy_nats / (total_characters × ln(2)). Lower is better.
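
A minimal sketch of that calculation (a hypothetical helper, not the evaluation script used here):

```python
import math

def bits_per_character(total_ce_nats: float, total_characters: int) -> float:
    """Convert a cross-entropy summed over all predicted tokens (in nats)
    into bits-per-character, so models with different tokenizers compare fairly."""
    return total_ce_nats / (total_characters * math.log(2))

# Example: cross-entropy summed over the 500 validation stories,
# divided over their 405,081 characters.
# bpc = bits_per_character(total_ce_nats, 405_081)
```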

| | FlashLM v4 "Bolt" | TinyStories-1M |
|---|---|---|
| Parameters | 4.3M (ternary) | 3.7M (float32) |
| BPC | 0.8798 | 0.6182 |
| Perplexity | 15.05 | 6.72 |
| Training hardware | 2-thread CPU (free tier) | V100 GPU |
| Training time | 2 hours | Hours (GPU) |
| Tokens seen | 10.6M | ~470M |
| Architecture | Gated Conv + GLU (no attention) | GPT-Neo (attention) |
| Weight precision | Ternary (-1, 0, +1) | Float32 |

FlashLM v4 has seen only 2.3% of the training data that TinyStories-1M used. The validation loss was still declining when the 2-hour time limit was reached; the model is undertrained, not underdesigned. Extended training is planned.

Evaluated on 500 TinyStories validation stories (405,081 characters).


What Makes This Interesting

This model was trained from scratch on a free-tier 2-thread CPU notebook (Deepnote) with 5 GB of RAM. No GPU was used at any stage: not for pretraining, not for fine-tuning, not for inference.

The entire model fits in 16.7 MB and uses only addition and subtraction for its core operations during inference. Every weight in the model body is constrained to one of three values: -1, 0, or +1.

Despite seeing only 2.3% of the training data, the model generates coherent children's stories with dialogue, character interactions, and basic narrative structure.


Evolution from v3

FlashLM v4 is a complete redesign from v3. The two versions are not directly comparable by validation loss because they were trained on different datasets with different vocabularies.

| | FlashLM v3 | FlashLM v4 "Bolt" |
|---|---|---|
| Parameters | 13.6M | 4.3M |
| Vocab size | 50,257 (full GPT-2) | 10,000 (top 10K) |
| Dataset | FineWeb-Edu | TinyStories |
| Tokens seen | 32M | 10.6M |
| Training time | 1.2 hours | 2.0 hours |
| Hardware | 2-thread CPU | 2-thread CPU |
| Token mixer | Custom ternary layers | Gated causal convolution |
| Channel mixer | None | Ternary GLU (SiLU) |
| Normalization | LayerNorm | RMSNorm |
| Output head | Full 50,257 projection (86% of compute) | Weight-tied 10K projection |
| Output quality | Incoherent / random tokens | Coherent children's stories with dialogue |

What went wrong with v3

v3's biggest bottleneck was the output projection layer. With a 50,257-token vocabulary and d_model=256, the softmax head consumed 86% of all training compute, leaving the ternary model core starved. Training on FineWeb-Edu (a broad web corpus) made things worse: the data was too diverse for a 13.6M-parameter model to learn anything coherent.

What v4 changed

v4 attacked every root cause: shrink the vocabulary from 50K to 10K (eliminating the softmax bottleneck), switch to TinyStories (a focused corpus proven to work at small scale), replace the token mixer with gated causal convolutions, add a proper GLU channel mixer, and use weight-tied embeddings. The result: a model 3x smaller that produces coherent text instead of gibberish.


Architecture

FlashLM v4 "Bolt" is a non-transformer sequential language model that replaces attention with gated causal convolutions and uses ternary (1.58-bit) weights throughout.

Overview

Input Token IDs
       │
       ▼
┌──────────────┐
│  Embedding   │  10,000 × 192 (float, weight-tied with output head)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  BoltBlock   │  × 6 (each block identical structure, independent weights)
│              │
│ ┌──────────┐ │
│ │ RMSNorm  │ │
│ │ GatedConv│ │  ← Ternary causal depthwise conv (kernel=8) + gating
│ │ +residual│ │
│ ├──────────┤ │
│ │ RMSNorm  │ │
│ │TernaryGLU│ │  ← Ternary gated linear unit (SiLU activation)
│ │ +residual│ │
│ └──────────┘ │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   RMSNorm    │
│  Output Head │  Weight-tied to embedding (float)
└──────┬───────┘
       │
       ▼
   Logits (10,000 vocab)

Component Details

BitLinear (Ternary Weights)

All linear projections in the model body use ternary quantization with a straight-through estimator:

alpha = mean(|W|)
W_ternary = clamp(round(W / alpha), -1, +1) × alpha

During training, gradients flow through the quantization via the straight-through estimator (STE). During inference, weights can be stored as 2-bit integers and all multiplications become additions, subtractions, or zeros.
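
A minimal PyTorch sketch of such a layer, using the per-tensor scale from the formula above (class and variable names are illustrative, not necessarily the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly.
    The forward pass uses the quantized weights; the straight-through
    estimator routes gradients back to the full-precision master weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.abs().mean()                              # per-tensor scale
        w_q = torch.clamp(torch.round(w / alpha), -1, 1) * alpha
        w_ste = w + (w_q - w).detach()                      # STE: forward w_q, backward w
        return F.linear(x, w_ste, self.bias)
```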

GatedConvMixer (Token Mixing)

Instead of self-attention, FlashLM v4 uses a gated causal depthwise convolution:

gv = BitLinear_up(x)           # project to 2*dim
gate, val = split(gv)          # split into gate and value
gate = sigmoid(gate)
h = CausalDepthwiseConv1D(val, kernel_size=8)
output = BitLinear_down(h * gate)

This gives the model a receptive field of 8 tokens per layer, or 48 tokens across all 6 layers. The operation is O(T) in sequence length with no quadratic attention cost.
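
A sketch of what this mixer could look like in PyTorch, reusing the BitLinear sketch above (the bias-free projections and padding details are assumptions):

```python
class GatedConvMixer(nn.Module):
    """Token mixer: sigmoid-gated causal depthwise convolution."""

    def __init__(self, dim: int, kernel_size: int = 8):
        super().__init__()
        self.kernel_size = kernel_size
        self.up = BitLinear(dim, 2 * dim, bias=False)     # project to 2*dim (ternary)
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)
        self.down = BitLinear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        gate, val = self.up(x).chunk(2, dim=-1)
        gate = torch.sigmoid(gate)
        # Left-pad by kernel_size - 1 so each position only sees past tokens.
        v = F.pad(val.transpose(1, 2), (self.kernel_size - 1, 0))
        h = self.conv(v).transpose(1, 2)
        return self.down(h * gate)
```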

TernaryGLU (Channel Mixing)

The feed-forward network uses a gated linear unit with ternary weights:

output = BitLinear_down(SiLU(BitLinear_gate(x)) * BitLinear_up(x))

Dimensions: 192 → 512 → 192 (expansion ratio ≈ 2.67x).
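
In PyTorch terms, again reusing the BitLinear sketch (a sketch with the dimensions above; names are illustrative):

```python
class TernaryGLU(nn.Module):
    """Channel mixer: SiLU-gated linear unit built from ternary projections."""

    def __init__(self, dim: int = 192, hidden: int = 512):
        super().__init__()
        self.gate = BitLinear(dim, hidden, bias=False)
        self.up = BitLinear(dim, hidden, bias=False)
        self.down = BitLinear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```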

RMSNorm

Root Mean Square Layer Normalization (no bias, no mean subtraction) before every sub-layer.
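
A standard RMSNorm implementation for reference (the epsilon value is an assumption):

```python
class RMSNorm(nn.Module):
    """Scale each feature vector by its root-mean-square; no bias, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # learned gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```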

Weight-Tied Embedding

The input embedding and output projection share the same weight matrix (float32, 10,000 × 192). Apart from the small RMSNorm gain vectors, this is the only float-precision parameter block in the model; everything else is ternary.
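
Weight tying can be expressed in a couple of lines (a sketch of the idea, not the actual module layout):

```python
vocab_size, dim = 10_000, 192
embedding = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embedding.weight   # one shared 10,000 x 192 float matrix for input and output
```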


Training Configuration

| Setting | Value |
|---|---|
| Dataset | TinyStories (streamed from HuggingFace) |
| Tokenizer | GPT-2 (tiktoken), top 10K tokens |
| Vocab size | 10,000 |
| Hidden dim | 192 |
| Blocks | 6 |
| Conv kernel | 8 |
| Sequence length | 256 |
| Batch size | 8 |
| Optimizer | AdamW (weight decay 0.01) |
| Peak learning rate | 4e-3 |
| LR schedule | Cosine with 200-step warmup |
| Gradient clipping | 1.0 |
| Total steps | 5,199 |
| Total tokens | 10,647,552 |
| Wall-clock time | 7,200 seconds (2.0 hours) |
| Hardware | 2 vCPU, 5 GB RAM (Deepnote free tier) |
| Training speed | 1,479 tokens/sec avg |
| Best val loss | 2.0976 (step 5000) |
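
For reference, the warmup-plus-cosine schedule in the table corresponds to something like the following (the minimum-LR floor is an assumption; the run may have used a different one):

```python
import math

def learning_rate(step: int, peak_lr: float = 4e-3, warmup: int = 200,
                  total_steps: int = 5_199, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay over the remaining steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```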

Training Loss Curve

| Step | Val Loss |
|---|---|
| 500 | 2.8442 |
| 1000 | 2.5828 |
| 1500 | 2.4390 |
| 2000 | 2.2577 |
| 2500 | 2.1870 |
| 3000 | 2.1345 |
| 3500 | 2.0808 |
| 4000 | 2.1490 |
| 4500 | 2.1538 |
| 5000 | 2.0976 |

The validation loss had not plateaued by the end of the run. Extended training on more powerful CPU hardware is expected to push BPC lower.


Inference

At inference time, the ternary weights mean the core model operations reduce to:

  • Weight = +1: add the activation
  • Weight = -1: subtract the activation
  • Weight = 0: skip (no operation)

No floating-point multiplications are needed in the model body. The only float operations are the embedding lookup, RMSNorm scaling, and the tied output projection.
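
As a toy illustration (not the release inference code), a ternary matrix-vector product can be written with additions and subtractions only, plus a single rescale:

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray, alpha: float) -> np.ndarray:
    """w_ternary holds only {-1, 0, +1}; each output element is a sum of the
    activations kept by +1 weights minus those hit by -1 weights, scaled once by alpha."""
    pos = np.where(w_ternary == 1, x, 0.0).sum(axis=1)    # add where weight is +1
    neg = np.where(w_ternary == -1, x, 0.0).sum(axis=1)   # subtract where weight is -1
    return alpha * (pos - neg)
```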


Sample Generations

From the final checkpoint (step 5199):

Prompt: "Once upon a time"

Once upon a time, there was a little girl named []. She loved to play outside and explore the world. One day, she wanted to go outside. She went to the [] and saw a big tree. She wanted to catch it, but the [] was too small.

Prompt: "[] and his mom went to the"

[] and his mom went to the []. They had lots of fun and [] each other. And they never gave up. Once upon a time, there was a little girl called []. She loved to explore and find new things.

([] indicates UNK tokens from the 10K vocab limitation; fixable by expanding the vocabulary)


Limitations

  • Vocabulary coverage: The 10K token vocabulary doesn't cover all words in TinyStories. Some outputs contain UNK tokens. Expanding to 15-20K tokens would fix this.
  • Training data: Only 10.6M of TinyStories' 470M tokens were seen. The model is significantly undertrained.
  • Receptive field: The 48-token effective receptive field (kernel 8 × 6 layers) limits long-range coherence compared to attention-based models.
  • Domain: Trained exclusively on children's stories. Not a general-purpose language model.

What's Next

  • Extended training on a Ryzen 7950X3D (16 cores, 96MB V-Cache, 128GB RAM) for multiple days
  • Scaled variant (~15M params, d=384, 8 blocks) with fixed tokenizer (frequency-based vocab)
  • Standalone training script release (MIT license)
  • Target: close the BPC gap with TinyStories-1M through longer training and proper vocab coverage

Citation

@misc{flashlm-v4-bolt,
  title={FlashLM v4 "Bolt": A Ternary Language Model Trained on CPU},
  author={Cheng Chang},
  year={2026},
  url={https://huggingface.co/changcheng967/flashlm-v4-bolt}
}

License

MIT
