FlashLM v4 "Bolt" β 4.3M Ternary Language Model Trained on CPU in 2 Hours
A 4.3M parameter language model with ternary weights (-1, 0, +1) trained from scratch on a free-tier 2-thread CPU in 2 hours. No GPU used at any point. The model generates coherent children's stories with dialogue and narrative structure.
BPC Evaluation
Bits-per-character (BPC) is a tokenizer-independent metric that fairly compares models with different vocabularies. It measures how well a model compresses raw text: BPC = total_cross_entropy_nats / (total_characters × ln(2)). Lower is better.
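The nats-to-bits conversion above can be sketched as a small helper (an illustrative function, not the actual evaluation script):

```python
import math

def bits_per_character(total_nats: float, total_chars: int) -> float:
    """Convert summed cross-entropy (in nats) over a corpus to bits per character."""
    return total_nats / (total_chars * math.log(2))

# Sanity check: ln(2) nats per character is exactly 1 bit per character.
print(bits_per_character(math.log(2) * 1000, 1000))  # -> 1.0
```

Because the denominator counts characters rather than tokens, models with different tokenizers (here, a 10K vocab vs. GPT-Neo's full vocab) can be compared on equal footing.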
| FlashLM v4 "Bolt" | TinyStories-1M | |
|---|---|---|
| Parameters | 4.3M (ternary) | 3.7M (float32) |
| BPC | 0.8798 | 0.6182 |
| Perplexity | 15.05 | 6.72 |
| Training hardware | 2-thread CPU (free tier) | V100 GPU |
| Training time | 2 hours | Hours (GPU) |
| Tokens seen | 10.6M | ~470M |
| Architecture | Gated Conv + GLU (no attention) | GPT-Neo (attention) |
| Weight precision | Ternary (-1, 0, +1) | Float32 |
FlashLM v4 has seen only 2.3% of the training data that TinyStories-1M used. The validation loss curve was still declining when the 2-hour time limit was reached: this model is undertrained, not underdesigned. Extended training is planned.
Evaluated on 500 TinyStories validation stories (405,081 characters).
What Makes This Interesting
This model was trained from scratch on a free-tier 2-thread CPU notebook (Deepnote) with 5 GB of RAM. No GPU was used at any stage: not for pretraining, not for fine-tuning, not for inference.
The entire model fits in 16.7 MB and uses only addition and subtraction for its core operations during inference. Every weight in the model body is constrained to one of three values: -1, 0, or +1.
Despite seeing only 2.3% of the training data, the model generates coherent children's stories with dialogue, character interactions, and basic narrative structure.
Evolution from v3
FlashLM v4 is a complete redesign from v3. The two versions are not directly comparable by validation loss because they were trained on different datasets with different vocabularies.
| FlashLM v3 | FlashLM v4 "Bolt" | |
|---|---|---|
| Parameters | 13.6M | 4.3M |
| Vocab size | 50,257 (full GPT-2) | 10,000 (top 10K) |
| Dataset | FineWeb-Edu | TinyStories |
| Tokens seen | 32M | 10.6M |
| Training time | 1.2 hours | 2.0 hours |
| Hardware | 2-thread CPU | 2-thread CPU |
| Token mixer | Custom ternary layers | Gated causal convolution |
| Channel mixer | None | Ternary GLU (SiLU) |
| Normalization | LayerNorm | RMSNorm |
| Output head | Full 50,257 projection (86% of compute) | Weight-tied 10K projection |
| Output quality | Incoherent / random tokens | Coherent children's stories with dialogue |
What went wrong with v3
v3's biggest bottleneck was the output projection layer. With a 50,257-token vocabulary and d_model=256, the softmax head consumed 86% of all training compute, leaving the ternary model core starved. Training a 13.6M-parameter model on FineWeb-Edu (a broad web corpus) made things worse: the data was too diverse for a model that small to learn anything coherent.
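A back-of-envelope estimate makes the imbalance concrete. Assuming one multiply-accumulate (MAC) per weight per token for the head matmul (my numbers, not the author's profiling):

```python
# Per-token multiply-accumulates for v3's output projection alone.
d_model, vocab = 256, 50257
head_macs = d_model * vocab        # 12,865,792 MACs per token for the head
print(f"head: {head_macs:,} MACs/token")

# If the head really is 86% of total compute, the ternary core gets the rest:
core_macs = head_macs * (1 - 0.86) / 0.86
print(f"core: {core_macs:,.0f} MACs/token")
```

Cutting the vocabulary to 10K shrinks the head by roughly 5x, which is why v4's compute budget could go into the model body instead.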
What v4 changed
v4 attacked every root cause: shrink the vocabulary from 50K to 10K (eliminating the softmax bottleneck), switch to TinyStories (a focused corpus proven to work at small scale), replace the token mixer with gated causal convolutions, add a proper GLU channel mixer, and use weight-tied embeddings. The result: a model 3x smaller that produces coherent text instead of gibberish.
Architecture
FlashLM v4 "Bolt" is a non-transformer sequential language model that replaces attention with gated causal convolutions and uses ternary (1.58-bit) weights throughout.
Overview
Input Token IDs
       │
       ▼
┌──────────────┐
│  Embedding   │  10,000 × 192 (float, weight-tied with output head)
└──────┬───────┘
       │
       ▼
┌──────────────────┐
│  BoltBlock × 6   │  (each block identical structure, independent weights)
│ ┌──────────────┐ │
│ │ RMSNorm      │ │
│ │ GatedConv    │ │ ← ternary causal depthwise conv (kernel=8) + gating
│ │ + Residual   │ │
│ ├──────────────┤ │
│ │ RMSNorm      │ │
│ │ TernaryGLU   │ │ ← ternary gated linear unit (SiLU activation)
│ │ + Residual   │ │
│ └──────────────┘ │
└────────┬─────────┘
         │
         ▼
┌──────────────┐
│   RMSNorm    │
│ Output Head  │  weight-tied to embedding (float)
└──────┬───────┘
       │
       ▼
Logits (10,000 vocab)
Component Details
BitLinear (Ternary Weights)
All linear projections in the model body use ternary quantization with a straight-through estimator:
alpha = mean(|W|)
W_ternary = clamp(round(W / alpha), -1, +1) × alpha
During training, gradients flow through the quantization via the straight-through estimator (STE). During inference, weights can be stored as 2-bit integers and all multiplications become additions, subtractions, or zeros.
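A minimal NumPy sketch of the absmean ternarization above (the STE backward pass is described in a comment; real training would implement it in an autograd framework):

```python
import numpy as np

def ternarize(W: np.ndarray) -> np.ndarray:
    """Absmean quantization: every weight becomes alpha * {-1, 0, +1}."""
    alpha = np.abs(W).mean()
    return np.clip(np.round(W / alpha), -1.0, 1.0) * alpha

# Straight-through estimator (STE): the forward pass uses ternarize(W); the
# backward pass pretends d(ternarize(W))/dW = 1, so gradients keep updating
# the latent float weights W.

W = np.array([[0.9, -0.1], [0.05, -1.2]])
Wq = ternarize(W)
print(Wq)  # every entry is one of {-0.5625, 0, +0.5625}, alpha = mean|W| = 0.5625
```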
GatedConvMixer (Token Mixing)
Instead of self-attention, FlashLM v4 uses a gated causal depthwise convolution:
gv = BitLinear_up(x) # project to 2*dim
gate, val = split(gv) # split into gate and value
gate = sigmoid(gate)
h = CausalDepthwiseConv1D(val, kernel_size=8)
output = BitLinear_down(h * gate)
This gives each layer a local window of 8 tokens; stacked over 6 layers, the effective receptive field is 1 + 6 × (8 − 1) = 43 tokens. The operation is O(T) in sequence length, with no quadratic attention cost.
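The causal depthwise convolution at the core of the mixer can be sketched as a naive NumPy reference loop (assuming a (T, C) activation layout; the real layer also wraps this in the ternary up/down projections and sigmoid gate):

```python
import numpy as np

def causal_depthwise_conv1d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """x: (T, C) activations, k: (K, C) one kernel per channel.
    Left-padding with K-1 zeros makes position t depend only on t-K+1 .. t."""
    K = k.shape[0]
    xp = np.pad(x, ((K - 1, 0), (0, 0)))
    return np.stack([(xp[t:t + K] * k).sum(axis=0) for t in range(x.shape[0])])

# A kernel whose only nonzero tap is the last (current-token) position
# reproduces the input exactly, confirming the causal alignment:
x = np.arange(6, dtype=float).reshape(3, 2)   # T=3, C=2
k = np.zeros((8, 2)); k[-1] = 1.0             # kernel_size=8, as in Bolt
print(np.allclose(causal_depthwise_conv1d(x, k), x))  # -> True
```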
TernaryGLU (Channel Mixing)
The feed-forward network uses a gated linear unit with ternary weights:
output = BitLinear_down(SiLU(BitLinear_gate(x)) * BitLinear_up(x))
Dimensions: 192 → 512 → 192 (expansion ratio ≈ 2.67×).
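The channel mixer's data flow can be sketched with plain float matrices (in the actual model the three weight matrices would be ternarized; shapes follow the 192 → 512 → 192 dimensions stated above):

```python
import numpy as np

def silu(z: np.ndarray) -> np.ndarray:
    return z / (1.0 + np.exp(-z))  # SiLU / swish activation

def ternary_glu(x, W_gate, W_up, W_down):
    """down(SiLU(gate(x)) * up(x)); weights are ternarized in the real model."""
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

rng = np.random.default_rng(0)
x = rng.normal(size=192)
W_gate = rng.normal(size=(512, 192))
W_up = rng.normal(size=(512, 192))
W_down = rng.normal(size=(192, 512))
print(ternary_glu(x, W_gate, W_up, W_down).shape)  # -> (192,)
```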
RMSNorm
Root Mean Square Layer Normalization (no bias, no mean subtraction) before every sub-layer.
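A minimal RMSNorm sketch (the epsilon value is an assumption; only the no-mean, no-bias structure is taken from the description above):

```python
import numpy as np

def rmsnorm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale by 1/RMS(x) and a learned per-channel gain; no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.array([3.0, -4.0])
y = rmsnorm(x, np.ones(2))
print(np.mean(y * y))  # ~1.0: unit root-mean-square after normalization
```

Dropping the mean subtraction and bias of LayerNorm saves a pass over the activations, which matters on a 2-thread CPU.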
Weight-Tied Embedding
The input embedding and output projection share the same weight matrix (float32, 10,000 Γ 192). This is the only float component in the model; everything else is ternary.
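Weight tying means one matrix plays both roles, which the following sketch illustrates (random data; shapes match the 10,000 × 192 matrix described above):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 192)).astype(np.float32)  # the one float matrix

token_id = 1234
h_in = E[token_id]                               # input side: a row lookup
h_out = rng.normal(size=192).astype(np.float32)  # final hidden state
logits = E @ h_out                               # output side: same matrix as projection
print(logits.shape)  # -> (10000,)
```

Tying halves the float parameter count relative to separate embedding and head matrices, a significant saving when the body is only ternary.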
Training Configuration
| Setting | Value |
|---|---|
| Dataset | TinyStories (streamed from HuggingFace) |
| Tokenizer | GPT-2 (tiktoken), top 10K tokens |
| Vocab size | 10,000 |
| Hidden dim | 192 |
| Blocks | 6 |
| Conv kernel | 8 |
| Sequence length | 256 |
| Batch size | 8 |
| Optimizer | AdamW (weight decay 0.01) |
| Peak learning rate | 4e-3 |
| LR schedule | Cosine with 200-step warmup |
| Gradient clipping | 1.0 |
| Total steps | 5,199 |
| Total tokens | 10,647,552 |
| Wall-clock time | 7,200 seconds (2.0 hours) |
| Hardware | 2 vCPU, 5 GB RAM (Deepnote free tier) |
| Training speed | 1,479 tokens/sec avg |
| Best val loss | 2.0976 (step 5000) |
Training Loss Curve
| Step | Val Loss |
|---|---|
| 500 | 2.8442 |
| 1000 | 2.5828 |
| 1500 | 2.4390 |
| 2000 | 2.2577 |
| 2500 | 2.1870 |
| 3000 | 2.1345 |
| 3500 | 2.0808 |
| 4000 | 2.1490 |
| 4500 | 2.1538 |
| 5000 | 2.0976 |
Aside from noise around steps 4000-4500, validation loss was still trending downward when training stopped. Extended training on more powerful CPU hardware is expected to push BPC significantly lower.
Inference
At inference time, the ternary weights mean the core model operations reduce to:
- Weight = +1: add the activation
- Weight = -1: subtract the activation
- Weight = 0: skip (no operation)
No floating-point multiplications are needed in the model body. The only float operations are the embedding lookup, RMSNorm scaling, and the tied output projection.
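The multiply-free matvec can be demonstrated directly (a reference sketch; in the real model each output is additionally rescaled once by the float alpha from quantization):

```python
import numpy as np

def ternary_matvec(W_t: np.ndarray, x: np.ndarray) -> np.ndarray:
    """W_t contains only {-1, 0, +1}: each output is a sum of selected inputs
    minus a sum of others -- no multiplications, zeros are skipped."""
    return np.array([x[row == 1].sum() - x[row == -1].sum() for row in W_t])

W_t = np.array([[1, 0, -1],
                [0, 1,  1]])
x = np.array([2.0, 3.0, 4.0])
print(ternary_matvec(W_t, x))                         # -> [-2.  7.]
print(np.allclose(ternary_matvec(W_t, x), W_t @ x))   # -> True
```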
Sample Generations
From the final checkpoint (step 5199):
Prompt: "Once upon a time"
Once upon a time, there was a little girl named []. She loved to play outside and explore the world. One day, she wanted to go outside. She went to the [] and saw a big tree. She wanted to catch it, but the [] was too small.
Prompt: "[] and his mom went to the"
[] and his mom went to the []. They had lots of fun and [] each other. And they never gave up. Once upon a time, there was a little girl called []. She loved to explore and find new things.
([] marks UNK tokens caused by the 10K vocabulary limitation; fixable by expanding the vocabulary.)
Limitations
- Vocabulary coverage: The 10K token vocabulary doesn't cover all words in TinyStories. Some outputs contain UNK tokens. Expanding to 15-20K tokens would fix this.
- Training data: Only 10.6M of TinyStories' 470M tokens were seen. The model is significantly undertrained.
- Receptive field: The effective receptive field of 1 + 6 × (8 − 1) = 43 tokens (kernel 8 across 6 layers) limits long-range coherence compared to attention-based models.
- Domain: Trained exclusively on children's stories. Not a general-purpose language model.
What's Next
- Extended training on a Ryzen 7950X3D (16 cores, 96MB V-Cache, 128GB RAM) for multiple days
- Scaled variant (~15M params, d=384, 8 blocks) with fixed tokenizer (frequency-based vocab)
- Standalone training script release (MIT license)
- Target: close the BPC gap with TinyStories-1M through longer training and proper vocab coverage
Previous Versions
- FlashLM v3 (13.6M): ternary weights on FineWeb-Edu, val loss 6.80, incoherent output
- FlashLM v3 Demo β interactive demo of v3
Citation
@misc{flashlm-v4-bolt,
title={FlashLM v4 "Bolt": A Ternary Language Model Trained on CPU},
author={Cheng Chang},
year={2026},
url={https://huggingface.co/changcheng967/flashlm-v4-bolt}
}
License
MIT