FlashLM v4 "Bolt" β€” 4.3M Ternary Language Model Trained on CPU in 2 Hours

A 4.3M parameter language model with ternary weights (-1, 0, +1) trained from scratch on a free-tier 2-thread CPU in 2 hours. No GPU used at any point. The model generates coherent children's stories with dialogue and narrative structure.

BPC Evaluation

Bits-per-character (BPC) is a tokenizer-independent metric that fairly compares models with different vocabularies. It measures how well a model compresses raw text: BPC = total_cross_entropy_nats / (total_characters × ln(2)). Lower is better.
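
A minimal sketch of that calculation (a hypothetical helper, not the evaluation script used here):

```python
import math

def bits_per_character(total_ce_nats: float, total_characters: int) -> float:
    """Convert a cross-entropy summed over all predicted tokens (in nats)
    into bits-per-character, so models with different tokenizers compare fairly."""
    return total_ce_nats / (total_characters * math.log(2))

# Example: cross-entropy summed over the 500 validation stories,
# divided over their 405,081 characters.
# bpc = bits_per_character(total_ce_nats, 405_081)
```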

| | FlashLM v4 "Bolt" | TinyStories-1M |
|---|---|---|
| Parameters | 4.3M (ternary) | 3.7M (float32) |
| BPC | 0.8798 | 0.6182 |
| Perplexity | 15.05 | 6.72 |
| Training hardware | 2-thread CPU (free tier) | V100 GPU |
| Training time | 2 hours | Hours (GPU) |
| Tokens seen | 10.6M | ~470M |
| Architecture | Gated Conv + GLU (no attention) | GPT-Neo (attention) |
| Weight precision | Ternary (-1, 0, +1) | Float32 |

FlashLM v4 has seen only 2.3% of the training data that TinyStories-1M used. The validation loss was still declining when the 2-hour time limit was reached; the model is undertrained, not underdesigned. Extended training is planned.

Evaluated on 500 TinyStories validation stories (405,081 characters).


What Makes This Interesting

This model was trained from scratch on a free-tier 2-thread CPU notebook (Deepnote) with 5 GB of RAM. No GPU was used at any stage: not for pretraining, not for fine-tuning, not for inference.

The entire model fits in 16.7 MB and uses only addition and subtraction for its core operations during inference. Every weight in the model body is constrained to one of three values: -1, 0, or +1.

Despite seeing only 2.3% of the training data, the model generates coherent children's stories with dialogue, character interactions, and basic narrative structure.


Evolution from v3

FlashLM v4 is a complete redesign from v3. The two versions are not directly comparable by validation loss because they were trained on different datasets with different vocabularies.

| | FlashLM v3 | FlashLM v4 "Bolt" |
|---|---|---|
| Parameters | 13.6M | 4.3M |
| Vocab size | 50,257 (full GPT-2) | 10,000 (top 10K) |
| Dataset | FineWeb-Edu | TinyStories |
| Tokens seen | 32M | 10.6M |
| Training time | 1.2 hours | 2.0 hours |
| Hardware | 2-thread CPU | 2-thread CPU |
| Token mixer | Custom ternary layers | Gated causal convolution |
| Channel mixer | None | Ternary GLU (SiLU) |
| Normalization | LayerNorm | RMSNorm |
| Output head | Full 50,257 projection (86% of compute) | Weight-tied 10K projection |
| Output quality | Incoherent / random tokens | Coherent children's stories with dialogue |

What went wrong with v3

v3's biggest bottleneck was the output projection layer. With a 50,257-token vocabulary and d_model=256, the softmax head consumed 86% of all training compute, leaving the ternary model core starved. Training on FineWeb-Edu (a broad web corpus) made things worse: the data was too diverse for a 13.6M-parameter model to learn anything coherent.

What v4 changed

v4 attacked every root cause: shrink the vocabulary from 50K to 10K (eliminating the softmax bottleneck), switch to TinyStories (a focused corpus proven to work at small scale), replace the token mixer with gated causal convolutions, add a proper GLU channel mixer, and use weight-tied embeddings. The result: a model 3x smaller that produces coherent text instead of gibberish.


Architecture

FlashLM v4 "Bolt" is a non-transformer sequential language model that replaces attention with gated causal convolutions and uses ternary (1.58-bit) weights throughout.

Overview

Input Token IDs
       │
       ▼
┌──────────────┐
│  Embedding   │  10,000 × 192 (float, weight-tied with output head)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  BoltBlock   │  × 6 (each block identical structure, independent weights)
│              │
│ ┌──────────┐ │
│ │ RMSNorm  │ │
│ │ GatedConv│ │  ← Ternary causal depthwise conv (kernel=8) + gating
│ │ +residual│ │
│ ├──────────┤ │
│ │ RMSNorm  │ │
│ │TernaryGLU│ │  ← Ternary gated linear unit (SiLU activation)
│ │ +residual│ │
│ └──────────┘ │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   RMSNorm    │
│  Output Head │  Weight-tied to embedding (float)
└──────┬───────┘
       │
       ▼
   Logits (10,000 vocab)

Component Details

BitLinear (Ternary Weights)

All linear projections in the model body use ternary quantization with a straight-through estimator:

alpha = mean(|W|)
W_ternary = clamp(round(W / alpha), -1, +1) × alpha

During training, gradients flow through the quantization via the straight-through estimator (STE). During inference, weights can be stored as 2-bit integers and all multiplications become additions, subtractions, or zeros.
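
A minimal PyTorch sketch of such a layer, using the per-tensor scale from the formula above (class and variable names are illustrative, not necessarily the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly.
    The forward pass uses the quantized weights; the straight-through
    estimator routes gradients back to the full-precision master weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.abs().mean()                              # per-tensor scale
        w_q = torch.clamp(torch.round(w / alpha), -1, 1) * alpha
        w_ste = w + (w_q - w).detach()                      # STE: forward w_q, backward w
        return F.linear(x, w_ste, self.bias)
```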

GatedConvMixer (Token Mixing)

Instead of self-attention, FlashLM v4 uses a gated causal depthwise convolution:

gv = BitLinear_up(x)           # project to 2*dim
gate, val = split(gv)          # split into gate and value
gate = sigmoid(gate)
h = CausalDepthwiseConv1D(val, kernel_size=8)
output = BitLinear_down(h * gate)

This gives the model a receptive field of 8 tokens per layer, or 48 tokens across all 6 layers. The operation is O(T) in sequence length with no quadratic attention cost.
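
A sketch of what this mixer could look like in PyTorch, reusing the BitLinear sketch above (the bias-free projections and padding details are assumptions):

```python
class GatedConvMixer(nn.Module):
    """Token mixer: sigmoid-gated causal depthwise convolution."""

    def __init__(self, dim: int, kernel_size: int = 8):
        super().__init__()
        self.kernel_size = kernel_size
        self.up = BitLinear(dim, 2 * dim, bias=False)     # project to 2*dim (ternary)
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)
        self.down = BitLinear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        gate, val = self.up(x).chunk(2, dim=-1)
        gate = torch.sigmoid(gate)
        # Left-pad by kernel_size - 1 so each position only sees past tokens.
        v = F.pad(val.transpose(1, 2), (self.kernel_size - 1, 0))
        h = self.conv(v).transpose(1, 2)
        return self.down(h * gate)
```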

TernaryGLU (Channel Mixing)

The feed-forward network uses a gated linear unit with ternary weights:

output = BitLinear_down(SiLU(BitLinear_gate(x)) * BitLinear_up(x))

Dimensions: 192 → 512 → 192 (expansion ratio ≈ 2.67x).
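
In PyTorch terms, again reusing the BitLinear sketch (a sketch with the dimensions above; names are illustrative):

```python
class TernaryGLU(nn.Module):
    """Channel mixer: SiLU-gated linear unit built from ternary projections."""

    def __init__(self, dim: int = 192, hidden: int = 512):
        super().__init__()
        self.gate = BitLinear(dim, hidden, bias=False)
        self.up = BitLinear(dim, hidden, bias=False)
        self.down = BitLinear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```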

RMSNorm

Root Mean Square Layer Normalization (no bias, no mean subtraction) before every sub-layer.
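
A standard RMSNorm implementation for reference (the epsilon value is an assumption):

```python
class RMSNorm(nn.Module):
    """Scale each feature vector by its root-mean-square; no bias, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # learned gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```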

Weight-Tied Embedding

The input embedding and output projection share the same weight matrix (float32, 10,000 × 192). Apart from the small RMSNorm gain vectors, this is the only float-precision parameter block in the model; everything else is ternary.
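
Weight tying can be expressed in a couple of lines (a sketch of the idea, not the actual module layout):

```python
vocab_size, dim = 10_000, 192
embedding = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embedding.weight   # one shared 10,000 x 192 float matrix for input and output
```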


Training Configuration

| Setting | Value |
|---|---|
| Dataset | TinyStories (streamed from HuggingFace) |
| Tokenizer | GPT-2 (tiktoken), top 10K tokens |
| Vocab size | 10,000 |
| Hidden dim | 192 |
| Blocks | 6 |
| Conv kernel | 8 |
| Sequence length | 256 |
| Batch size | 8 |
| Optimizer | AdamW (weight decay 0.01) |
| Peak learning rate | 4e-3 |
| LR schedule | Cosine with 200-step warmup |
| Gradient clipping | 1.0 |
| Total steps | 5,199 |
| Total tokens | 10,647,552 |
| Wall-clock time | 7,200 seconds (2.0 hours) |
| Hardware | 2 vCPU, 5 GB RAM (Deepnote free tier) |
| Training speed | 1,479 tokens/sec avg |
| Best val loss | 2.0976 (step 5000) |
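
For reference, the warmup-plus-cosine schedule in the table corresponds to something like the following (the minimum-LR floor is an assumption; the run may have used a different one):

```python
import math

def learning_rate(step: int, peak_lr: float = 4e-3, warmup: int = 200,
                  total_steps: int = 5_199, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay over the remaining steps."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```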

Training Loss Curve

| Step | Val Loss |
|---|---|
| 500 | 2.8442 |
| 1000 | 2.5828 |
| 1500 | 2.4390 |
| 2000 | 2.2577 |
| 2500 | 2.1870 |
| 3000 | 2.1345 |
| 3500 | 2.0808 |
| 4000 | 2.1490 |
| 4500 | 2.1538 |
| 5000 | 2.0976 |

The validation loss had not plateaued by the end of the run. Extended training on more powerful CPU hardware is expected to push BPC lower.


Inference

At inference time, the ternary weights mean the core model operations reduce to:

  • Weight = +1: add the activation
  • Weight = -1: subtract the activation
  • Weight = 0: skip (no operation)

No floating-point multiplications are needed in the model body. The only float operations are the embedding lookup, RMSNorm scaling, and the tied output projection.
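
As a toy illustration (not the release inference code), a ternary matrix-vector product can be written with additions and subtractions only, plus a single rescale:

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray, alpha: float) -> np.ndarray:
    """w_ternary holds only {-1, 0, +1}; each output element is a sum of the
    activations kept by +1 weights minus those hit by -1 weights, scaled once by alpha."""
    pos = np.where(w_ternary == 1, x, 0.0).sum(axis=1)    # add where weight is +1
    neg = np.where(w_ternary == -1, x, 0.0).sum(axis=1)   # subtract where weight is -1
    return alpha * (pos - neg)
```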


Sample Generations

From the final checkpoint (step 5199):

Prompt: "Once upon a time"

Once upon a time, there was a little girl named []. She loved to play outside and explore the world. One day, she wanted to go outside. She went to the [] and saw a big tree. She wanted to catch it, but the [] was too small.

Prompt: "[] and his mom went to the"

[] and his mom went to the []. They had lots of fun and [] each other. And they never gave up. Once upon a time, there was a little girl called []. She loved to explore and find new things.

([] indicates UNK tokens from the 10K vocab limitation; fixable by expanding the vocabulary)


Limitations

  • Vocabulary coverage: The 10K token vocabulary doesn't cover all words in TinyStories. Some outputs contain UNK tokens. Expanding to 15-20K tokens would fix this.
  • Training data: Only 10.6M of TinyStories' 470M tokens were seen. The model is significantly undertrained.
  • Receptive field: The 48-token effective receptive field (kernel 8 × 6 layers) limits long-range coherence compared to attention-based models.
  • Domain: Trained exclusively on children's stories. Not a general-purpose language model.

What's Next

  • Extended training on a Ryzen 7950X3D (16 cores, 96MB V-Cache, 128GB RAM) for multiple days
  • Scaled variant (~15M params, d=384, 8 blocks) with fixed tokenizer (frequency-based vocab)
  • Standalone training script release (MIT license)
  • Target: close the BPC gap with TinyStories-1M through longer training and proper vocab coverage

Citation

@misc{flashlm-v4-bolt,
  title={FlashLM v4 "Bolt": A Ternary Language Model Trained on CPU},
  author={Cheng Chang},
  year={2026},
  url={https://huggingface.co/changcheng967/flashlm-v4-bolt}
}

License

MIT
