FlashLM v8.3 — CORTEX-VIII

CPU-trained language model. 6.57M parameters. Trained from scratch in 2 hours on a free-tier cloud CPU.


Architecture

CORTEX-VIII combines two complementary attention mechanisms per layer:

Component                 Role                                            Config
Sliding Window Attention  Local context (W=32 tokens)                     4 heads, d_head=64
Gated Delta Memory        Global context via delta rule                   d_mem=32, learnable decay
Lookahead Value Heads     Predict future loss for search-guided decoding  1 per layer
SwiGLU FFN                Nonlinear mixing                                d_ff=512
RMSNorm                   Layer normalization                             Pre-norm
Weight Tying              Shared embedding/output weights                 —
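
The full layer code lives in the linked train_v83.py; the sketch below is illustrative only (names and shapes are not the repo's), showing the two core pieces: a W=32 sliding-window causal mask and a single gated delta-rule memory write.

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 32) -> torch.Tensor:
    """Causal mask where position i attends only to the previous `window` tokens."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    return (j <= i) & (j > i - window)       # True = attention allowed

def gated_delta_update(S: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       beta: float, decay: float) -> torch.Tensor:
    """One delta-rule write to a fast-weight memory S of shape (d_v, d_k):
    decay the old state, then overwrite the value currently bound to key k."""
    v_pred = S @ k                           # what the memory currently returns for k
    return decay * S + beta * torch.outer(v - v_pred, k)

mask = sliding_window_mask(seq_len=64)       # used as an attention-score bias in practice
S = gated_delta_update(torch.zeros(256, 32), # shapes illustrative: d_v=256, d_mem=32
                       k=torch.eye(32)[0], v=torch.randn(256),
                       beta=1.0, decay=1.0)
```

With a unit-norm key, `beta=1.0`, and `decay=1.0`, reading the memory back with the same key returns exactly the written value, which is the "erase then write" behavior that distinguishes the delta rule from a plain additive outer-product memory.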

Additional training features:

  • Entropy regularization (weight=0.01) — prevents peaked distributions that cause repetition
  • Nucleus sampling (top_p=0.85) + frequency penalty (1.2) at generation time
  • Zero weight decay on embedding/output layers to preserve low-frequency token representations
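
As a sketch of how an entropy term with weight 0.01 can enter the loss (the training script's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def loss_with_entropy_reg(logits, targets, ent_weight=0.01):
    """Cross-entropy minus a small entropy bonus: low-entropy (peaked)
    next-token distributions raise the loss, discouraging the
    degenerate repetition loops described above."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - ent_weight * entropy

logits = torch.randn(2, 5, 4096)            # (batch, seq, vocab=4,096)
targets = torch.randint(0, 4096, (2, 5))
loss = loss_with_entropy_reg(logits, targets)
```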

Training Details

Metric             Value
Dataset            TinyStories V2-GPT4
Training subset    First 10M tokens (~1.3 epochs)
Hardware           2 vCPU / 5 GB RAM (free-tier cloud)
Training time      2 hours
Validation PPL     2.50 (best checkpoint)
Throughput         1,861 tokens/sec
Steps              1,636
Total tokens seen  13.4M
Batch size         4, with 8-step gradient accumulation (effective 32)
Peak LR            5e-4 (cosine decay to 1e-5)
Warmup             100 steps
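
A sketch of the schedule these numbers describe (linear warmup for 100 steps, then cosine decay from the 5e-4 peak to the 1e-5 floor over the 1,636 training steps):

```python
import math

def lr_at(step, peak=5e-4, floor=1e-5, warmup=100, total=1636):
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0.0 at end of warmup, 1.0 at last step
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

lr_at(0)     # 0.0 (start of warmup)
lr_at(100)   # 5e-4 (peak)
lr_at(1636)  # 1e-5 (floor)
```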

Model Lineup

Version  Architecture                               Params  PPL   Highlight
v7.4     CORTEX-VIII (Gated DeltaNet + SWA)         6.6M    2.33  Best PPL
v8.1     SearchLM (CORTEX + lookahead value heads)  6.6M    2.40  V_Corr +0.66
v8.2     CORTEX-VIII + 20M subset + entropy reg     6.6M    2.42  Broke repetition loops
v8.3     CORTEX-VIII + 10M subset, d_ff=512         6.6M    2.50  Best generation diversity
v8.4     CORTEX-IX (full-context SWA + 2× memory)   ~6.8M   TBD   In progress

Files

File            Description
best.pt         Best checkpoint (lowest validation loss)
final.pt        Final checkpoint with full config and training results
tokenizer.json  Byte-level BPE tokenizer (vocab=4,096)
results.json    Training metrics summary

Usage

import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Load model checkpoint
ckpt = torch.load("best.pt", map_location="cpu")
print(f"Val PPL: {ckpt['val_ppl']:.2f}")

# For full model architecture, see:
# https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py

Generation Example

Prompt: "Once upon a time"
Output: "Once upon a time . sun like . helped look this ! began bed to .
         thought cake a and fish him Tom Mr Bunny fish . looked Ben place !
         thinks book ..."

Generation uses nucleus sampling (temperature=1.2, top_p=0.85) with frequency penalty (1.2) to maximize diversity.
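
The settings above can be sketched as a single decoding step. The card doesn't specify the exact penalty formulation, so the multiplicative form below (divide positive logits, multiply negative ones, CTRL-style) is an assumption:

```python
import torch

def sample_next(logits, prev_ids, temperature=1.2, top_p=0.85, freq_penalty=1.2):
    """One decoding step: penalize tokens already generated, then nucleus-sample."""
    logits = logits / temperature
    for t in set(prev_ids):                       # assumed multiplicative penalty
        logits[t] = logits[t] / freq_penalty if logits[t] > 0 else logits[t] * freq_penalty
    probs = torch.softmax(logits, dim=-1)
    p_sorted, idx = probs.sort(descending=True)
    keep = p_sorted.cumsum(0) - p_sorted < top_p  # smallest prefix with mass >= top_p
    p_sorted = p_sorted * keep                    # zero out the truncated tail
    return idx[torch.multinomial(p_sorted / p_sorted.sum(), 1)].item()

torch.manual_seed(0)
next_tok = sample_next(torch.randn(4096), prev_ids=[17, 17, 42])
```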

Limitations

  • Grammar is broken — the model learned vocabulary and word statistics (PPL 2.50) but not sentence structure. Greedy decoding produces repetition loops; sampling produces diverse but ungrammatical text.
  • SWA window too small — W=32 (~8 words) can't capture the cross-sentence dependencies needed for grammar.
  • Undertrained — 13.4M tokens seen vs 574M in the full dataset; the model needs more data coverage.
  • v8.4 (CORTEX-IX) addresses these with full-context attention (W=256) and doubled memory capacity.

Citation

@misc{flashlm,
  author = {Cheng Chang},
  title = {FlashLM: CPU-Native Ternary Language Models},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}

Trained by Cheng Chang. Architecture design assistance by Claude Code (Anthropic).
