FlashLM v8.3 — CORTEX-VIII

CPU-trained language model. 6.57M parameters. Trained from scratch in 2 hours on a free-tier cloud CPU.


Architecture

CORTEX-VIII combines two complementary attention mechanisms per layer:

Component                 Role                                            Config
Sliding Window Attention  Local context (W=32 tokens)                     4 heads, d_head=64
Gated Delta Memory        Global context via delta rule                   d_mem=32, learnable decay
Lookahead Value Heads     Predict future loss for search-guided decoding  1 per layer
SwiGLU FFN                Nonlinear mixing                                d_ff=512
RMSNorm                   Layer normalization                             Pre-norm
Weight Tying              Shared embedding/output weights                 —
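
The full layer code lives in the linked train_v83.py; the sketch below is illustrative only (names and shapes are not the repo's), showing the two core pieces: a W=32 sliding-window causal mask and a single gated delta-rule memory write.

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 32) -> torch.Tensor:
    """Causal mask where position i attends only to the previous `window` tokens."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    return (j <= i) & (j > i - window)       # True = attention allowed

def gated_delta_update(S: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       beta: float, decay: float) -> torch.Tensor:
    """One delta-rule write to a fast-weight memory S of shape (d_v, d_k):
    decay the old state, then overwrite the value currently bound to key k."""
    v_pred = S @ k                           # what the memory currently returns for k
    return decay * S + beta * torch.outer(v - v_pred, k)

mask = sliding_window_mask(seq_len=64)       # used as an attention-score bias in practice
S = gated_delta_update(torch.zeros(256, 32), # shapes illustrative: d_v=256, d_mem=32
                       k=torch.eye(32)[0], v=torch.randn(256),
                       beta=1.0, decay=1.0)
```

With a unit-norm key, `beta=1.0`, and `decay=1.0`, reading the memory back with the same key returns exactly the written value, which is the "erase then write" behavior that distinguishes the delta rule from a plain additive outer-product memory.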

Additional training features:

  • Entropy regularization (weight=0.01) — prevents peaked distributions that cause repetition
  • Nucleus sampling (top_p=0.85) + frequency penalty (1.2) at generation time
  • Zero weight decay on embedding/output layers to preserve low-frequency token representations
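
As a sketch of how an entropy term with weight 0.01 can enter the loss (the training script's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def loss_with_entropy_reg(logits, targets, ent_weight=0.01):
    """Cross-entropy minus a small entropy bonus: low-entropy (peaked)
    next-token distributions raise the loss, discouraging the
    degenerate repetition loops described above."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - ent_weight * entropy

logits = torch.randn(2, 5, 4096)            # (batch, seq, vocab=4,096)
targets = torch.randint(0, 4096, (2, 5))
loss = loss_with_entropy_reg(logits, targets)
```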

Training Details

Metric             Value
Dataset            TinyStories V2-GPT4
Training subset    First 10M tokens (~1.3 epochs)
Hardware           2 vCPU / 5 GB RAM (free-tier cloud)
Training time      2 hours
Validation PPL     2.50 (best checkpoint)
Throughput         1,861 tokens/sec
Steps              1,636
Total tokens seen  13.4M
Batch size         4, with 8-step gradient accumulation (effective 32)
Peak LR            5e-4 (cosine decay to 1e-5)
Warmup             100 steps
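
A sketch of the schedule these numbers describe (linear warmup for 100 steps, then cosine decay from the 5e-4 peak to the 1e-5 floor over the 1,636 training steps):

```python
import math

def lr_at(step, peak=5e-4, floor=1e-5, warmup=100, total=1636):
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0.0 at end of warmup, 1.0 at last step
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

lr_at(0)     # 0.0 (start of warmup)
lr_at(100)   # 5e-4 (peak)
lr_at(1636)  # 1e-5 (floor)
```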

Model Lineup

Version  Architecture                               Params  PPL   Highlight
v7.4     CORTEX-VIII (Gated DeltaNet + SWA)         6.6M    2.33  Best PPL
v8.1     SearchLM (CORTEX + lookahead value heads)  6.6M    2.40  V_Corr +0.66
v8.2     CORTEX-VIII + 20M subset + entropy reg     6.6M    2.42  Broke repetition loops
v8.3     CORTEX-VIII + 10M subset, d_ff=512         6.6M    2.50  Best generation diversity
v8.4     CORTEX-IX (full-context SWA + 2× memory)   ~6.8M   TBD   In progress

Files

File            Description
best.pt         Best checkpoint (lowest validation loss)
final.pt        Final checkpoint with full config and training results
tokenizer.json  Byte-level BPE tokenizer (vocab=4,096)
results.json    Training metrics summary

Usage

import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Load model checkpoint
ckpt = torch.load("best.pt", map_location="cpu")
print(f"Val PPL: {ckpt['val_ppl']:.2f}")

# For full model architecture, see:
# https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py

Generation Example

Prompt: "Once upon a time"
Output: "Once upon a time . sun like . helped look this ! began bed to .
         thought cake a and fish him Tom Mr Bunny fish . looked Ben place !
         thinks book ..."

Generation uses nucleus sampling (temperature=1.2, top_p=0.85) with frequency penalty (1.2) to maximize diversity.
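
The settings above can be sketched as a single decoding step. The card doesn't specify the exact penalty formulation, so the multiplicative form below (divide positive logits, multiply negative ones, CTRL-style) is an assumption:

```python
import torch

def sample_next(logits, prev_ids, temperature=1.2, top_p=0.85, freq_penalty=1.2):
    """One decoding step: penalize tokens already generated, then nucleus-sample."""
    logits = logits / temperature
    for t in set(prev_ids):                       # assumed multiplicative penalty
        logits[t] = logits[t] / freq_penalty if logits[t] > 0 else logits[t] * freq_penalty
    probs = torch.softmax(logits, dim=-1)
    p_sorted, idx = probs.sort(descending=True)
    keep = p_sorted.cumsum(0) - p_sorted < top_p  # smallest prefix with mass >= top_p
    p_sorted = p_sorted * keep                    # zero out the truncated tail
    return idx[torch.multinomial(p_sorted / p_sorted.sum(), 1)].item()

torch.manual_seed(0)
next_tok = sample_next(torch.randn(4096), prev_ids=[17, 17, 42])
```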

Limitations

  • Grammar is broken — the model learned vocabulary and word statistics (PPL 2.50) but not sentence structure. Greedy decoding produces repetition loops; sampling produces diverse but ungrammatical text.
  • SWA window too small — W=32 (~8 words) can't capture the cross-sentence dependencies needed for grammar.
  • Undertrained — 13.4M tokens seen vs 574M in the full dataset; the model needs more data coverage.
  • v8.4 (CORTEX-IX) addresses these with full-context attention (W=256) and doubled memory capacity.

Citation

@misc{flashlm,
  author = {Cheng Chang},
  title = {FlashLM: CPU-Native Ternary Language Models},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}

Trained by Cheng Chang. Architecture design assistance by Claude Code (Anthropic).
