# FlashLM v8.3 - CORTEX-VIII
CPU-trained language model. 6.57M parameters. Trained from scratch in 2 hours on a free-tier cloud CPU.
## Architecture
CORTEX-VIII combines two complementary attention mechanisms per layer:
| Component | Role | Config |
|---|---|---|
| Sliding Window Attention | Local context (W=32 tokens) | 4 heads, d_head=64 |
| Gated Delta Memory | Global context via delta rule | d_mem=32, learnable decay |
| Lookahead Value Heads | Predict future loss for search-guided decoding | 1 per layer |
| SwiGLU FFN | Nonlinear mixing | d_ff=512 |
| RMSNorm | Layer normalization | Pre-norm |
| Weight Tying | Share embed/output weights | Enabled |
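The gated delta memory in the table above maintains a small matrix state per layer, written to with a delta-rule update and decayed by a learnable gate. Below is a minimal numpy sketch of one such update step, under assumed shapes and gating; the function and variable names are illustrative, not the actual FlashLM code:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule memory update (illustrative simplification).

    S     : (d_mem, d_model) memory matrix
    k     : (d_mem,) key for this token; v : (d_model,) value
    alpha : decay gate in (0, 1) (learnable in the real model)
    beta  : write-strength gate
    """
    recalled = S.T @ k            # what the memory currently returns for k
    err = v - recalled            # delta rule: write only the prediction error
    return alpha * S + beta * np.outer(k, err)

rng = np.random.default_rng(0)
k = rng.standard_normal(32); k /= np.linalg.norm(k)
v = rng.standard_normal(64)
S = gated_delta_step(np.zeros((32, 64)), k, v, alpha=0.9, beta=1.0)
# Starting from empty memory with a unit-norm key, reading back with the
# same key recovers v exactly: S.T @ k == v
```

Because the update subtracts what is already stored, repeated writes with the same key converge rather than accumulate, which is what lets a d_mem=32 state track global context alongside the local attention window.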
Additional training features:
- Entropy regularization (weight=0.01) to prevent the peaked distributions that cause repetition
- Nucleus sampling (top_p=0.85) + frequency penalty (1.2) at generation time
- Zero weight decay on embedding/output layers to preserve low-frequency token representations
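The entropy regularizer above can be read as cross-entropy minus a small entropy bonus, so the model is penalized for collapsing its output distribution onto a single token. A self-contained sketch of that objective for one position (the exact formulation in the training script may differ):

```python
import numpy as np

def entropy_regularized_loss(logits, target, ent_weight=0.01):
    """Cross-entropy minus an entropy bonus (weight=0.01 as listed above)."""
    z = logits - logits.max()                 # stabilized softmax
    p = np.exp(z) / np.exp(z).sum()
    ce = -np.log(p[target])                   # standard cross-entropy
    entropy = -(p * np.log(p + 1e-12)).sum()  # Shannon entropy of the prediction
    return ce - ent_weight * entropy          # higher entropy lowers the loss

loss = entropy_regularized_loss(np.array([2.0, 0.5, -1.0]), target=0)
```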
## Training Details
| Metric | Value |
|---|---|
| Dataset | TinyStories V2-GPT4 |
| Training subset | First 10M tokens (~1.3 epochs) |
| Hardware | 2 vCPU / 5GB RAM (free-tier cloud) |
| Training time | 2 hours |
| Validation PPL | 2.50 (best) |
| Throughput | 1,861 tokens/sec |
| Steps | 1,636 |
| Total tokens seen | 13.4M |
| Batch size | 4 × 8 gradient accumulation (effective batch 32) |
| Peak LR | 5e-4 (cosine decay to 1e-5) |
| Warmup | 100 steps |
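The schedule in the table (100 warmup steps, peak 5e-4, cosine decay to 1e-5 over 1,636 steps) can be sketched as follows. The warmup-then-cosine shape is assumed from the table; train_v83.py is the authoritative reference:

```python
import math

def lr_at(step, peak=5e-4, floor=1e-5, warmup=100, total=1636):
    """Linear warmup then cosine decay, using the values from the table above."""
    if step < warmup:
        return peak * step / warmup                # linear warmup from 0
    progress = (step - warmup) / (total - warmup)  # 0 at the peak, 1 at the end
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```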
## Model Lineup
| Version | Architecture | Params | PPL | Highlight |
|---|---|---|---|---|
| v7.4 CORTEX-VIII | Gated DeltaNet + SWA | 6.6M | 2.33 | Best PPL |
| v8.1 SearchLM | CORTEX + lookahead value heads | 6.6M | 2.40 | V_Corr +0.66 |
| v8.2 CORTEX-VIII | + 20M subset + entropy reg | 6.6M | 2.42 | Broke repetition loops |
| v8.3 CORTEX-VIII | + 10M subset, d_ff=512 | 6.6M | 2.50 | Best generation diversity |
| v8.4 CORTEX-IX | + full context SWA + 2x memory | ~6.8M | TBD | In progress |
## Files

| File | Description |
|---|---|
| `best.pt` | Best checkpoint (lowest validation loss) |
| `final.pt` | Final checkpoint with full config and training results |
| `tokenizer.json` | Byte-level BPE tokenizer (vocab=4,096) |
| `results.json` | Training metrics summary |
## Usage

```python
import torch
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Load the model checkpoint
ckpt = torch.load("best.pt", map_location="cpu")
print(f"Val PPL: {ckpt['val_ppl']:.2f}")

# For the full model architecture, see:
# https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py
```
## Generation Example

Prompt: "Once upon a time"

Output:

> Once upon a time . sun like . helped look this ! began bed to . thought cake a and fish him Tom Mr Bunny fish . looked Ben place ! thinks book ...

Generation uses nucleus sampling (temperature=1.2, top_p=0.85) with a frequency penalty of 1.2 to maximize diversity.
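The decoding procedure described above (temperature, nucleus cut-off, frequency penalty) can be sketched as a single sampling step. The helper below is illustrative; the real decoding loop lives in train_v83.py:

```python
import numpy as np

def sample_next(logits, prev_counts, temperature=1.2, top_p=0.85,
                freq_penalty=1.2, rng=None):
    """One decoding step: temperature + frequency penalty, then nucleus sampling.

    prev_counts[i] = how often token i has already appeared in the output.
    """
    rng = rng or np.random.default_rng()
    scores = logits / temperature - freq_penalty * prev_counts
    z = scores - scores.max()
    p = np.exp(z) / np.exp(z).sum()
    order = np.argsort(p)[::-1]            # tokens by descending probability
    keep = np.cumsum(p[order]) <= top_p    # nucleus: smallest set covering top_p
    keep[0] = True                         # always keep the most likely token
    nucleus = order[keep]
    p_n = p[nucleus] / p[nucleus].sum()    # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=p_n))

tok = sample_next(np.array([10.0, 0.0, 0.0, 0.0]), np.zeros(4),
                  rng=np.random.default_rng(0))
```

The frequency penalty directly subtracts from the score of tokens already emitted, which is what suppresses the repetition loops that greedy decoding falls into.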
## Limitations

- Grammar is broken: the model learned vocabulary and word statistics (PPL 2.50) but not sentence structure. Greedy decoding produces repetition loops; sampling produces diverse but ungrammatical text.
- SWA window too small: W=32 (~8 words) cannot capture the cross-sentence dependencies needed for grammar.
- Undertrained: 13.4M tokens seen vs. 574M in the full dataset; the model needs broader data coverage.
- v8.4 (CORTEX-IX) addresses these with full-context attention (W=256) and doubled memory capacity.
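The W=32 limitation is easiest to see in the attention mask itself: each position attends only to the previous W tokens, so anything further back must survive through the 32-dimensional delta memory. A small sketch of such a mask (illustrative, shapes only):

```python
import numpy as np

def sliding_window_mask(T, W):
    """Causal sliding-window mask: position t attends to tokens [t-W+1, t]."""
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    return (j <= i) & (j > i - W)

mask = sliding_window_mask(6, W=3)
```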
## Citation

```bibtex
@misc{flashlm,
  author = {Cheng Chang},
  title  = {FlashLM: CPU-Native Ternary Language Models},
  year   = {2026},
  url    = {https://github.com/changcheng967/FlashLM}
}
```
## Links
- GitHub: changcheng967/FlashLM
- Code: train_v83.py
Trained by Cheng Chang. Architecture design assistance by Claude Code (Anthropic).