Lumia Tiny

Lumia Tiny (PCT-V3)

Custom PyTorch language model with 969,880 parameters (~970K). Architecture built from first principles, not copied from existing papers.

Architecture Overview

Core Components

| Component | Name | Description |
|---|---|---|
| VCR | Variance-Controlled Residual | 96-dim bottleneck with R² gating. Regularizes residual connections by projecting through a low-rank space. |
| RPW | Relative Positional Warp | Learned 2D Fourier rotation matrix. Encodes relative position as a continuous rotation in hidden space. |
| GPP | Gated Positional Projection | Position-aware gating with learned mixing weights. Combines positional and content information. |
| ALiBi | Attention with Linear Biases | Linear distance-based attention bias. No learned positional embeddings needed. |
| GQA | Grouped Query Attention | 8 query heads, 4 KV heads. Each KV head is shared by a group of query heads for efficiency. |
| RMSNorm | Root Mean Square Normalization | Layer normalization without mean centering. Faster than LayerNorm. |
| SiLU | Sigmoid Linear Unit | Activation used in the SwiGLU MLP gate. Smooth gating for better gradient flow. |
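
To make the ALiBi row concrete, here is a minimal pure-Python sketch of a linear distance bias for causal attention. The slope formula follows the geometric sequence from the ALiBi paper for a power-of-two head count; whether Lumia Tiny uses exactly these slopes is an assumption, and `alibi_slopes` and `alibi_bias` are illustrative names, not functions from `model_tiny.py`.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads (n a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias: entry [i][j] is -slope * (i - j) for j <= i, -inf for future positions."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)         # 8 query heads, per the table above
bias = alibi_bias(slopes[0], 4)  # bias for the first head on a 4-token sequence

for row in bias:
    print(row)
```

The bias is added to the attention scores before the softmax, so more distant keys are penalized linearly and no positional embedding table is needed.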

Model Specifications

Parameters:     969,880 (0.97M)
Vocab:          4,096 (BPE, 58 textbooks)
Hidden:         128
Layers:         6
Heads:          8 query / 4 KV
Head dim:       16
Code dim:       96 (VCR bottleneck)
Max seq len:    2,048
Tied embeds:    Yes (token_embed = lm_head)
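
The numbers above can be cross-checked with a little arithmetic. The sketch below derives the head dimension, the query-to-KV grouping, and the share of parameters held by the tied embedding. The contiguous grouping (query heads 0 and 1 share KV head 0, and so on) is a common convention and is assumed here, not confirmed by the source.

```python
hidden, n_q_heads, n_kv_heads, vocab, total_params = 128, 8, 4, 4096, 969_880

head_dim = hidden // n_q_heads        # hidden size split evenly across query heads
group_size = n_q_heads // n_kv_heads  # query heads sharing each KV head

# Assumed contiguous grouping: query head q reads KV head q // group_size.
kv_head_for_query = [q // group_size for q in range(n_q_heads)]

embed_params = vocab * hidden         # tied with lm_head, so counted once
embed_share = embed_params / total_params

print(head_dim, group_size, kv_head_for_query)
print(embed_params, f"{embed_share:.1%}")
```

The tied embedding matrix alone accounts for a bit over half of the parameter budget, which is typical for models this small.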

Architecture Diagram

Input tokens
    │
    ▼
[Token Embedding] (4096 × 128)
    │
    ▼
┌─────────────────────────────────────────┐
│           Transformer Block ×6          │
│  ┌─────────────────────────────────┐    │
│  │  RMSNorm → GQA Attention        │    │
│  │  (ALiBi bias, GQA 8/4)          │    │
│  │      ↓                          │    │
│  │  VCR: hidden → 96 → hidden      │    │
│  │  (variance-controlled)          │    │
│  │      ↓                          │    │
│  │  Residual Add                   │    │
│  └─────────────────────────────────┘    │
│                 │                       │
│  ┌─────────────────────────────────┐    │
│  │  RMSNorm → SwiGLU MLP           │    │
│  │  (gate × up → down)             │    │
│  │      ↓                          │    │
│  │  RPW: relative position warp    │    │
│  │  GPP: gated positional proj     │    │
│  │      ↓                          │    │
│  │  Residual Add                   │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
    │
    ▼
[RMSNorm] → [LM Head] → Logits
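
The second half of each block (RMSNorm, then the SwiGLU MLP, then a residual add) can be sketched in pure Python on toy-sized vectors. This is an illustration of the standard pre-norm pattern, not the code from `model_tiny.py`: the VCR, RPW, and GPP stages are omitted, and all function names and weights are hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_mlp(x, w_gate, w_up, w_down):
    """down(silu(gate(x)) * up(x)), the 'gate × up → down' path in the diagram."""
    gated = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
    return matvec(w_down, gated)


def residual_sublayer(x, norm_weight, sublayer):
    """Pre-norm residual: x + sublayer(RMSNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x, norm_weight)))]


# Toy sizes (hidden=2, MLP width=3); the real model uses hidden=128.
x = [3.0, 4.0]
norm_w = [1.0, 1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

y = residual_sublayer(x, norm_w, lambda h: swiglu_mlp(h, w_gate, w_up, w_down))
print(y)
```

Because the sublayer output is added to the unnormalized input, a sublayer that returns zeros leaves the hidden state unchanged.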

Training

  • Dataset: AI-MO/NuminaMath-CoT (math reasoning with CoT)
  • Method: QLoRA (NF4 quantization + LoRA, r=8, α=16)
  • Optimizer: AdamW, LR 5e-4, cosine schedule, warmup 10%
  • Steps: 50,000 (effective batch 16)
  • Tokenizer: BPE trained on 58 Project Gutenberg textbooks
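
The optimizer settings above translate into a learning-rate curve like the one sketched here. The source only states a cosine schedule with 10% warmup; the linear warmup shape and the decay to zero are assumptions, and `lr_at` is a hypothetical helper rather than a function from `train_tiny.py`.

```python
import math


def lr_at(step, total_steps=50_000, peak_lr=5e-4, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


sequences_seen = 50_000 * 16  # steps × effective batch size

for step in (0, 4_999, 27_500, 50_000):
    print(step, lr_at(step))
print("sequences seen:", sequences_seen)
```

With these settings the warmup covers the first 5,000 steps, and the rate is back down to half the peak at step 27,500.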

Files

| File | Size | Description |
|---|---|---|
| model_tiny.py | 16KB | Full architecture: VCR, RPW, GPP, GQA, TinyModel, QLoRA |
| train_tiny.py | 21KB | Training loop: IterableDataset, CFT, checkpoint save |
| train_tiny.yaml | 0.8KB | Training config: LR, batch, QLoRA, CFT settings |
| best.pt | 2.6MB | Best checkpoint (QLoRA, NF4 quantized) |
| best_fp32.pt | 3.8MB | Dequantized fp32 checkpoint (~970K params) |
| dequantize_qlora.py | 2KB | Utility to dequantize QLoRA → fp32 |
| gen_icon.py | 3KB | Project icon generator (neural network visualization) |
| icon.png | 66KB | Project icon (512×512, neural network + LT logo) |
| tokenizer.json | 125KB | BPE tokenizer (4096 vocab, 3874 merges) |
| tokenizer_config.json | 0.6KB | Tokenizer config with chat template |
| gen_tokenizer.py | 3.5KB | BPE tokenizer trainer (58 textbooks) |
| infer_gguf.py | 16KB | Inference: GGUF + QLoRA + V3 checkpoint |
| quantize_gguf.py | 4KB | Export to GGUF format |
| prepare_tiny_data.py | 12KB | Data preparation utilities |
| config.json | 0.4KB | HF AutoMap config for TinyModel |
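
As a sanity check on the checkpoint sizes, the raw weight storage can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: the NF4 figure is a lower bound that assumes every weight is stored at 4 bits, and the file list does not say what else each checkpoint contains.

```python
params = 969_880

fp32_bytes = params * 4   # 4 bytes per float32 weight
nf4_bytes = params * 0.5  # 4 bits per weight if everything were NF4 (lower bound)

print(f"fp32 weights:    {fp32_bytes / 1e6:.2f} MB")
print(f"NF4 lower bound: {nf4_bytes / 1e6:.2f} MB")
```

The fp32 estimate lines up with the size listed for `best_fp32.pt`. The listed size of `best.pt` is well above the NF4 lower bound, so that file presumably holds more than 4-bit weights alone.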

Usage

Load Model (FP32)

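import torch
from model_tiny import TinyModel

model = TinyModel()
# map_location="cpu" lets the checkpoint load on machines without a GPU
model.load_state_dict(torch.load("best_fp32.pt", map_location="cpu"))
model.eval()  # disable dropout and other training-only behavior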

Load Model (QLoRA)

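import torch
from model_tiny import TinyModel, apply_qlora

model = TinyModel()
# Wrap the model with QLoRA adapters first so the state-dict keys match the checkpoint
model = apply_qlora(model, r=8, alpha=16)
model.load_state_dict(torch.load("best.pt", map_location="cpu"))
model.eval()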
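
For readers unfamiliar with the `r` and `alpha` arguments, the sketch below shows the standard LoRA merge rule, W + (alpha / r) · B · A, on a toy matrix. It illustrates the general technique only; how `apply_qlora` and `dequantize_qlora.py` implement it is not shown in this card, and `matmul` and `lora_merge` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, lora_a, lora_b, r, alpha):
    """Return W + (alpha / r) * B @ A, the standard LoRA merged weight."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [wv + scale * dv for wv, dv in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: 2x2 weight with a rank-1 adapter (the model card uses r=8, alpha=16).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (r, in_features) = (1, 2)
lora_b = [[0.5], [0.0]]  # shape (out_features, r) = (2, 1)

merged = lora_merge(w, lora_a, lora_b, r=1, alpha=2)
print(merged)
```

With r=8 and alpha=16 the scale factor alpha / r is 2, the same as in this toy example.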

Inference

python infer_gguf.py --checkpoint best.pt --prompt "What is 2 + 3?"

Train from Scratch

python train_tiny.py  # reads config/train_tiny.yaml

Key Innovations

  1. VCR (Variance-Controlled Residual): Projects hidden → 96-dim code → hidden. Forces information through a bottleneck, regularizing the residual connections. R² gating controls information flow.

  2. RPW (Relative Positional Warp): A 2D rotation matrix W_φ encodes relative position as a continuous rotation. No absolute position is needed.

  3. GPP (Gated Positional Projection): Learned mixing weights combine positional and content information. Gate = σ(x @ W_mix).

  4. Combined: VCR + RPW + GPP appear in every block. Not just attention: the entire feed-forward path is position-aware.
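
Two of the ideas above can be illustrated with a small pure-Python sketch: rotating vectors by a position-proportional angle so that their dot product depends only on the offset (the RPW idea), and a sigmoid gate that mixes content with positional information (the GPP idea). This is a guess at the general mechanism, not the model's code. The fixed frequency `theta`, the scalar gate, and the convex mixing formula are all assumptions.

```python
import math


def rotate(pair, angle):
    """Apply a 2D rotation by `angle` radians to a 2-vector."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_mix(content, positional, w_mix):
    """gate = sigmoid(content @ w_mix); output = gate * positional + (1 - gate) * content."""
    gate = sigmoid(dot(content, w_mix))
    return [gate * p + (1.0 - gate) * c for c, p in zip(content, positional)]


theta = 0.3  # hypothetical rotation frequency (learned in the real model)
u, v = (1.0, 2.0), (0.5, -1.0)

# Rotating by position-proportional angles makes the dot product depend only on the offset.
score_a = dot(rotate(u, 3 * theta), rotate(v, 7 * theta))    # positions 3 and 7
score_b = dot(rotate(u, 10 * theta), rotate(v, 14 * theta))  # positions 10 and 14, same offset

# A zero mixing vector gives gate = 0.5, so the output averages content and positional parts.
mixed = gated_mix(list(u), list(rotate(u, 3 * theta)), w_mix=[0.0, 0.0])

print(score_a, score_b)
print(mixed)
```

The two scores match because rotations compose: only the angle difference between the two positions survives in the dot product, which is why no absolute position is required.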

License

Apache-2.0
