Lumia Tiny (PCT-V3)

Custom PyTorch language model with 969,880 parameters (~970K). Architecture built from first principles, not copied from existing papers.

Architecture Overview

Core Components

Component	Name	Description
VCR	Variance-Controlled Residual	96-dim bottleneck with R² gating. Regularizes residual connections by projecting through low-rank space.
RPW	Relative Positional Warp	Learned 2D Fourier rotation matrix. Encodes relative position as continuous rotation in hidden space.
GPP	Gated Positional Projection	Position-aware gating with learned mixing weights. Combines positional and content information.
ALiBi	Attention with Linear Biases	Linear distance-based attention bias. No learned positional embeddings needed.
GQA	Grouped Query Attention	8 query heads, 4 KV heads. KV heads shared across query groups for efficiency.
RMSNorm	Root Mean Square Normalization	Layer normalization without mean centering. Faster than LayerNorm.
SiLU	Sigmoid Linear Unit	SwiGLU activation in MLP. Smooth gating for better gradient flow.

Model Specifications

Parameters:     969,880 (0.97M)
Vocab:          4,096 (BPE, 58 textbooks)
Hidden:         128
Layers:         6
Heads:          8 query / 4 KV
Head dim:       16
Code dim:       96 (VCR bottleneck)
Max seq len:    2,048
Tied embeds:    Yes (token_embed = lm_head)

Architecture Diagram

Input tokens
    │
    ▼
[Token Embedding] (4096 × 128)
    │
    ▼
┌─────────────────────────────────────────┐
│           Transformer Block ×6          │
│  ┌─────────────────────────────────┐    │
│  │  RMSNorm → GQA Attention       │    │
│  │  (ALiBi bias, GQA 8/4)         │    │
│  │      ↓                         │    │
│  │  VCR: hidden → 96 → hidden     │    │
│  │  (variance-controlled)         │    │
│  │      ↓                         │    │
│  │  Residual Add                  │    │
│  └─────────────────────────────────┘    │
│                 │                        │
│  ┌─────────────────────────────────┐    │
│  │  RMSNorm → SwiGLU MLP          │    │
│  │  (gate × up → down)            │    │
│  │      ↓                         │    │
│  │  RPW: relative position warp   │    │
│  │  GPP: gated positional proj    │    │
│  │      ↓                         │    │
│  │  Residual Add                  │    │
│  └─────────────────────────────────┘    │
└─────────────────────────────────────────┘
    │
    ▼
[RMSNorm] → [LM Head] → Logits

Training

Dataset: AI-MO/NuminaMath-CoT (math reasoning with CoT)
Method: QLoRA (NF4 quantization + LoRA r=8/α=16)
Optimizer: AdamW, LR 5e-4, cosine schedule, warmup 10%
Steps: 50,000 (effective batch 16)
Tokenizer: BPE trained on 58 Project Gutenberg textbooks

Files

File	Size	Description
`model_tiny.py`	16KB	Full architecture: VCR, RPW, GPP, GQA, TinyModel, QLoRA
`train_tiny.py`	21KB	Training loop: IterableDataset, CFT, checkpoint save
`train_tiny.yaml`	0.8KB	Training config: LR, batch, QLoRA, CFT settings
`best.pt`	2.6MB	Best checkpoint (QLoRA, NF4 quantized)
`best_fp32.pt`	3.8MB	Dequantized fp32 checkpoint (~970K params)
`dequantize_qlora.py`	2KB	Utility to dequantize QLoRA → fp32
`gen_icon.py`	3KB	Project icon generator (neural network visualization)
`icon.png`	66KB	Project icon (512×512, neural network + LT logo)
`tokenizer.json`	125KB	BPE tokenizer (4096 vocab, 3874 merges)
`tokenizer_config.json`	0.6KB	Tokenizer config with chat template
`gen_tokenizer.py`	3.5KB	BPE tokenizer trainer (58 textbooks)
`infer_gguf.py`	16KB	Inference: GGUF + QLoRA + V3 checkpoint
`quantize_gguf.py`	4KB	Export to GGUF format
`prepare_tiny_data.py`	12KB	Data preparation utilities
`config.json`	0.4KB	HF AutoMap config for TinyModel

Usage

Load Model (FP32)

from model_tiny import TinyModel

model = TinyModel()
model.load_state_dict(torch.load("best_fp32.pt"))
model.eval()

Load Model (QLoRA)

from model_tiny import TinyModel, apply_qlora

model = TinyModel()
model = apply_qlora(model, r=8, alpha=16)
model.load_state_dict(torch.load("best.pt"))
model.eval()

Inference

python infer_gguf.py --checkpoint best.pt --prompt "What is 2 + 3?"

Train from Scratch

python train_tiny.py  # reads config/train_tiny.yaml

Key Innovations

VCR (Variance-Controlled Residual): Projects hidden → 96-dim code → hidden. Forces information through bottleneck, regularizing residual connections. R² gating controls information flow.
RPW (Relative Positional Warp): 2D rotation matrix W_φ encodes relative position as continuous rotation. No absolute position needed.
GPP (Gated Positional Projection): Learned mixing weights combine positional and content information. Gate = σ(x @ W_mix).
Combined: VCR + RPW + GPP in every block. Not just attention — entire feed-forward path is position-aware.

License

Apache-2.0

Downloads last month: 184

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support