Lumia Tiny (PCT-V3)
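Custom PyTorch language model with 969,880 parameters (~970K). The custom components (VCR, RPW, GPP) were designed from first principles rather than copied from existing papers, and are combined with standard building blocks (ALiBi, GQA, RMSNorm, SwiGLU).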
Architecture Overview
Core Components
| Component | Name | Description |
|---|---|---|
| VCR | Variance-Controlled Residual | 96-dim bottleneck with R² gating. Regularizes residual connections by projecting through low-rank space. |
| RPW | Relative Positional Warp | Learned 2D Fourier rotation matrix. Encodes relative position as continuous rotation in hidden space. |
| GPP | Gated Positional Projection | Position-aware gating with learned mixing weights. Combines positional and content information. |
| ALiBi | Attention with Linear Biases | Linear distance-based attention bias. No learned positional embeddings needed. |
| GQA | Grouped Query Attention | 8 query heads, 4 KV heads. KV heads shared across query groups for efficiency. |
| RMSNorm | Root Mean Square Normalization | Layer normalization without mean centering. Faster than LayerNorm. |
| SiLU | Sigmoid Linear Unit | Gate activation inside the SwiGLU MLP. Smooth gating for better gradient flow. |
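The ALiBi row can be made concrete with a small sketch. It assumes the standard geometric slope schedule from the ALiBi paper for a power-of-two head count and folds the causal mask into the bias; whether `model_tiny.py` uses exactly these slopes is not stated here, and the function names are hypothetical.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Standard ALiBi slopes for a power-of-two head count (assumed schedule)."""
    # Head i (1-indexed) gets slope 2^(-8 * i / num_heads).
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]


def alibi_bias(num_heads: int, seq_len: int) -> list[list[list[float]]]:
    """bias[h][i][j] = -slope_h * (i - j) for j <= i, and -inf for future positions."""
    neg_inf = float("-inf")
    return [
        [
            [-m * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


slopes = alibi_slopes(8)      # 8 query heads, as in the table above
bias = alibi_bias(8, 4)       # tiny sequence length, for inspection only
print(slopes[0], slopes[-1])  # steepest and shallowest slope
print(bias[0][3][:3])         # penalties grow linearly with distance
```

The bias is added to the attention scores before the softmax, so more distant keys are penalized more strongly and no learned positional embedding table is required.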
Model Specifications
Parameters: 969,880 (0.97M)
Vocab: 4,096 (BPE, 58 textbooks)
Hidden: 128
Layers: 6
Heads: 8 query / 4 KV
Head dim: 16
Code dim: 96 (VCR bottleneck)
Max seq len: 2,048
Tied embeds: Yes (token_embed = lm_head)
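The arithmetic behind these numbers can be checked directly. The sketch below uses only the values listed above; the variable names are illustrative, and it makes no claim about how the non-embedding parameters are split across individual layers.

```python
VOCAB, HIDDEN, LAYERS = 4096, 128, 6
Q_HEADS, KV_HEADS = 8, 4
TOTAL_PARAMS = 969_880

head_dim = HIDDEN // Q_HEADS                    # 128 / 8 = 16, matching "Head dim"
kv_width = KV_HEADS * head_dim                  # total width of the K (or V) projection output
embed_params = VOCAB * HIDDEN                   # counted once, because lm_head is tied
non_embed_params = TOTAL_PARAMS - embed_params  # everything outside the embedding table

print(head_dim, kv_width)
print(embed_params, non_embed_params)
print(f"embedding share: {embed_params / TOTAL_PARAMS:.1%}")
```

With tied embeddings, the 4,096 × 128 table accounts for a little over half of the total, so the six blocks and the final norm share the remaining 445,592 parameters.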
Architecture Diagram
Input tokens
      │
      ▼
[Token Embedding] (4096 × 128)
      │
      ▼
┌─────────────────────────────────────────┐
│ Transformer Block × 6                   │
│  ┌───────────────────────────────────┐  │
│  │ RMSNorm → GQA Attention           │  │
│  │ (ALiBi bias, GQA 8/4)             │  │
│  │           ↓                       │  │
│  │ VCR: hidden → 96 → hidden         │  │
│  │ (variance-controlled)             │  │
│  │           ↓                       │  │
│  │ Residual Add                      │  │
│  └───────────────────────────────────┘  │
│              ↓                          │
│  ┌───────────────────────────────────┐  │
│  │ RMSNorm → SwiGLU MLP              │  │
│  │ (gate × up → down)                │  │
│  │           ↓                       │  │
│  │ RPW: relative position warp       │  │
│  │ GPP: gated positional proj        │  │
│  │           ↓                       │  │
│  │ Residual Add                      │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
      │
      ▼
[RMSNorm] → [LM Head] → Logits
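The 8-query / 4-KV head layout in the attention step can be sketched as follows. This is the generic grouped-query attention pattern in PyTorch, not code taken from `model_tiny.py`; the function name `gqa_attention` and the tensor layout are assumptions made for illustration.

```python
import torch


def gqa_attention(q, k, v, bias=None):
    """Causal grouped-query attention.

    q: (batch, q_heads, seq, head_dim); k, v: (batch, kv_heads, seq, head_dim),
    where q_heads is a multiple of kv_heads.
    """
    group = q.shape[1] // k.shape[1]            # query heads per KV head (8 / 4 = 2)
    k = k.repeat_interleave(group, dim=1)       # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if bias is not None:                        # e.g. an ALiBi bias of shape (q_heads, seq, seq)
        scores = scores + bias
    seq = q.shape[2]
    causal = torch.ones(seq, seq, dtype=torch.bool, device=q.device).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
q = torch.randn(2, 8, 5, 16)   # 8 query heads, head dim 16
k = torch.randn(2, 4, 5, 16)   # 4 KV heads
v = torch.randn(2, 4, 5, 16)
out = gqa_attention(q, k, v)
print(out.shape)               # torch.Size([2, 8, 5, 16])
```

Only the K and V projections shrink under this scheme: query heads 0 and 1 read from KV head 0, heads 2 and 3 from KV head 1, and so on.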
Training
- Dataset: AI-MO/NuminaMath-CoT (math reasoning with CoT)
- Method: QLoRA (NF4 quantization + LoRA r=8/α=16)
- Optimizer: AdamW, LR 5e-4, cosine schedule, warmup 10%
- Steps: 50,000 (effective batch 16)
- Tokenizer: BPE trained on 58 Project Gutenberg textbooks
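The optimizer line implies a learning-rate curve along these lines. This is a minimal sketch that assumes linear warmup over the first 10% of steps and cosine decay to zero afterwards; the warmup shape and the final learning rate are not specified above, and `lr_at` is a hypothetical helper rather than part of `train_tiny.py`.

```python
import math


def lr_at(step, total_steps=50_000, peak_lr=5e-4, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay (assumed shape)."""
    warmup_steps = round(total_steps * warmup_frac)   # 5,000 steps for the values above
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 5e-4 at step 5,000, falls to half of that midway through the decay phase (step 27,500), and reaches zero at step 50,000.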
Files
| File | Size | Description |
|---|---|---|
| model_tiny.py | 16KB | Full architecture: VCR, RPW, GPP, GQA, TinyModel, QLoRA |
| train_tiny.py | 21KB | Training loop: IterableDataset, CFT, checkpoint save |
| train_tiny.yaml | 0.8KB | Training config: LR, batch, QLoRA, CFT settings |
| best.pt | 2.6MB | Best checkpoint (QLoRA, NF4 quantized) |
| best_fp32.pt | 3.8MB | Dequantized fp32 checkpoint (~970K params) |
| dequantize_qlora.py | 2KB | Utility to dequantize QLoRA → fp32 |
| gen_icon.py | 3KB | Project icon generator (neural network visualization) |
| icon.png | 66KB | Project icon (512×512, neural network + LT logo) |
| tokenizer.json | 125KB | BPE tokenizer (4096 vocab, 3874 merges) |
| tokenizer_config.json | 0.6KB | Tokenizer config with chat template |
| gen_tokenizer.py | 3.5KB | BPE tokenizer trainer (58 textbooks) |
| infer_gguf.py | 16KB | Inference: GGUF + QLoRA + V3 checkpoint |
| quantize_gguf.py | 4KB | Export to GGUF format |
| prepare_tiny_data.py | 12KB | Data preparation utilities |
| config.json | 0.4KB | HF AutoMap config for TinyModel |
Usage
Load Model (FP32)
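import torch
from model_tiny import TinyModel

model = TinyModel()
# map_location="cpu" lets the checkpoint load on machines without a GPU
model.load_state_dict(torch.load("best_fp32.pt", map_location="cpu"))
model.eval()  # switch to inference mode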
Load Model (QLoRA)
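import torch
from model_tiny import TinyModel, apply_qlora

model = TinyModel()
# Wrap the model with QLoRA adapters first so the checkpoint keys match
model = apply_qlora(model, r=8, alpha=16)
model.load_state_dict(torch.load("best.pt", map_location="cpu"))
model.eval()  # switch to inference mode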
Inference
python infer_gguf.py --checkpoint best.pt --prompt "What is 2 + 3?"
Train from Scratch
python train_tiny.py # reads config/train_tiny.yaml
Key Innovations
VCR (Variance-Controlled Residual): Projects hidden → 96-dim code → hidden. Forcing information through the bottleneck regularizes the residual connections, and R² gating controls information flow.
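The bottleneck path can be illustrated with a short PyTorch sketch. The hidden → 96 → hidden projection follows the description above, but the gate is only one possible reading of "R² gating": it uses the per-token coefficient of determination between the branch output and its low-rank reconstruction, clamped to [0, 1]. That interpretation, the class name `BottleneckResidual`, and the bias-free linear layers are all assumptions, not the implementation in `model_tiny.py`.

```python
import torch
import torch.nn as nn


class BottleneckResidual(nn.Module):
    """Low-rank residual branch with an R²-style gate (illustrative assumption)."""

    def __init__(self, hidden: int = 128, code: int = 96):
        super().__init__()
        self.down = nn.Linear(hidden, code, bias=False)   # hidden -> 96-dim code
        self.up = nn.Linear(code, hidden, bias=False)     # code -> hidden

    def forward(self, residual: torch.Tensor, branch: torch.Tensor) -> torch.Tensor:
        recon = self.up(self.down(branch))
        # Per-token R² of the reconstruction: 1 - SS_res / SS_tot.
        ss_res = (branch - recon).pow(2).sum(-1, keepdim=True)
        ss_tot = (branch - branch.mean(-1, keepdim=True)).pow(2).sum(-1, keepdim=True)
        gate = (1.0 - ss_res / (ss_tot + 1e-6)).clamp(0.0, 1.0)
        # Poorly reconstructed tokens contribute less to the residual stream.
        return residual + gate * recon


torch.manual_seed(0)
vcr = BottleneckResidual()
residual = torch.randn(2, 5, 128)
branch = torch.randn(2, 5, 128)    # stand-in for an attention output
out = vcr(residual, branch)
print(out.shape)                   # torch.Size([2, 5, 128])
```

In this sketch the branch can only add what survives the 96-dimensional code, which is the regularizing effect the paragraph describes.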
RPW (Relative Positional Warp): A learned 2D rotation matrix W encodes relative position as a continuous rotation in hidden space. No absolute position is needed.
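The reason a rotation can encode relative position is a general property of 2D rotations: rotating two vectors by angles proportional to their positions leaves their dot product dependent only on the position difference. The sketch below demonstrates that property in plain Python. It is not the RPW implementation, and the fixed frequency `OMEGA` is a hypothetical stand-in for whatever the learned matrix represents.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


OMEGA = 0.1                      # hypothetical rotation frequency (radians per position)
q, k = (1.0, 2.0), (0.5, -1.0)   # arbitrary 2D feature pair


def score(pos_q, pos_k):
    """Dot product after rotating q and k by their own positions."""
    return dot(rotate(q, OMEGA * pos_q), rotate(k, OMEGA * pos_k))


# Same offset (4) at very different absolute positions gives the same score.
print(math.isclose(score(3, 7), score(103, 107)))   # True
# A different offset changes the score.
print(math.isclose(score(3, 7), score(3, 8)))       # False
```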
GPP (Gated Positional Projection): Learned mixing weights combine positional and content information. Gate = σ(x @ W_mix).
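The gate formula can be sketched in PyTorch as follows. Only Gate = σ(x @ W_mix) comes from the description above; the convex blend of a positional and a content tensor, the class name `GatedMix`, and the square shape of `w_mix` are assumptions about how the gate might be applied.

```python
import torch
import torch.nn as nn


class GatedMix(nn.Module):
    """Sigmoid gate that blends positional and content features (illustrative)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # nn.Linear stores the transpose, so this computes x @ W_mix.
        self.w_mix = nn.Linear(hidden, hidden, bias=False)

    def forward(self, content: torch.Tensor, positional: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.w_mix(content))            # Gate = σ(x @ W_mix)
        return gate * positional + (1.0 - gate) * content    # assumed mixing rule


torch.manual_seed(0)
gpp = GatedMix()
content = torch.randn(2, 5, 128)
positional = torch.randn(2, 5, 128)   # stand-in for a position-warped signal
mixed = gpp(content, positional)
print(mixed.shape)                    # torch.Size([2, 5, 128])
```

Because the gate is computed per feature from the content itself, each token can decide how much positional signal to admit in each dimension.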
Combined: VCR, RPW, and GPP appear in every block, so the entire feed-forward path is position-aware, not just attention.
License
Apache-2.0