---
license: gpl-3.0
datasets:
- HuggingFaceFW/fineweb-edu
- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
- TeichAI/Step-3.5-Flash-2600x
- TeichAI/convo-v1
language:
- en
tags:
- small
- glint
new_version: CompactAI-O/Glint-0.3
---

# Glint-0.2

> The pipe character incident. We do not talk about the pipe character incident.

Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`. Progress is not a straight line.

## What you get

| File | What it is |
| :--- | :--- |
| `tokenizer.json` | Hybrid word/char tokenizer (~2,133 tokens) |
| `pretrain.pt` | Base pretrained checkpoint |
| `model.pt` | Instruction-tuned checkpoint (SFT) |
| `samples.jsonl` | Sample generations with metrics at checkpoints |
| `loss_curve.png` | Training loss across all phases |

## Specs

| Thing | Value |
| :--- | :--- |
| **Architecture** | Transformer Decoder (GQA) |
| **Parameters** | ~700K |
| **Context** | 2,048 tokens |
| **Sliding Window** | 512 tokens |
| **d_model** | 128 |
| **Unique Layers** | 8 (tied to make 16 logical) |
| **Heads** | 4 |
| **KV Heads** | 2 |
| **FFN** | 224 |
| **Vocab** | ~2,133 (Hybrid Char + Word) |
| **Norm** | RMSNorm |
| **Position** | RoPE (25% fraction) |
| **Activation** | SwiGLU |
| **Multi-Token Prediction** | Horizons 2, 3, 4 |

## Fancy tricks

- **Weight-tied layers:** 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
- **GQA:** 4 attention heads sharing 2 KV heads. Less cache, less compute.
- **Sliding window:** 512 tokens local, with periodic global layers for long-range context.
- **MTP:** Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
- **Hybrid tokenizer:** Word-level where possible, char fallback for the weird stuff.
- **Word token loss boost:** 3x loss on multi-character tokens so the model actually learns words.
- **Response-start weighting:** First 20 tokens of assistant responses get 3x weight.

## Training

| Thing | Value |
| :--- | :--- |
| **Batch Size** | 48 |
| **Pretrain LR** | 8e-4 (min 1e-5) |
| **SFT LR** | 2e-4 (min 1e-5) |
| **Warmup** | 300 steps |
| **Weight Decay** | 0.02 |
| **Max Grad Norm** | 1.0 |
| **Checkpoint** | Every 1,000 steps |
| **Sampling** | Every 5,000 steps |

## Loss curve

![loss](model/loss_curve.png)

## Limitations

- Repeats itself.
- Knows almost nothing.
- Research only. Not an assistant.
- Sometimes hallucinates pipes.

---

*Built by [CompactAI](https://huggingface.co/CompactAI-O). We learn by failing.*
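
## Sketch: layer and attention layout

The weight tying, sliding window, and GQA items under "Fancy tricks" can be pictured with a few lines of plain Python. This is a rough sketch, not the Glint-0.2 source: the function names are made up for illustration, the tying order (cycling through the 8 blocks twice) is an assumption, "every 3rd layer" is read as logical layers 2, 5, 8, ... (0-indexed), and the window is assumed to include the current token.

```python
# Illustrative sketch only. Numbers come from the spec table;
# the tying order, global-layer indexing, and window edge are assumptions.

N_UNIQUE = 8      # unique transformer blocks
N_LOGICAL = 16    # logical layers after weight tying
WINDOW = 512      # sliding-window size in tokens
N_HEADS = 4       # query heads
N_KV_HEADS = 2    # shared key/value heads


def block_index(layer: int) -> int:
    """Assumed tying scheme: cycle through the unique blocks (0..7, then 0..7 again)."""
    return layer % N_UNIQUE


def is_global(layer: int) -> bool:
    """Assumed reading of 'every 3rd layer': layers 2, 5, 8, ... use global attention."""
    return (layer + 1) % 3 == 0


def can_attend(layer: int, query_pos: int, key_pos: int) -> bool:
    """Causal mask; local layers only see the most recent WINDOW positions."""
    if key_pos > query_pos:
        return False              # never look at future tokens
    if is_global(layer):
        return True               # global layers see the whole prefix
    return query_pos - key_pos < WINDOW


def kv_head_for(query_head: int) -> int:
    """GQA: consecutive query heads share one KV head (4 query heads -> 2 KV heads)."""
    group_size = N_HEADS // N_KV_HEADS
    return query_head // group_size


# (logical layer, shared block, attention kind) for all 16 logical layers
schedule = [
    (layer, block_index(layer), "global" if is_global(layer) else "local")
    for layer in range(N_LOGICAL)
]
```

Printing `schedule` shows which shared block each logical layer reuses and whether it is local or global. Under these assumptions, 5 of the 16 logical layers are global, and the KV cache holds 2 heads per layer instead of 4.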
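
## Sketch: loss weighting

The word token boost, response-start weighting, and MTP weight from "Fancy tricks" are all just multipliers on the loss. The sketch below shows one plausible way they could fit together. It is an assumption-heavy illustration: the function names are hypothetical, the two 3x boosts are assumed to multiply when both apply, the weighted mean is assumed to normalize by the sum of weights, and 0.3 is assumed to scale each extra MTP head's loss.

```python
# Illustrative sketch only. The constants come from the model card;
# how they are combined is an assumption, not the actual training code.

WORD_BOOST = 3.0              # multi-character (word) tokens
RESPONSE_START_BOOST = 3.0    # first tokens of an assistant response
RESPONSE_START_TOKENS = 20
MTP_WEIGHT = 0.3              # weight on the extra prediction heads


def token_weights(tokens, response_start=None):
    """Per-token loss weights.

    tokens: target tokens as strings.
    response_start: index of the first assistant token, or None for pretraining text.
    Note: any multi-character token is boosted here, which would include
    special tokens unless they are filtered out beforehand.
    """
    weights = []
    for i, tok in enumerate(tokens):
        w = 1.0
        if len(tok) > 1:
            w *= WORD_BOOST
        if response_start is not None and 0 <= i - response_start < RESPONSE_START_TOKENS:
            w *= RESPONSE_START_BOOST
        weights.append(w)
    return weights


def weighted_mean(losses, weights):
    """Weighted average of per-token losses (assumed normalization)."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


def total_loss(next_token_loss, mtp_losses):
    """Main loss plus down-weighted losses from the heads at offsets 2, 3, 4."""
    return next_token_loss + MTP_WEIGHT * sum(mtp_losses)
```

For example, `token_weights(["h", "i", "hello"])` gives `[1.0, 1.0, 3.0]`: the char-fallback tokens keep weight 1 and the word token is boosted.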
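
## Sketch: learning rate schedule

The training table gives a peak LR, a minimum LR, and 300 warmup steps, but not the decay shape or the total step count. The sketch below assumes linear warmup followed by cosine decay, which is a common choice and only a guess here. `lr_at` and `total_steps` are hypothetical names, and the default of 10,000 steps is a placeholder, not a reported number.

```python
import math


def lr_at(step, peak_lr=8e-4, min_lr=1e-5, warmup=300, total_steps=10_000):
    """Learning rate at a given step.

    Assumed shape: linear warmup to peak_lr, then cosine decay to min_lr.
    total_steps is a placeholder; the model card does not state it.
    """
    if step < warmup:
        # Linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the defaults this mirrors the pretraining row (8e-4 down to 1e-5). Calling `lr_at(step, peak_lr=2e-4)` would give the SFT row under the same assumed shape.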