---
license: gpl-3.0
datasets:
- HuggingFaceFW/fineweb-edu
- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
- TeichAI/Step-3.5-Flash-2600x
- TeichAI/convo-v1
language:
- en
tags:
- small
- glint
new_version: CompactAI-O/Glint-0.3
---

# Glint-0.2

> The pipe character incident. We do not talk about the pipe character incident.

Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, and multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.

Progress is not a straight line.

## What you get

| File | What it is |
| :--- | :--- |
| `tokenizer.json` | Hybrid word/char tokenizer (~2,133 tokens) |
| `pretrain.pt` | Base pretrained checkpoint |
| `model.pt` | Instruction-tuned checkpoint (SFT) |
| `samples.jsonl` | Sample generations with metrics at checkpoints |
| `loss_curve.png` | Training loss across all phases |
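
If you want to poke at `samples.jsonl`, a generic JSON Lines reader is enough to get started. This is a minimal sketch: `read_jsonl` is a hypothetical helper, not something shipped with the model, and the field names inside each record are not documented here, so the code makes no assumptions about them.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, one parsed value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("samples.jsonl")` and printing the first record is a quick way to see what the file actually contains.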

## Specs

| Thing | Value |
| :--- | :--- |
| **Architecture** | Transformer Decoder (GQA) |
| **Parameters** | ~700K |
| **Context** | 2,048 tokens |
| **Sliding Window** | 512 tokens |
| **d_model** | 128 |
| **Unique Layers** | 8 (tied to make 16 logical) |
| **Heads** | 4 |
| **KV Heads** | 2 |
| **FFN** | 224 |
| **Vocab** | ~2,133 (Hybrid Char + Word) |
| **Norm** | RMSNorm |
| **Position** | RoPE (25% fraction) |
| **Activation** | SwiGLU |
| **Multi-Token Prediction** | Horizons 2, 3, 4 |
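
A few derived numbers fall out of the table above. The sketch below works them out. It assumes the RoPE fraction applies to each head's dimension and that heads split `d_model` evenly; neither is confirmed here. `GlintSpec` is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlintSpec:
    d_model: int = 128
    n_heads: int = 4
    n_kv_heads: int = 2
    rope_fraction: float = 0.25  # assumption: fraction of each head's dims

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    @property
    def queries_per_kv_head(self) -> int:
        return self.n_heads // self.n_kv_heads

    @property
    def kv_width(self) -> int:
        # Output width of the K (or V) projection under GQA.
        return self.n_kv_heads * self.head_dim

    @property
    def rotary_dims(self) -> int:
        return int(self.head_dim * self.rope_fraction)

    @property
    def kv_cache_values_per_token(self) -> int:
        # One K vector plus one V vector, per token, per logical layer.
        return 2 * self.kv_width


spec = GlintSpec()
print(spec.head_dim, spec.queries_per_kv_head, spec.kv_width,
      spec.rotary_dims, spec.kv_cache_values_per_token)
# prints: 32 2 64 8 128
```

Full multi-head attention at the same width would cache 2 × 128 = 256 values per token per layer, so 2 KV heads halve that.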

## Fancy tricks

- **Weight-tied layers:** 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
- **GQA:** 4 query heads sharing 2 KV heads. Half the KV cache of full multi-head attention, and smaller K/V projections.
- **Sliding window:** 512 tokens local, with periodic global layers for long-range context.
- **MTP:** Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
- **Hybrid tokenizer:** Word-level where possible, char fallback for the weird stuff.
- **Word token loss boost:** 3x loss on multi-character tokens so the model actually learns words.
- **Response-start weighting:** First 20 tokens of assistant responses get 3x weight.
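
Some of the tricks above are easier to see as code. The following is a rough sketch, not the actual training code, and it leans on several assumptions: logical layer `i` reuses block `i % 8`, "every 3rd layer" means logical layers 3, 6, 9, and so on (1-based), attention is causal, and the two 3x loss weights multiply when both apply. All function names are hypothetical.

```python
N_UNIQUE = 8       # unique transformer blocks
N_LOGICAL = 16     # logical layers after weight tying
WINDOW = 512       # sliding-window size in tokens
GLOBAL_EVERY = 3   # every 3rd logical layer uses global attention


def layer_plan():
    """Return (block_index, attention_kind) for each logical layer."""
    plan = []
    for i in range(N_LOGICAL):
        block = i % N_UNIQUE                      # assumption: cyclic reuse
        is_global = (i + 1) % GLOBAL_EVERY == 0   # assumption: 1-based count
        plan.append((block, "global" if is_global else "sliding"))
    return plan


def can_attend(query_pos, key_pos, kind):
    """Causal attention rule for one query/key pair."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    if kind == "global":
        return True
    return query_pos - key_pos < WINDOW  # only the most recent WINDOW tokens


def token_loss_weight(token, response_index,
                      word_boost=3.0, start_boost=3.0, start_len=20):
    """Loss weight for one target token.

    response_index is the token's position inside an assistant response,
    or None if the token is not part of one.
    """
    weight = 1.0
    if len(token) > 1:  # multi-character (word-level) token
        weight *= word_boost
    if response_index is not None and response_index < start_len:
        weight *= start_boost  # assumption: boosts multiply when both apply
    return weight
```

Under these assumptions, 5 of the 16 logical layers are global (logical layers 3, 6, 9, 12, 15), and a word token at the start of a response gets a 9x weight.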

## Training

| Thing | Value |
| :--- | :--- |
| **Batch Size** | 48 |
| **Pretrain LR** | 8e-4 (min 1e-5) |
| **SFT LR** | 2e-4 (min 1e-5) |
| **Warmup** | 300 steps |
| **Weight Decay** | 0.02 |
| **Max Grad Norm** | 1.0 |
| **Checkpoint** | Every 1,000 steps |
| **Sampling** | Every 5,000 steps |
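
The table gives a peak LR, a minimum LR, and a warmup length, but not the decay shape or the total step count. The sketch below assumes linear warmup followed by cosine decay, which is a common choice and not confirmed for this model. `lr_at` and `total_steps` are hypothetical names.

```python
import math


def lr_at(step, total_steps, peak_lr=8e-4, min_lr=1e-5, warmup_steps=300):
    """Learning rate at a given step: linear warmup, then cosine decay (assumed)."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

For the SFT phase the same function would be called with `peak_lr=2e-4`.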

## Loss curve

![loss](model/loss_curve.png)

## Limitations

- Repeats itself.
- Knows almost nothing.
- Research only. Not an assistant.
- Sometimes hallucinates pipes.

---

*Built by [CompactAI](https://huggingface.co/CompactAI-O). We learn by failing.*