---
license: gpl-3.0
datasets:
- HuggingFaceFW/fineweb-edu
- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
- TeichAI/Step-3.5-Flash-2600x
- TeichAI/convo-v1
language:
- en
tags:
- small
- glint
new_version: CompactAI-O/Glint-0.3
---

# Glint-0.2

> The pipe character incident. We do not talk about the pipe character incident.

Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.

Progress is not a straight line.

## What you get

| File | What it is |
| :--- | :--- |
| `tokenizer.json` | Hybrid word/char tokenizer (~2,133 tokens) |
| `pretrain.pt` | Base pretrained checkpoint |
| `model.pt` | Instruction-tuned checkpoint (SFT) |
| `samples.jsonl` | Sample generations with metrics at checkpoints |
| `loss_curve.png` | Training loss across all phases |

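How `tokenizer.json` splits text is not documented in this card. As a rough illustration of what a hybrid word/char scheme can look like, the sketch below uses a tiny made-up vocabulary: a known word becomes one token, and everything else falls back to one token per character. The `WORDS`, `CHARS`, and `VOCAB` contents, the `encode` helper, and the lowercasing are all assumptions for illustration, not the real tokenizer.

```python
import re

# Hypothetical toy vocabulary; the real tokenizer.json has ~2,133 entries.
WORDS = ["the", "cat", "sat"]
CHARS = list("abcdefghijklmnopqrstuvwxyz |!@.")
VOCAB = {tok: i for i, tok in enumerate(WORDS + CHARS)}

def encode(text: str) -> list[str]:
    """Split text into word tokens where possible, single characters otherwise."""
    tokens = []
    # Alternate between runs of letters and single non-letter characters.
    for piece in re.findall(r"[a-z]+|[^a-z]", text.lower()):
        if len(piece) > 1 and piece in VOCAB:
            tokens.append(piece)   # known word: one token
        else:
            tokens.extend(piece)   # char fallback: one token per character
    return tokens

print(encode("the cat sat"))
print(encode("the dog"))
```

In this toy version, `"dog"` is not in the word list, so it costs three tokens instead of one. That is the trade-off a small hybrid vocabulary makes: common words are cheap, everything else is spelled out.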
## Specs

| Thing | Value |
| :--- | :--- |
| **Architecture** | Transformer Decoder (GQA) |
| **Parameters** | ~700K |
| **Context** | 2,048 tokens |
| **Sliding Window** | 512 tokens |
| **d_model** | 128 |
| **Unique Layers** | 8 (tied to make 16 logical) |
| **Heads** | 4 |
| **KV Heads** | 2 |
| **FFN** | 224 |
| **Vocab** | ~2,133 (Hybrid Char + Word) |
| **Norm** | RMSNorm |
| **Position** | RoPE (25% fraction) |
| **Activation** | SwiGLU |
| **Multi-Token Prediction** | Horizons 2, 3, 4 |

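A few quantities follow directly from the table. The arithmetic below assumes that the head dimension is `d_model / heads` and that "25% fraction" means RoPE rotates a quarter of each head's dimensions. Neither is stated explicitly in this card, and `kv_cache_floats` is a helper made up for the comparison.

```python
# Values copied from the Specs table.
d_model, n_heads, n_kv_heads = 128, 4, 2
rope_fraction = 0.25
context, window = 2048, 512

head_dim = d_model // n_heads                # assumed: d_model split evenly across heads
queries_per_kv = n_heads // n_kv_heads       # query heads sharing each KV head
rotary_dims = int(head_dim * rope_fraction)  # assumed meaning of "25% fraction"

def kv_cache_floats(tokens: int, kv_heads: int) -> int:
    """Number of cached K and V values in one layer for `tokens` positions."""
    return 2 * tokens * kv_heads * head_dim

print(head_dim, queries_per_kv, rotary_dims)
print(kv_cache_floats(context, n_kv_heads))  # GQA, full-context layer
print(kv_cache_floats(context, n_heads))     # same layer with one KV head per query head
print(kv_cache_floats(window, n_kv_heads))   # GQA, layer limited to the sliding window
```

Under these assumptions, sharing KV heads halves the cache, and a layer restricted to the 512-token window needs a quarter of what a full-context layer does.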
## Fancy tricks

- **Weight-tied layers:** 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of the sliding window. Cheap and surprisingly effective.
- **GQA:** 4 attention heads sharing 2 KV heads, so the KV cache is half the size it would be with one KV head per attention head.
- **Sliding window:** 512 tokens local, with periodic global layers for long-range context.
- **MTP:** Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
- **Hybrid tokenizer:** Word-level where possible, char fallback for the weird stuff.
- **Word token loss boost:** 3x loss on multi-character tokens so the model actually learns words.
- **Response-start weighting:** First 20 tokens of assistant responses get 3x weight.

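The layer layout described in the list can be sketched as follows. Two details are guesses, because the card does not spell them out: that the 8 blocks are reused cyclically (block 0 to 7, then 0 to 7 again), and that "every 3rd layer" means layers 3, 6, 9, and so on when counting from 1. `layer_plan` and `can_attend` are hypothetical helper names.

```python
N_UNIQUE, N_LOGICAL = 8, 16
WINDOW, GLOBAL_EVERY = 512, 3

def layer_plan():
    """Return (logical layer, unique block index, attention kind) tuples."""
    plan = []
    for layer in range(N_LOGICAL):
        block = layer % N_UNIQUE                      # assumed: blocks reused cyclically
        is_global = (layer + 1) % GLOBAL_EVERY == 0   # assumed: layers 3, 6, 9, ... (1-based)
        plan.append((layer, block, "global" if is_global else "window"))
    return plan

def can_attend(query_pos: int, key_pos: int, kind: str) -> bool:
    """Causal visibility of key_pos from query_pos for one layer kind."""
    if key_pos > query_pos:
        return False                                  # never look at future tokens
    if kind == "global":
        return True                                   # whole prefix is visible
    return query_pos - key_pos < WINDOW               # only the most recent WINDOW tokens

for layer, block, kind in layer_plan():
    print(f"layer {layer:2d} -> block {block} ({kind})")
```

One consequence of these particular assumptions: the same block can run once with global attention and once with the sliding window, since 16 layers and a period of 3 do not line up with 8 blocks.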
## Training

| Thing | Value |
| :--- | :--- |
| **Batch Size** | 48 |
| **Pretrain LR** | 8e-4 (min 1e-5) |
| **SFT LR** | 2e-4 (min 1e-5) |
| **Warmup** | 300 steps |
| **Weight Decay** | 0.02 |
| **Max Grad Norm** | 1.0 |
| **Checkpoint** | Every 1,000 steps |
| **Sampling** | Every 5,000 steps |

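The table gives a peak and a minimum learning rate plus a warmup length, but not the shape of the decay or the total number of steps. A common choice that fits those numbers is linear warmup followed by cosine decay. The sketch below assumes exactly that, and both `learning_rate` and `total_steps` are placeholders rather than values from the actual run.

```python
import math

def learning_rate(step: int, total_steps: int, peak: float = 8e-4,
                  floor: float = 1e-5, warmup: int = 300) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor` (assumed shape)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(progress, 1.0)                     # hold at floor past the end
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

# total_steps=10_000 is a placeholder; the card does not give the run length.
for step in (0, 299, 300, 5_000, 10_000):
    print(step, f"{learning_rate(step, total_steps=10_000):.2e}")
```

For the SFT phase, the same function would be called with `peak=2e-4`.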
## Loss curve

![Loss Graph](loss_curve.png)

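When reading the curve, keep in mind that the training loss is not plain cross-entropy: the Fancy tricks section lists a 0.3 weight on the multi-token prediction heads, a 3x boost on multi-character tokens, and a 3x boost on the first 20 tokens of assistant responses. How these combine is not stated, and neither is whether the plot shows the weighted value. The sketch below assumes that the two boosts multiply and that the extra head losses are summed. All function names are hypothetical.

```python
MTP_WEIGHT = 0.3     # weight on each extra prediction head
WORD_BOOST = 3.0     # multi-character tokens
START_BOOST = 3.0    # first tokens of an assistant response
START_SPAN = 20

def token_weights(tokens: list[str], response_start: int) -> list[float]:
    """Per-token loss weights; `response_start` indexes the first assistant token."""
    weights = []
    for i, tok in enumerate(tokens):
        w = 1.0
        if len(tok) > 1:
            w *= WORD_BOOST
        if response_start <= i < response_start + START_SPAN:
            w *= START_BOOST   # assumed to stack multiplicatively with the word boost
        weights.append(w)
    return weights

def weighted_mean(losses: list[float], weights: list[float]) -> float:
    """Average per-token losses using the weights above."""
    return sum(loss * w for loss, w in zip(losses, weights)) / sum(weights)

def total_loss(next_token_loss: float, mtp_losses: list[float]) -> float:
    """Main loss plus the horizon 2/3/4 head losses, each scaled by MTP_WEIGHT (assumed sum)."""
    return next_token_loss + MTP_WEIGHT * sum(mtp_losses)

tokens = ["h", "i", " ", "the", "cat"]
print(token_weights(tokens, response_start=3))
```

If the weighting works anything like this, loss values from this run are not directly comparable to an unweighted cross-entropy from another model.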
## Limitations

- Repeats itself.
- Knows almost nothing.
- Research only. Not an assistant.
- Sometimes hallucinates pipes.

---

*Built by [CompactAI](https://huggingface.co/CompactAI-O). We learn by failing.*