---
license: gpl-3.0
datasets:
- HuggingFaceFW/fineweb-edu
- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
- TeichAI/Step-3.5-Flash-2600x
- TeichAI/convo-v1
language:
- en
tags:
- small
- glint
new_version: CompactAI-O/Glint-0.3
---

# Glint-0.2

> The pipe character incident. We do not talk about the pipe character incident.

Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding-window attention, and multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.

Progress is not a straight line.

## What you get

| File | What it is |
| :--- | :--- |
| `tokenizer.json` | Hybrid word/char tokenizer (~2,133 tokens) |
| `pretrain.pt` | Base pretrained checkpoint |
| `model.pt` | Instruction-tuned checkpoint (SFT) |
| `samples.jsonl` | Sample generations with metrics at checkpoints |
| `loss_curve.png` | Training loss across all phases |

## Specs

| Thing | Value |
| :--- | :--- |
| **Architecture** | Transformer Decoder (GQA) |
| **Parameters** | ~700K |
| **Context** | 2,048 tokens |
| **Sliding Window** | 512 tokens |
| **d_model** | 128 |
| **Unique Layers** | 8 (tied to make 16 logical) |
| **Heads** | 4 |
| **KV Heads** | 2 |
| **FFN** | 224 |
| **Vocab** | ~2,133 (Hybrid Char + Word) |
| **Norm** | RMSNorm |
| **Position** | RoPE (25% fraction) |
| **Activation** | SwiGLU |
| **Multi-Token Prediction** | Horizons 2, 3, 4 |
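
To put the GQA numbers in perspective, here is a rough back-of-the-envelope sketch of KV cache size at full context. It assumes the head dimension is `d_model / heads` (32), which the table does not state, and it ignores the sliding window, which would shrink the cache further on local layers. The function name is made up for illustration.

```python
# Values from the spec table.
D_MODEL = 128
N_HEADS = 4
N_KV_HEADS = 2
N_LOGICAL_LAYERS = 16   # tied layers share weights, but each still keeps its own cache
CONTEXT = 2048

# Assumption: head_dim = d_model / heads. Not stated in the table.
HEAD_DIM = D_MODEL // N_HEADS


def kv_cache_elements(kv_heads: int, tokens: int = CONTEXT) -> int:
    """Number of cached K and V values for one sequence."""
    per_token_per_layer = 2 * kv_heads * HEAD_DIM  # 2 = one K plus one V
    return per_token_per_layer * N_LOGICAL_LAYERS * tokens


gqa = kv_cache_elements(N_KV_HEADS)  # 2 KV heads, as in Glint-0.2
mha = kv_cache_elements(N_HEADS)     # hypothetical baseline: one KV head per query head
print(gqa, mha, gqa / mha)           # 4194304 8388608 0.5
```

Under these assumptions, sharing 2 KV heads across 4 query heads halves the cache compared with plain multi-head attention.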

## Fancy tricks

- **Weight-tied layers:** 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
- **GQA:** 4 attention heads sharing 2 KV heads. Less cache, less compute.
- **Sliding window:** 512 tokens local, with periodic global layers for long-range context.
- **MTP:** Extra prediction heads at offsets 2, 3, and 4. Their losses are weighted at 0.3 during training.
- **Hybrid tokenizer:** Word-level where possible, char fallback for the weird stuff.
- **Word token loss boost:** 3x loss on multi-character tokens so the model actually learns words.
- **Response-start weighting:** First 20 tokens of assistant responses get 3x weight.
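
A few of the tricks above can be sketched in plain Python. This is not the training code, just one plausible reading of the list. Assumptions: blocks cycle 0..7 twice, "every 3rd layer" means logical layers 3, 6, 9, ... counting from 1, overlapping loss boosts combine by taking the larger one, and the MTP losses are summed before the 0.3 weight is applied. All function names are hypothetical.

```python
from typing import List, Tuple

N_UNIQUE = 8       # unique transformer blocks (from the spec table)
N_LOGICAL = 16     # logical layers after weight tying
WINDOW = 512       # sliding-window size in tokens
GLOBAL_EVERY = 3   # every 3rd logical layer uses global attention


def layer_plan() -> List[Tuple[int, bool]]:
    """Return (block_index, is_global) for each logical layer.

    Assumption: blocks cycle 0..7 twice, and layers 3, 6, 9, ...
    (counting from 1) are the global ones.
    """
    return [(i % N_UNIQUE, (i + 1) % GLOBAL_EVERY == 0) for i in range(N_LOGICAL)]


def can_attend(query_pos: int, key_pos: int, is_global: bool) -> bool:
    """Causal attention rule, optionally limited to the last WINDOW tokens."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    return is_global or (query_pos - key_pos) < WINDOW


def token_weights(tokens: List[str], response_start: int,
                  word_boost: float = 3.0, start_boost: float = 3.0,
                  start_len: int = 20) -> List[float]:
    """Per-token loss weights.

    Assumption: when both boosts apply, the larger one wins (max, not product).
    """
    weights = []
    for i, tok in enumerate(tokens):
        w = 1.0
        if len(tok) > 1:  # multi-character (word) token
            w = max(w, word_boost)
        if response_start <= i < response_start + start_len:
            w = max(w, start_boost)  # first tokens of the assistant response
        weights.append(w)
    return weights


def total_loss(next_token_loss: float, mtp_losses: List[float],
               mtp_weight: float = 0.3) -> float:
    """Assumption: the MTP head losses are summed, then scaled by 0.3."""
    return next_token_loss + mtp_weight * sum(mtp_losses)


plan = layer_plan()
print(plan[:3])                                  # [(0, False), (1, False), (2, True)]
print(sum(is_global for _, is_global in plan))   # 5
```

One side effect of this reading: since 8 is not a multiple of 3, the same tied block can run with a sliding window at one depth and global attention at another.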

## Training

| Thing | Value |
| :--- | :--- |
| **Batch Size** | 48 |
| **Pretrain LR** | 8e-4 (min 1e-5) |
| **SFT LR** | 2e-4 (min 1e-5) |
| **Warmup** | 300 steps |
| **Weight Decay** | 0.02 |
| **Max Grad Norm** | 1.0 |
| **Checkpoint** | Every 1,000 steps |
| **Sampling** | Every 5,000 steps |
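
For a feel of how the learning-rate numbers fit together, here is a minimal sketch of a schedule with 300 warmup steps, a peak of 8e-4, and a floor of 1e-5. The table does not say which decay shape was used or how many steps training ran for, so the linear warmup, the cosine decay, and the `total_steps` value are all assumptions.

```python
import math


def lr_at(step: int, peak: float = 8e-4, min_lr: float = 1e-5,
          warmup: int = 300, total_steps: int = 10_000) -> float:
    """Learning rate at a given step.

    Assumptions: linear warmup, then cosine decay down to min_lr.
    total_steps is a placeholder, not the real training length.
    """
    if step < warmup:
        # Ramp linearly up to the peak over the warmup steps.
        return peak * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 299, 300, 5_000, 10_000):
    print(step, lr_at(step))
```

For the SFT phase, the same sketch would be called with `peak=2e-4`.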

## Loss curve

![loss](model/loss_curve.png)

## Limitations

- Repeats itself.
- Knows almost nothing.
- Research only. Not an assistant.
- Sometimes hallucinates pipes.

---

*Built by [CompactAI](https://huggingface.co/CompactAI-O). We learn by failing.*