---
license: gpl-3.0
datasets:
- HuggingFaceFW/fineweb-edu
- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
- TeichAI/Step-3.5-Flash-2600x
- TeichAI/convo-v1
language:
- en
tags:
- small
- glint
new_version: CompactAI-O/Glint-0.3
---
Glint-0.2
The pipe character incident. We do not talk about the pipe character incident.
Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, multi-token prediction heads. Fancy stuff. And sometimes it still outputs |fdish||||!@|.
Progress is not a straight line.
What you get
| File | What it is |
|---|---|
| tokenizer.json | Hybrid word/char tokenizer (~2,133 tokens) |
| pretrain.pt | Base pretrained checkpoint |
| model.pt | Instruction-tuned checkpoint (SFT) |
| samples.jsonl | Sample generations with metrics at checkpoints |
| loss_curve.png | Training loss across all phases |
Specs
| Thing | Value |
|---|---|
| Architecture | Transformer Decoder (GQA) |
| Parameters | ~700K |
| Context | 2,048 tokens |
| Sliding Window | 512 tokens |
| d_model | 128 |
| Unique Layers | 8 (tied to make 16 logical) |
| Heads | 4 |
| KV Heads | 2 |
| FFN | 224 |
| Vocab | ~2,133 (Hybrid Char + Word) |
| Norm | RMSNorm |
| Position | RoPE (25% fraction) |
| Activation | SwiGLU |
| Multi-Token Prediction | Horizons 2, 3, 4 |
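
The numbers in the table imply a few derived sizes that are easy to get wrong by hand. A minimal sketch, assuming nothing about the actual training code: `GlintConfig` and its field names are made up for illustration, and reading "RoPE (25% fraction)" as "rotate 25% of each head's dimensions" is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlintConfig:
    # Values copied from the Specs table. Field names are hypothetical.
    d_model: int = 128
    n_heads: int = 4
    n_kv_heads: int = 2
    n_unique_layers: int = 8
    n_logical_layers: int = 16
    d_ffn: int = 224
    context: int = 2048
    sliding_window: int = 512
    rope_fraction: float = 0.25

    @property
    def head_dim(self) -> int:
        # Width of one attention head.
        return self.d_model // self.n_heads

    @property
    def rope_dims(self) -> int:
        # Assumption: the 25% fraction applies per head.
        return int(self.head_dim * self.rope_fraction)

    @property
    def queries_per_kv_head(self) -> int:
        # How many query heads share each KV head under GQA.
        return self.n_heads // self.n_kv_heads

    @property
    def kv_values_per_token_per_layer(self) -> int:
        # One key vector and one value vector per KV head.
        return 2 * self.n_kv_heads * self.head_dim


cfg = GlintConfig()
print(cfg.head_dim, cfg.rope_dims, cfg.queries_per_kv_head,
      cfg.kv_values_per_token_per_layer)
```

With 4 KV heads instead of 2, the last number would double, which is the "less cache" part of GQA.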
Fancy tricks
- Weight-tied layers: 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
- GQA: 4 attention heads sharing 2 KV heads. Less cache, less compute.
- Sliding window: 512 tokens local, with periodic global layers for long-range context.
- MTP: Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
- Hybrid tokenizer: Word-level where possible, char fallback for the weird stuff.
- Word token loss boost: 3x loss on multi-character tokens so the model actually learns words.
- Response-start weighting: First 20 tokens of assistant responses get 3x weight.
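
The tricks above can be sketched in plain Python. This is not the training code. Every function name here is hypothetical, and several details are guesses marked as assumptions in the comments: the order in which blocks are reused, which layers count as "every 3rd", whether the window includes the current token, and how the loss weights combine.

```python
from typing import List, Optional

N_UNIQUE, N_LOGICAL = 8, 16
WINDOW = 512
N_HEADS, N_KV_HEADS = 4, 2
MTP_WEIGHT = 0.3


def block_for_layer(layer: int) -> int:
    # Assumption: the 8 blocks are cycled in order (0..7, then 0..7 again).
    return layer % N_UNIQUE


def is_global(layer: int) -> bool:
    # Assumption: "every 3rd layer" means layers 3, 6, 9, ... counting from 1.
    return (layer + 1) % 3 == 0


def can_attend(layer: int, query_pos: int, key_pos: int) -> bool:
    # Causal everywhere; local layers also cap the look-back.
    if key_pos > query_pos:
        return False
    if is_global(layer):
        return True
    # Assumption: the 512-token window includes the current token.
    return query_pos - key_pos < WINDOW


def kv_head_for(query_head: int) -> int:
    # GQA: consecutive query heads share one KV head.
    return query_head // (N_HEADS // N_KV_HEADS)


def token_loss_weight(token: str, pos_in_response: Optional[int]) -> float:
    # pos_in_response is None outside assistant responses.
    weight = 1.0
    if len(token) > 1:
        weight *= 3.0  # word token loss boost
    if pos_in_response is not None and pos_in_response < 20:
        weight *= 3.0  # response-start weighting
    # Assumption: the two boosts multiply when both apply.
    return weight


def total_loss(next_token_loss: float, mtp_losses: List[float]) -> float:
    # Assumption: each extra horizon (2, 3, 4) adds 0.3 times its own loss.
    return next_token_loss + MTP_WEIGHT * sum(mtp_losses)


global_layers = [l for l in range(N_LOGICAL) if is_global(l)]
print("global layers (0-based):", global_layers)
print("block per layer:", [block_for_layer(l) for l in range(N_LOGICAL)])
```

Under these assumptions the attention pattern follows the logical layer index, not the block, so one shared block can run with a local mask on one pass and a global mask on another.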
Training
| Thing | Value |
|---|---|
| Batch Size | 48 |
| Pretrain LR | 8e-4 (min 1e-5) |
| SFT LR | 2e-4 (min 1e-5) |
| Warmup | 300 steps |
| Weight Decay | 0.02 |
| Max Grad Norm | 1.0 |
| Checkpoint | Every 1,000 steps |
| Sampling | Every 5,000 steps |
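
The table gives a peak LR, a minimum LR, and a warmup length, but not the shape of the decay. A minimal sketch, assuming linear warmup followed by cosine decay down to the minimum: the cosine shape, the `lr_at` name, and the `total_steps` value are all assumptions, not something this card states.

```python
import math

WARMUP_STEPS = 300


def lr_at(step: int, total_steps: int, peak_lr: float, min_lr: float = 1e-5) -> float:
    # Linear warmup: reaches peak_lr on the last warmup step.
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    remaining = max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / remaining)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# total_steps=10_000 is a placeholder; the real run length is not given here.
for step in (0, 150, 300, 5_000, 10_000):
    print(step, lr_at(step, total_steps=10_000, peak_lr=8e-4))
```

For the SFT phase, the same function would be called with `peak_lr=2e-4`.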
Loss curve
See loss_curve.png for the training loss across all phases.
Limitations
- Repeats itself.
- Knows almost nothing.
- Research only. Not an assistant.
- Sometimes hallucinates pipes.
Built by CompactAI. We learn by failing.
