HuggingFaceFW/fineweb-edu
The pipe character incident. We do not talk about the pipe character incident.
Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.
Progress is not a straight line.
| File | What it is |
|---|---|
| tokenizer.json | Hybrid word/char tokenizer (~2,133 tokens) |
| pretrain.pt | Base pretrained checkpoint |
| model.pt | Instruction-tuned checkpoint (SFT) |
| samples.jsonl | Sample generations with metrics at checkpoints |
| loss_curve.png | Training loss across all phases |
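
If you want to poke at `samples.jsonl`, here is a minimal sketch for reading it. It assumes only that the file is ordinary JSON Lines (one JSON object per line); the field names inside each record are not documented here, so the code does not guess at them. `parse_jsonl` and `read_jsonl` are hypothetical helper names, not part of this repo.

```python
import json


def parse_jsonl(lines):
    """Turn an iterable of JSON Lines strings into a list of records.

    ASSUMPTION: one JSON object per line; blank lines are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def read_jsonl(path):
    """Read a JSON Lines file from disk (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        return parse_jsonl(handle)
```

Calling `read_jsonl("samples.jsonl")` should give one record per sample; print the keys of the first one to see what the file actually stores.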
| Thing | Value |
|---|---|
| Architecture | Transformer Decoder (GQA) |
| Parameters | ~700K |
| Context | 2,048 tokens |
| Sliding Window | 512 tokens |
| d_model | 128 |
| Unique Layers | 8 (tied to make 16 logical) |
| Heads | 4 |
| KV Heads | 2 |
| FFN | 224 |
| Vocab | ~2,133 (Hybrid Char + Word) |
| Norm | RMSNorm |
| Position | RoPE (25% fraction) |
| Activation | SwiGLU |
| Multi-Token Prediction | Horizons 2, 3, 4 |
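
The numbers in the table imply a few derived quantities, and this rough sketch spells them out. The class and function names are hypothetical, and several details are assumptions rather than facts about Glint-0.2: that `d_model` is split evenly across heads, that consecutive query heads share a KV head, that the 8 unique layers are run twice in order, that the sliding window includes the current token, and that a horizon-`h` head predicts the token `h` positions ahead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlintConfig:  # hypothetical name, values copied from the table above
    d_model: int = 128
    n_heads: int = 4
    n_kv_heads: int = 2
    n_unique_layers: int = 8
    n_logical_layers: int = 16
    context: int = 2048
    window: int = 512
    rope_fraction: float = 0.25
    mtp_horizons: tuple = (2, 3, 4)

    @property
    def head_dim(self) -> int:
        # ASSUMPTION: d_model is split evenly across query heads.
        return self.d_model // self.n_heads

    @property
    def rotary_dims(self) -> int:
        # ASSUMPTION: "25% fraction" means a quarter of each head's dims get RoPE.
        return int(self.head_dim * self.rope_fraction)


def kv_head_for(query_head: int, cfg: GlintConfig) -> int:
    """GQA grouping. ASSUMPTION: consecutive query heads share one KV head."""
    group_size = cfg.n_heads // cfg.n_kv_heads
    return query_head // group_size


def unique_layer_for(logical_layer: int, cfg: GlintConfig) -> int:
    """ASSUMPTION: the unique stack is run twice in order (0..7, then 0..7)."""
    return logical_layer % cfg.n_unique_layers


def can_attend(query_pos: int, key_pos: int, cfg: GlintConfig) -> bool:
    """Causal sliding window. ASSUMPTION: the window counts the current token."""
    return 0 <= query_pos - key_pos < cfg.window


def mtp_targets(tokens: list, horizon: int) -> list:
    """ASSUMPTION: the head for horizon h at position t predicts tokens[t + h]."""
    return tokens[horizon:]


cfg = GlintConfig()
```

Under these assumptions each head is 32 dimensions wide, 8 of them rotary, and query heads 0 and 1 read from KV head 0 while heads 2 and 3 read from KV head 1. If the real model interleaves tied layers or groups heads differently, only the two mapping functions need to change.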
| Thing | Value |
|---|---|
| Batch Size | 48 |
| Pretrain LR | 8e-4 (min 1e-5) |
| SFT LR | 2e-4 (min 1e-5) |
| Warmup | 300 steps |
| Weight Decay | 0.02 |
| Max Grad Norm | 1.0 |
| Checkpoint | Every 1,000 steps |
| Sampling | Every 5,000 steps |
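
The table gives a peak learning rate, a floor, and a warmup length, but not the shape of the decay or the total step count. The sketch below assumes linear warmup followed by cosine decay, which is a common choice and not confirmed for this run. `learning_rate` and `total_steps` are hypothetical names; the defaults are the pretraining values from the table.

```python
import math


def learning_rate(step, total_steps, peak_lr=8e-4, min_lr=1e-5, warmup_steps=300):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    ASSUMPTION: the decay shape is cosine; the table only states the
    peak, the minimum, and the warmup length.
    """
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For the SFT phase the same function would be called with `peak_lr=2e-4`; whether warmup is repeated for SFT is not stated in the table.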
Built by CompactAI. We learn by failing.