license: gpl-3.0
datasets:
  - HuggingFaceFW/fineweb-edu
  - mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
  - tatsu-lab/alpaca
  - databricks/databricks-dolly-15k
  - TeichAI/Step-3.5-Flash-2600x
  - TeichAI/convo-v1
language:
  - en
tags:
  - small
  - glint
new_version: CompactAI-O/Glint-0.3

Glint-0.2

The pipe character incident. We do not talk about the pipe character incident.

Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, and multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.

Progress is not a straight line.

What you get

File	What it is
`tokenizer.json`	Hybrid word/char tokenizer (~2,133 tokens)
`pretrain.pt`	Base pretrained checkpoint
`model.pt`	Instruction-tuned checkpoint (SFT)
`samples.jsonl`	Sample generations with metrics at checkpoints
`loss_curve.png`	Training loss across all phases

Specs

Thing	Value
Architecture	Transformer Decoder (GQA)
Parameters	~700K
Context	2,048 tokens
Sliding Window	512 tokens
d_model	128
Unique Layers	8 (tied to make 16 logical)
Heads	4
KV Heads	2
FFN	224
Vocab	~2,133 (Hybrid Char + Word)
Norm	RMSNorm
Position	RoPE (25% fraction)
Activation	SwiGLU
Multi-Token Prediction	Horizons 2, 3, 4
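
A few derived numbers fall out of the table. The sketch below is plain arithmetic on the listed values. It assumes the head dimension is `d_model / heads`, that the 25% RoPE fraction applies to each head's dimensions, and that query heads are grouped contiguously onto KV heads; the table states none of these outright, and the variable names are illustrative.

```python
# Derived sizes from the Specs table. Assumptions (not stated in the table):
# head_dim = d_model / heads, the RoPE fraction applies per head, and
# query heads are grouped contiguously onto KV heads.
D_MODEL = 128
N_HEADS = 4
N_KV_HEADS = 2
ROPE_FRACTION = 0.25

head_dim = D_MODEL // N_HEADS                 # 32
queries_per_kv = N_HEADS // N_KV_HEADS        # 2 query heads share each KV head
rotary_dims = int(head_dim * ROPE_FRACTION)   # 8 dims per head get rotated

# Which KV head each query head reads from.
kv_head_for_query = [q // queries_per_kv for q in range(N_HEADS)]

# Values cached per token per logical layer (keys + values).
kv_cache_gqa = 2 * N_KV_HEADS * head_dim      # 128
kv_cache_mha = 2 * N_HEADS * head_dim         # 256 if every head had its own KV

print(head_dim, queries_per_kv, rotary_dims, kv_head_for_query)
print(kv_cache_gqa, kv_cache_mha)
```

Under these assumptions, sharing 2 KV heads across 4 query heads halves the per-token cache compared with giving every head its own keys and values.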

Fancy tricks

Weight-tied layers: 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
GQA: 4 attention heads sharing 2 KV heads. Less cache, less compute.
Sliding window: 512 tokens local, with periodic global layers for long-range context.
MTP: Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
Hybrid tokenizer: Word-level where possible, char fallback for the weird stuff.
Word token loss boost: 3x loss on multi-character tokens so the model actually learns words.
Response-start weighting: First 20 tokens of assistant responses get 3x weight.
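
The layer tying and the local/global attention pattern can be sketched in a few lines. This is a guess at the layout, not the actual training code: it assumes the 8 blocks are cycled in order (0 to 7, then 0 to 7 again), that "every 3rd layer" means logical layers 3, 6, 9 and so on, and that the 512-token window includes the current token. `layer_plan` and `can_attend` are hypothetical names.

```python
# Sketch of the layer schedule and attention masks. Assumptions: the 8 blocks
# are cycled in order, and "every 3rd layer" means logical layers 3, 6, 9, ...
N_UNIQUE = 8
N_LOGICAL = 16
WINDOW = 512
GLOBAL_EVERY = 3


def layer_plan():
    """Return (unique_block_index, is_global) for each logical layer."""
    return [(i % N_UNIQUE, (i + 1) % GLOBAL_EVERY == 0) for i in range(N_LOGICAL)]


def can_attend(query_pos, key_pos, is_global, window=WINDOW):
    """Causal mask; local layers also require the key to be inside the window."""
    if key_pos > query_pos:
        return False
    return is_global or (query_pos - key_pos) < window


plan = layer_plan()
# 1-indexed logical layers that use global attention.
global_layers = [i + 1 for i, (_, is_global) in enumerate(plan) if is_global]
print(global_layers)
```

One consequence of this layout: the same unique block can run once as a local layer and once as a global layer, so tied weights have to cope with both attention patterns.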
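
The loss weighting can be sketched the same way. The numbers (3x, 20 tokens, 0.3) come from the list above; how they combine is an assumption. Here the per-token boosts multiply, and the three MTP losses are summed before scaling. `token_weight` and `total_loss` are hypothetical names, and a real tokenizer would probably need to exclude multi-character special tokens from the word boost.

```python
# Sketch of the loss weighting described above. How the factors combine is
# an assumption: per-token boosts multiply, and MTP losses are summed.
WORD_BOOST = 3.0
RESPONSE_START_BOOST = 3.0
RESPONSE_START_TOKENS = 20
MTP_WEIGHT = 0.3


def token_weight(token_text, pos_in_response):
    """Weight for one target token.

    pos_in_response is the 0-based index inside an assistant response,
    or None when the token is not part of one.
    """
    weight = 1.0
    if len(token_text) > 1:  # multi-character token, treated as word-level
        weight *= WORD_BOOST
    if pos_in_response is not None and pos_in_response < RESPONSE_START_TOKENS:
        weight *= RESPONSE_START_BOOST
    return weight


def total_loss(next_token_loss, mtp_losses):
    """Combine the main loss with the horizon-2/3/4 auxiliary losses."""
    return next_token_loss + MTP_WEIGHT * sum(mtp_losses)


print(token_weight("the", 5), token_weight("x", 20))
```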

Training

Thing	Value
Batch Size	48
Pretrain LR	8e-4 (min 1e-5)
SFT LR	2e-4 (min 1e-5)
Warmup	300 steps
Weight Decay	0.02
Max Grad Norm	1.0
Checkpoint	Every 1,000 steps
Sampling	Every 5,000 steps
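
The table gives a peak LR, a floor, and a warmup length, but not the shape of the schedule or the total step count. The sketch below assumes linear warmup followed by cosine decay, which is a common choice, and uses a made-up `total_steps`. Treat it as an illustration of how the numbers might fit together, not as the schedule that was actually used.

```python
import math

# Sketch of a warmup-then-decay schedule using the table's numbers.
# Assumptions: linear warmup, cosine decay, and a made-up total step count.
WARMUP_STEPS = 300


def lr_at(step, peak_lr, min_lr=1e-5, total_steps=10_000):
    """Learning rate at a given step (total_steps is a hypothetical value)."""
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


pretrain_peak = lr_at(299, peak_lr=8e-4)   # last warmup step, pretraining
sft_final = lr_at(10_000, peak_lr=2e-4)    # end of decay, SFT
print(pretrain_peak, sft_final)
```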

Loss curve

See `loss_curve.png` for the training loss across all phases.

Limitations

Repeats itself.
Knows almost nothing.
Research only. Not an assistant.
Sometimes hallucinates pipes.

Built by CompactAI. We learn by failing.