	---
	license: gpl-3.0
	datasets:
	- HuggingFaceFW/fineweb-edu
	- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
	- tatsu-lab/alpaca
	- databricks/databricks-dolly-15k
	- TeichAI/Step-3.5-Flash-2600x
	- TeichAI/convo-v1
	language:
	- en
	tags:
	- small
	- glint
	new_version: CompactAI-O/Glint-0.3
	---

	# Glint-0.2

	> The pipe character incident. We do not talk about the pipe character incident.

	Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.

	Progress is not a straight line.

	## What you get

	| File | What it is |
	| :--- | :--- |
	| `tokenizer.json` | Hybrid word/char tokenizer (~2,133 tokens) |
	| `pretrain.pt` | Base pretrained checkpoint |
	| `model.pt` | Instruction-tuned checkpoint (SFT) |
	| `samples.jsonl` | Sample generations with metrics at checkpoints |
	| `loss_curve.png` | Training loss across all phases |

	## Specs

	| Thing | Value |
	| :--- | :--- |
	| Architecture | Transformer Decoder (GQA) |
	| Parameters | ~700K |
	| Context | 2,048 tokens |
	| Sliding Window | 512 tokens |
	| d_model | 128 |
	| Unique Layers | 8 (tied to make 16 logical) |
	| Heads | 4 |
	| KV Heads | 2 |
	| FFN | 224 |
	| Vocab | ~2,133 (Hybrid Char + Word) |
	| Norm | RMSNorm |
	| Position | RoPE (25% fraction) |
	| Activation | SwiGLU |
	| Multi-Token Prediction | Horizons 2, 3, 4 |
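
To make the head counts in the table concrete, here is a rough sketch of what 4 heads over 2 KV heads does to the cache. It is not the actual model code. It assumes the per-head dimension is `d_model / heads` (128 / 4 = 32), which the table does not state, and that query heads are grouped consecutively. All function names are made up for illustration.

```python
# Sketch: how grouped-query attention shrinks the KV cache.
# Assumption: head_dim = d_model // heads; the README does not state it.

D_MODEL = 128
HEADS = 4
KV_HEADS = 2
HEAD_DIM = D_MODEL // HEADS  # 32 under the assumption above


def kv_head_for(query_head: int) -> int:
    """Map a query head to the KV head it shares (consecutive grouping assumed)."""
    group_size = HEADS // KV_HEADS
    return query_head // group_size


def kv_cache_values(tokens: int, kv_heads: int) -> int:
    """Number of cached K and V values for one layer holding `tokens` positions."""
    return 2 * tokens * kv_heads * HEAD_DIM


mapping = [kv_head_for(h) for h in range(HEADS)]
gqa_per_token = kv_cache_values(1, KV_HEADS)   # with 2 KV heads
mha_per_token = kv_cache_values(1, HEADS)      # if every head had its own K/V
window_layer = kv_cache_values(512, KV_HEADS)  # a sliding-window layer at capacity
global_layer = kv_cache_values(2048, KV_HEADS) # a global layer at full context

print(mapping)
print(gqa_per_token, mha_per_token)
print(window_layer, global_layer)
```

Under these assumptions, GQA halves the per-token cache, and a sliding-window layer at capacity holds a quarter of what a full-context global layer does.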

	## Fancy tricks

	- Weight-tied layers: 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
	- GQA: 4 attention heads sharing 2 KV heads. Less cache, less compute.
	- Sliding window: 512 tokens local, with periodic global layers for long-range context.
	- MTP: Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
	- Hybrid tokenizer: Word-level where possible, char fallback for the weird stuff.
	- Word token loss boost: 3x loss on multi-character tokens so the model actually learns words.
	- Response-start weighting: First 20 tokens of assistant responses get 3x weight.
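
The layer sharing and the local/global split above can be sketched in a few lines. This is a guess at the schedule, not the real implementation: it assumes the 8 blocks repeat in order (0 to 7, then 0 to 7 again) and that "every 3rd layer" means logical layers 3, 6, 9, and so on, counting from 1. The function names are hypothetical.

```python
# Sketch: 8 shared blocks unrolled into 16 logical layers, with every
# third layer using global attention instead of the sliding window.
# Assumptions: blocks repeat in order, and "every 3rd layer" means
# logical layers 3, 6, 9, ... (1-indexed).

UNIQUE_BLOCKS = 8
LOGICAL_LAYERS = 16
GLOBAL_EVERY = 3
WINDOW = 512


def layer_plan():
    """Return (shared block index, attention kind) for each logical layer."""
    plan = []
    for layer in range(LOGICAL_LAYERS):
        block = layer % UNIQUE_BLOCKS
        kind = "global" if (layer + 1) % GLOBAL_EVERY == 0 else "local"
        plan.append((block, kind))
    return plan


def can_attend(query_pos, key_pos, kind):
    """Causal mask rule: local layers only see the last WINDOW positions."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    if kind == "global":
        return True
    return query_pos - key_pos < WINDOW


plan = layer_plan()
for layer, (block, kind) in enumerate(plan, start=1):
    print(f"layer {layer:2d} -> block {block} ({kind})")
```

One side effect of this particular reading: the same shared block can run with a local mask at one depth and a global mask at another (block 0 is local at layer 1 and global at layer 9). That is fine in the sketch because the mask carries no weights.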

	## Training

	| Thing | Value |
	| :--- | :--- |
	| Batch Size | 48 |
	| Pretrain LR | 8e-4 (min 1e-5) |
	| SFT LR | 2e-4 (min 1e-5) |
	| Warmup | 300 steps |
	| Weight Decay | 0.02 |
	| Max Grad Norm | 1.0 |
	| Checkpoint | Every 1,000 steps |
	| Sampling | Every 5,000 steps |
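
The loss weights from the Fancy tricks list (0.3 on the MTP heads, 3x on multi-character tokens, 3x on the first 20 response tokens) sit on top of these hyperparameters. Here is a minimal sketch of how they might combine. It assumes the two 3x boosts multiply when both apply, and that each MTP horizon's loss is scaled by 0.3 and added to the next-token loss. The README does not say either way, and every name below is hypothetical.

```python
# Sketch: per-token loss weights and the MTP-weighted total loss.
# Assumptions: the two 3x boosts multiply when both apply, and each MTP
# horizon's loss is scaled by 0.3 and added to the next-token loss.

WORD_BOOST = 3.0
RESPONSE_START_BOOST = 3.0
RESPONSE_START_TOKENS = 20
MTP_WEIGHT = 0.3


def token_weights(tokens, response_start):
    """Weight for each target token.

    `tokens` is a list of token strings and `response_start` is the index
    of the first assistant token (a hypothetical input format).
    """
    weights = []
    for i, tok in enumerate(tokens):
        w = 1.0
        if len(tok) > 1:  # multi-character (word-level) token
            w *= WORD_BOOST
        if response_start <= i < response_start + RESPONSE_START_TOKENS:
            w *= RESPONSE_START_BOOST
        weights.append(w)
    return weights


def weighted_mean(losses, weights):
    """Weighted average of per-token losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


def total_loss(next_token_loss, mtp_losses):
    """Next-token loss plus the down-weighted MTP losses (horizons 2, 3, 4)."""
    return next_token_loss + MTP_WEIGHT * sum(mtp_losses)


tokens = ["hi", "?", "the", "a", "!"]
weights = token_weights(tokens, response_start=2)
print(weights)
```

In the toy example, `"the"` is both a word token and inside the response start, so it gets 9x under the multiply assumption. If the real training code takes the maximum of the two boosts instead, it would get 3x.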

	## Loss curve

	![loss](model/loss_curve.png)

	## Limitations

	- Repeats itself.
	- Knows almost nothing.
	- Research only. Not an assistant.
	- Sometimes hallucinates pipes.

	---

	Built by [CompactAI](https://huggingface.co/CompactAI-O). We learn by failing.
