CompactAI committed on
Commit 80fc0da · verified · 1 Parent(s): 8971fc8

Upload README.md

Files changed (1)
  1. README.md +45 -54
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: apache-2.0
  datasets:
  - HuggingFaceFW/fineweb-edu
  - mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
@@ -11,63 +11,60 @@ language:
  - en
  tags:
  - small
- - haiku
- new_version: CompactAI-O/TMLM-Haiku-2
  ---
- # TinyMemoryLM (Haiku)

- > **⚠️ IMPORTANT NOTICE**
- > 1. **The model is really dumb.** This is a sub-1M parameter research model designed for experimentation, not production use.
- > 2. **Do not expect it to answer any questions.** It is prone to repetition, hallucination, and format collapse.

- ## Overview

- TinyMemoryLM is an ultra-lightweight language model optimized for edge cases and architectural experimentation. Despite its small footprint, it incorporates several novel training innovations aimed at stabilizing tiny model convergence, including hybrid tokenization, loss boosting strategies, and context-aware relevance modeling.

- This release includes both **Pretrained Weights** (base language modeling, completion) and **Instruction Weights** (fine-tuned for chat).

- ## Files Provided

- | File | Description |
  | :--- | :--- |
- | `tokenizer.json` | Hybrid word/character tokenizer vocabulary (2,133 tokens). |
- | `pretrain.pt` | Base pretrained checkpoint (language modeling). |
- | `model.pt` | Instruction-tuned checkpoint (SFT/Chat). |
- | `samples.jsonl` | Sample generations with NLL/PPL metrics at checkpoints. |
- | `loss_curve.png` | Training loss progression across all phases. |
 
- ## Model Specifications

- | Parameter | Value |
  | :--- | :--- |
  | **Architecture** | Transformer Decoder (GQA) |
  | **Parameters** | ~700K |
- | **Context Length** | 2,048 tokens |
  | **Sliding Window** | 512 tokens |
- | **Dimensions** | `d_model=128`, `unique_layers=8`, `logical_layers=16`, `heads=4`, `kv_heads=2`, `ffn=224` |
- | **Vocabulary** | ~2,133 tokens (Hybrid Char + Word) |
- | **Normalization** | RMSNorm |
- | **Embeddings** | Rotary Embeddings (RoPE, 25% fraction) |
  | **Activation** | SwiGLU |
- | **Multi-Token Prediction** | Horizons at 2, 3, 4 |
 
- ## Architecture Highlights

- TinyMemoryLM implements several research-focused modifications to standard transformer architectures:

- * **Weight-Tied Logical Layers:** 8 unique transformer blocks are repeated to create 16 logical layers (every 3rd layer uses global attention vs. sliding window), drastically reducing parameter count.
- * **Grouped-Query Attention (GQA):** 4 attention heads share 2 KV heads, reducing KV cache and compute.
- * **Sliding Window Attention:** Local attention within 512-token windows, with periodic global layers for long-range context.
- * **Multi-Token Prediction (MTP):** Auxiliary prediction heads at horizons 2, 3, and 4 with dedicated adapters and norms, weighted at 0.3 during training.
- * **Hybrid Tokenizer:** Combines character-level fallback with frequent word tokens to balance compression and vocabulary size.
- * **Word Token Loss Boosting:** Upweights loss signals for multi-character tokens (3x) to prevent the model from ignoring them in favor of character-level spelling.
- * **Response-Start Weighting:** Prioritizes the first 20 tokens of assistant responses (3x weight) to improve prompt conditioning.
- * **Embedding Scale:** Learned scaling factor applied to token embeddings for improved training dynamics.
 
-
- ### Training Hyperparameters
-
- | Parameter | Value |
  | :--- | :--- |
  | **Batch Size** | 48 |
  | **Pretrain LR** | 8e-4 (min 1e-5) |
@@ -75,26 +72,20 @@ TinyMemoryLM implements several research-focused modifications to standard trans
  | **Warmup** | 300 steps |
  | **Weight Decay** | 0.02 |
  | **Max Grad Norm** | 1.0 |
- | **MTP Weight** | 0.3 |
- | **Word Token Loss Boost** | 3.0x |
- | **Response-Start Boost** | 3.0x (first 20 tokens) |
- | **Checkpointing** | Every 1,000 steps |
  | **Sampling** | Every 5,000 steps |
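
The three loss-weighting rows in this table (MTP weight 0.3, word token boost 3.0x, response-start boost 3.0x over the first 20 tokens) could combine roughly as in the sketch below. It is an illustration under assumptions, not the training code: the README does not say whether the two boosts multiply or whether the MTP losses are summed or averaged, so this version multiplies the boosts and sums the auxiliary losses. The names `token_loss_weights` and `total_loss` are hypothetical.

```python
def token_loss_weights(tokens, response_start,
                       word_boost=3.0, start_boost=3.0, start_len=20):
    """Per-token loss weights for one sequence.

    tokens: list of token strings.
    response_start: index of the first assistant-response token.
    Assumption: a "word token" is any token longer than one character,
    and the two boosts multiply when both apply.
    """
    weights = []
    for i, tok in enumerate(tokens):
        w = 1.0
        if len(tok) > 1:  # multi-character (word) token
            w *= word_boost
        if response_start <= i < response_start + start_len:
            w *= start_boost  # inside the first `start_len` response tokens
        weights.append(w)
    return weights


def total_loss(main_loss, mtp_losses, mtp_weight=0.3):
    """Next-token loss plus weighted auxiliary MTP losses.

    Assumption: the auxiliary losses (horizons 2, 3, 4) are summed.
    """
    return main_loss + mtp_weight * sum(mtp_losses)


weights = token_loss_weights(["hi", "!", "the", "a"], response_start=2)
combined = total_loss(2.0, [1.0, 1.0, 1.0])
```

Here `"the"` is both a word token and inside the response start, so it ends up with the largest weight in the example.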
 
- ## Training Loss Curve
-
- Training loss progression across pretrain and SFT phases:
-
- ![Training Loss Curve](model/loss_curve.png)

- ## Limitations & Expectations

- Please manage your expectations when using TinyMemoryLM:

- * **Repetition:** Tiny models are prone to collapsing into repetitive token loops.
- * **Knowledge:** The model has limited world knowledge due to parameter constraints.
- * **Usage:** This model is intended for **research, educational purposes, and architectural benchmarking**. It is not suitable for assistant tasks or reliable information retrieval.

  ---

- *Generated for research purposes. Use responsibly.*
 
@@ -1,5 +1,5 @@
  ---
+ license: gpl-3.0
  datasets:
  - HuggingFaceFW/fineweb-edu
  - mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
 
@@ -11,63 +11,60 @@ language:
  - en
  tags:
  - small
+ - glint
+ new_version: CompactAI-O/Glint-0.3
  ---

+ # Glint-0.2

+ > The pipe character incident. We do not talk about the pipe character incident.

+ Glint-0.2 was supposed to be the smart one. It has weight-tied layers, grouped-query attention, sliding windows, multi-token prediction heads. Fancy stuff. And sometimes it still outputs `|fdish||||!@|`.

+ Progress is not a straight line.

+ ## What you get

+ | File | What it is |
  | :--- | :--- |
+ | `tokenizer.json` | Hybrid word/char tokenizer (~2,133 tokens) |
+ | `pretrain.pt` | Base pretrained checkpoint |
+ | `model.pt` | Instruction-tuned checkpoint (SFT) |
+ | `samples.jsonl` | Sample generations with metrics at checkpoints |
+ | `loss_curve.png` | Training loss across all phases |

+ ## Specs

+ | Thing | Value |
  | :--- | :--- |
  | **Architecture** | Transformer Decoder (GQA) |
  | **Parameters** | ~700K |
+ | **Context** | 2,048 tokens |
  | **Sliding Window** | 512 tokens |
+ | **d_model** | 128 |
+ | **Unique Layers** | 8 (tied to make 16 logical) |
+ | **Heads** | 4 |
+ | **KV Heads** | 2 |
+ | **FFN** | 224 |
+ | **Vocab** | ~2,133 (Hybrid Char + Word) |
+ | **Norm** | RMSNorm |
+ | **Position** | RoPE (25% fraction) |
  | **Activation** | SwiGLU |
+ | **Multi-Token Prediction** | Horizons 2, 3, 4 |
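
To make the Heads / KV Heads rows concrete, here is a small back-of-the-envelope sketch of what 4 query heads sharing 2 KV heads means for the cache. It assumes `head_dim = d_model / heads = 32` and one cache entry per logical layer (16); neither is stated in the README, and the helper names are hypothetical.

```python
def kv_cache_floats_per_token(d_model, n_heads, n_kv_heads, n_layers):
    """Number of cached values per token: keys and values for every KV head
    in every layer. Assumes head_dim = d_model // n_heads."""
    head_dim = d_model // n_heads
    return 2 * n_kv_heads * head_dim * n_layers


def query_to_kv_head(n_heads, n_kv_heads):
    """Which KV head each query head reads from, assuming contiguous groups."""
    group_size = n_heads // n_kv_heads
    return [q // group_size for q in range(n_heads)]


gqa_cache = kv_cache_floats_per_token(128, n_heads=4, n_kv_heads=2, n_layers=16)
mha_cache = kv_cache_floats_per_token(128, n_heads=4, n_kv_heads=4, n_layers=16)
head_map = query_to_kv_head(4, 2)
```

Under these assumptions the GQA cache is half the size of a full multi-head cache, and query heads 0-1 share KV head 0 while heads 2-3 share KV head 1.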
 
+ ## Fancy tricks

+ - **Weight-tied layers:** 8 unique transformer blocks repeated to make 16 layers. Every 3rd layer gets global attention instead of sliding window. Cheap and surprisingly effective.
+ - **GQA:** 4 attention heads sharing 2 KV heads. Less cache, less compute.
+ - **Sliding window:** 512 tokens local, with periodic global layers for long-range context.
+ - **MTP:** Extra prediction heads at offsets 2, 3, and 4. Weighted at 0.3 during training.
+ - **Hybrid tokenizer:** Word-level where possible, char fallback for the weird stuff.
+ - **Word token loss boost:** 3x loss on multi-character tokens so the model actually learns words.
+ - **Response-start weighting:** First 20 tokens of assistant responses get 3x weight.
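
The weight-tying and attention-pattern bullets above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the cyclic reuse of blocks (logical layer `i` runs unique block `i % 8`), the choice of which layers count as "every 3rd", the causal masking, and the names `layer_schedule` and `can_attend` are all assumptions.

```python
def layer_schedule(n_unique=8, n_logical=16, global_every=3):
    """Return (unique_block_index, attention_kind) for each logical layer.

    Assumptions: blocks are reused cyclically, and layers 3, 6, 9, ...
    (1-based) use global attention while the rest use a sliding window.
    """
    schedule = []
    for i in range(n_logical):
        block = i % n_unique
        kind = "global" if (i + 1) % global_every == 0 else "sliding"
        schedule.append((block, kind))
    return schedule


def can_attend(query_pos, key_pos, kind, window=512):
    """Causal attention rule for one query/key pair."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    if kind == "global":
        return True  # global layers see the whole prefix
    return query_pos - key_pos < window  # sliding layers see the last `window` tokens


schedule = layer_schedule()
n_global = sum(1 for _, kind in schedule if kind == "global")
```

With these assumptions, 5 of the 16 logical layers are global, and a token at position 600 can reach position 50 only through one of those global layers.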
 
+ ## Training

+ | Thing | Value |
  | :--- | :--- |
  | **Batch Size** | 48 |
  | **Pretrain LR** | 8e-4 (min 1e-5) |
@@ -75,26 +72,20 @@ TinyMemoryLM implements several research-focused modifications to standard trans
  | **Warmup** | 300 steps |
  | **Weight Decay** | 0.02 |
  | **Max Grad Norm** | 1.0 |
+ | **Checkpoint** | Every 1,000 steps |
  | **Sampling** | Every 5,000 steps |
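
One way to read the Pretrain LR and Warmup rows is a warmup-then-decay schedule, sketched below. The table only gives the peak (8e-4), the floor (1e-5), and the warmup length (300 steps); the linear warmup shape, the cosine decay, and the `total_steps` value used in the example are assumptions, and `pretrain_lr` is a hypothetical name.

```python
import math


def pretrain_lr(step, total_steps, peak=8e-4, floor=1e-5, warmup=300):
    """Learning rate at a given step.

    Assumed shape: linear warmup to `peak` over `warmup` steps,
    then cosine decay down to `floor` at `total_steps`.
    """
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# total_steps=10_000 is a made-up value for illustration only.
lr_start = pretrain_lr(0, total_steps=10_000)
lr_peak = pretrain_lr(300, total_steps=10_000)
lr_end = pretrain_lr(10_000, total_steps=10_000)
```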
 
+ ## Loss curve

+ ![loss](model/loss_curve.png)

+ ## Limitations

+ - Repeats itself.
+ - Knows almost nothing.
+ - Research only. Not an assistant.
+ - Sometimes hallucinates pipes.

  ---

+ *Built by [CompactAI](https://huggingface.co/CompactAI-O). We learn by failing.*