# QuantMobileLLM: Lightweight GPT-Style Language Model

MobileLLM is a **lightweight GPT-style language model** designed for efficiency, fast inference, and small deployment environments.
It is trained on **FineWeb-MINI** and optimized with **modern attention techniques**.

---

## 🚀 Model Highlights
- **Architecture**: Decoder-only GPT-style transformer
- **Parameters**: ~17M (6 layers, 8 heads, 256 embedding dim)
- **Context Length**: 512 tokens
- **Vocabulary Size**: 50,304 tokens
- **Precision**: Supports both `fp16` and `bf16`
- **Optimized for**: Small GPUs and mobile inference

---

## 🧠 Architecture Details

| **Component**      | **Value** |
|--------------------|-----------|
| Layers             | 6 |
| Attention Heads    | 8 |
| KV Heads           | 4 |
| Embedding Dim      | 256 |
| Context Length     | 512 |
| Vocab Size         | 50,304 |
| Attention Type     | Grouped-Query Attention (GQA) |
| Norm Type          | RMSNorm |
| Position Encoding  | Rotary Position Embeddings (RoPE) |
| FFN Activation     | SwiGLU (`silu`) |
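
Collected into code, the table corresponds to a configuration roughly like the following. This is a minimal sketch: the field names (and the `norm_eps` default) are illustrative assumptions, not necessarily the training repo's actual config class.

```python
from dataclasses import dataclass

@dataclass
class MobileLLMConfig:
    # Values taken from the architecture table above; field names are illustrative.
    n_layers: int = 6           # decoder blocks
    n_heads: int = 8            # query heads per attention layer
    n_kv_heads: int = 4         # shared K/V heads (grouped-query attention)
    d_model: int = 256          # embedding dimension (head_dim = 256 / 8 = 32)
    context_length: int = 512   # maximum sequence length
    vocab_size: int = 50_304    # appears to be GPT-2's 50,257 padded to a multiple of 64
    norm_eps: float = 1e-6      # RMSNorm epsilon (assumed, not stated in the card)
```

Padding the vocabulary up to a multiple of 64 is a common GPU-throughput trick; that this is the reason for 50,304 here is an assumption.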

### 🔹 Key Optimizations
- **RMSNorm** → Improves training stability over LayerNorm (minimal sketch below).
- **Grouped-Query Attention** → 4 KV heads serve the 8 query heads, halving the KV cache: with head dim 32 in fp16, a full 512-token context needs ~1.5 MiB across the 6 layers instead of ~3 MiB.
- **Rotary Embeddings (RoPE)** → Better handling of long context windows.
- **`safetensors` checkpoints** → Faster & safer loading.
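
For reference, a minimal RMSNorm module in PyTorch; this is an illustrative implementation of the technique, not necessarily the repo's own code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features with a
    learned gain. Unlike LayerNorm, no mean-centering and no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rms(x) = sqrt(mean(x^2)); normalize, then apply the learned gain.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```

Dropping the mean subtraction and bias saves a little compute per layer and, in practice, tends to be at least as stable as LayerNorm.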

---

## 📊 Training Setup

| **Property**           | **Value** |
|------------------------|-----------|
| Dataset                | [FineWeb-MINI](https://huggingface.co/datasets/AryanNsc/FineWeb-Mini) |
| Tokens Trained         | ~100M |
| Optimizer              | AdamW |
| Learning Rate          | 6e-4 (cosine decay) |
| Warmup Steps           | 100 |
| Batch Size             | 64 × 2 gradient-accumulation steps |
| Effective Batch Size   | 128 |
| Mixed Precision        | `fp16` / `bf16` (auto-detect) |
| Distributed Training   | DDP |
| Logging                | Weights & Biases (`wandb`) |
| Checkpoint Format      | `.safetensors` |
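
For concreteness, a minimal sketch of how the optimizer, warmup, and cosine schedule from this table could be wired up in PyTorch. The `Linear` model is a stand-in for the real module, and `max_steps` is only estimated from the table (~100M tokens ÷ (128 sequences × 512 tokens) ≈ 1,500 optimizer steps); none of this is the repo's actual training loop.

```python
import math
import torch

model = torch.nn.Linear(256, 50_304)  # stand-in for the real MobileLLM module

peak_lr, warmup_steps = 6e-4, 100
max_steps = 1_500  # ~100M tokens / (128 seqs x 512 tokens per step), rounded

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 100 steps...
    if step < warmup_steps:
        return step / warmup_steps
    # ...then cosine decay from the peak learning rate down to zero.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# "auto-detect" mixed precision: prefer bf16 where the GPU supports it.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
```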

---

## 🧩 Model Checkpoints

| **Step** | **Filename** | **Format** |
|----------|------------|------------|
| Final    | `mobile_llm_final.safetensors` | safetensors |
| Intermediate | `checkpoints/mobile_llm_step_<step>.safetensors` | safetensors |
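
Checkpoints load with the `safetensors` library; the model class itself lives in the training repo, so instantiation is left as a hypothetical comment.

```python
from safetensors.torch import load_file

# Read the weights into an ordinary {name: tensor} state dict.
state_dict = load_file("mobile_llm_final.safetensors")

# Inspect what was saved.
for name, tensor in state_dict.items():
    print(f"{name}: {tuple(tensor.shape)} {tensor.dtype}")

# Hypothetical: build the model from the repo's own class, then load weights.
# model = MobileLLM(MobileLLMConfig())
# model.load_state_dict(state_dict)
```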

---

## 🔮 Roadmap
- [x] Train **MobileLLM** on **FineWeb-MINI**
- [x] Add **grouped-query attention**
- [x] Export **safetensors** checkpoints
- [ ] Quantized **int8** & **int4** inference
- [ ] Expand training on **FineWeb-1B**

---

## 📜 License
This model is licensed under the [MIT License](LICENSE).

---

## 🌐 Links
- **GitHub** → [MobileLLM training code](https://github.com/Guney-olu/Quantgpt)