---
license: mit
---
# LLM\_D3: A Sparse 350M Architecture Trained on 50B Tokens

This repository contains the implementation of **LLM\_D3**, a decoder-only Large Language Model trained from scratch on 50 billion tokens of the C4 English-only dataset. It features a modern, high-performance architecture optimized for efficiency, combining **Mixture of Experts (MoE)**, **Multi-head Latent Attention (MLA)**, and **Rotary Positional Embeddings (RoPE)**.

The model was trained in a single epoch, so no token is seen twice, which favors generalization over rote memorization. It reaches **33% zero-shot accuracy on HellaSwag**. After instruction fine-tuning, it can be used as an assistant for general reasoning and factual recall.

-----

## 📊 Model Statistics

| Metric | Value |
| :--- | :--- |
| **Total Parameters** | 358.74M |
| **Active Parameters** | 171.96M |
| **Sparsity Ratio** | 52.06% |
| **Training Data** | 50B Tokens (C4 English) |
| **Architecture** | MLA + Sparse MoE + RoPE |
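
The sparsity ratio in the table is presumably the share of parameters that stay inactive for any single token, i.e. `1 - active / total`. A minimal sketch under that assumption (the function name is illustrative and not taken from `check_params.py`):

```python
# Values copied from the table above, in millions of parameters.
TOTAL_PARAMS_M = 358.74
ACTIVE_PARAMS_M = 171.96


def sparsity_ratio(total: float, active: float) -> float:
    """Fraction of parameters that are NOT used for a given token."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected total > 0 and 0 <= active <= total")
    return 1.0 - active / total


if __name__ == "__main__":
    ratio = sparsity_ratio(TOTAL_PARAMS_M, ACTIVE_PARAMS_M)
    print(f"sparsity: {ratio:.2%}")
```

With the rounded table values this gives about 52.07%, which agrees with the reported 52.06% up to rounding of the inputs.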

-----

## Source Code

The scripts are available on GitHub: **firdavsus/LLM_D3**

## 🏗️ Architecture Details

The model utilizes a custom GPT implementation (`LLM_2.py`) with several key architectural innovations focused on compute efficiency and memory optimization.

### Multi-head Latent Attention (MLA)

To solve the memory bottleneck of the KV cache, LLM\_D3 implements **Multi-head Latent Attention**.

  * **Latent Compression**: Query and KV states are compressed into a lower-dimensional latent space before being up-projected for attention calculations.
  * **Cache Size**: This reduces the memory footprint of the KV cache during inference, with the goal of preserving the quality of standard Multi-Head Attention.
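
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes the cache stores only one compressed KV latent per token and layer, and it ignores any separate positional key that some MLA variants also cache. All dimensions except the 24 layers are hypothetical and are not the values used by LLM\_D3.

```python
def kv_cache_bytes_mha(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    """Standard MHA caches one key and one value vector per head, token and layer."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value


def kv_cache_bytes_latent(n_layers, latent_dim, seq_len, bytes_per_value=2):
    """A latent cache stores a single compressed KV vector per token and layer."""
    return n_layers * latent_dim * seq_len * bytes_per_value


# Hypothetical sizes for illustration; only N_LAYERS = 24 comes from this README.
N_LAYERS, N_HEADS, HEAD_DIM, LATENT_DIM, SEQ_LEN = 24, 16, 64, 256, 2048

if __name__ == "__main__":
    mha = kv_cache_bytes_mha(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
    latent = kv_cache_bytes_latent(N_LAYERS, LATENT_DIM, SEQ_LEN)
    print(f"MHA cache:    {mha / 2**20:.1f} MiB")
    print(f"latent cache: {latent / 2**20:.1f} MiB")
    print(f"reduction:    {mha / latent:.1f}x")
```

The saving scales with `2 * n_heads * head_dim / latent_dim`, so the benefit depends entirely on how small the latent dimension can be made without hurting quality.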

### Sparse Mixture of Experts (MoE)

LLM\_D3 uses a sparse MoE architecture for 19 out of its 24 layers.

  * **Expert Configuration**: Each MoE layer contains **6 experts**, with a **Top-2** routing mechanism active for every token.
  * **Hybrid Stability Sandwich**: For improved training stability, the **first 3 layers** and **last 2 layers** are standard dense MLP blocks rather than MoE layers.
  * **Routing**: Uses a noisy Top-K router with auxiliary load-balancing and router z-loss to prevent expert collapse and ensure balanced utilization across the 19 MoE blocks.
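
The layer layout and routing described above can be sketched in plain Python. This is a simplified illustration, not the code in `LLM_2.py`: the noise here is Gaussian with a fixed standard deviation (real noisy Top-K routers often learn the noise scale), and the loss formulas follow the commonly used forms, which the repository may weight or define differently. All function names are hypothetical.

```python
import math
import random

N_EXPERTS, TOP_K = 6, 2  # from the README: 6 experts, Top-2 routing


def layer_kinds(n_layers=24, dense_head=3, dense_tail=2):
    """Dense MLP in the first 3 and last 2 layers, MoE in between."""
    return ["dense" if i < dense_head or i >= n_layers - dense_tail else "moe"
            for i in range(n_layers)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_top_k(logits, k=TOP_K, noise_std=1.0, rng=random):
    """Return {expert_index: gate_weight} for the k largest noisy logits."""
    noisy = [x + rng.gauss(0.0, noise_std) for x in logits]
    chosen = sorted(range(len(noisy)), key=noisy.__getitem__, reverse=True)[:k]
    gates = softmax([noisy[i] for i in chosen])  # renormalize over chosen experts
    return dict(zip(chosen, gates))


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits; discourages large logits."""
    def log_sum_exp(row):
        peak = max(row)
        return peak + math.log(sum(math.exp(x - peak) for x in row))
    return sum(log_sum_exp(row) ** 2 for row in batch_logits) / len(batch_logits)


def load_balance_loss(batch_logits, assignments, n_experts=N_EXPERTS):
    """n_experts * sum_e(f_e * p_e); equals 1.0 when routing is perfectly uniform."""
    n_slots = sum(len(a) for a in assignments)
    frac = [sum(e in a for a in assignments) / n_slots for e in range(n_experts)]
    probs = [softmax(row) for row in batch_logits]
    mean_p = [sum(p[e] for p in probs) / len(probs) for e in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac, mean_p))


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(N_EXPERTS)] for _ in range(8)]
    routes = [noisy_top_k(row, rng=rng) for row in batch]
    print(layer_kinds().count("moe"), "MoE layers")
    print("z-loss:", router_z_loss(batch))
    print("balance loss:", load_balance_loss(batch, routes))
```

The balance loss is smallest when both the routed fractions and the mean router probabilities are uniform, which is what pushes the router away from collapsing onto a few experts.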

### Positional Encoding

  * **RoPE**: Rotary Positional Embeddings encode position by rotating query and key vectors, so attention scores depend on the relative offset between tokens. This typically handles long-range dependencies better than traditional learned absolute embeddings.
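
A small pure-Python sketch of the rotation may help. It assumes the common convention of rotating consecutive dimension pairs with base 10000; the pairing scheme and base actually used in `LLM_2.py` are not stated here, so treat both as assumptions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.1, 0.4, -0.6, 0.9]
    # Same relative offset (4) at two different absolute positions.
    print(dot(rope(q, 3), rope(k, 7)))
    print(dot(rope(q, 10), rope(k, 14)))
```

Both printed scores agree up to floating-point error, because the query-key dot product after rotation depends only on the distance between the two positions.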

-----

## 📈 Training & Evaluation

### Pre-training Setup

  * **Policy**: Single-epoch pass on 50B tokens (no repetition) to prioritize feature extraction and generalization.
  * **Batch Size**: 1M tokens effective batch size for high gradient stability.
  * **Schedule**: Warmup-Stable-Decay (WSD) / Stepped Cosine Decay with a 1,000-step warmup.
  * **Optimizer**: AdamW with hardware-optimized settings.
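
The step count follows directly from the numbers above: 50B tokens at 1M tokens per step is 50,000 optimizer steps. The schedule can then be sketched as below. Only the 1,000-step warmup and the total step count come from this README; the peak and minimum learning rates, the length of the decay phase, and the cosine shape of the decay are assumptions for illustration, not the settings in `train.py`.

```python
import math

TOTAL_TOKENS = 50_000_000_000
TOKENS_PER_STEP = 1_000_000
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP  # 50,000 steps


def wsd_lr(step, total_steps=TOTAL_STEPS, warmup_steps=1_000, decay_steps=5_000,
           peak_lr=3e-4, min_lr=3e-5):
    """Warmup-Stable-Decay: linear warmup, flat plateau, cosine-shaped decay."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    if step < decay_start:
        return peak_lr  # stable phase
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 999, 1_000, 25_000, 45_000, 47_500, 50_000):
        print(f"step {s:>6}: lr = {wsd_lr(s):.2e}")
```

Keeping the learning rate flat for most of training makes it possible to branch off a decay phase from any checkpoint, which is one common reason for choosing WSD over a single long cosine schedule.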

### Benchmarks

| Benchmark | Setting | Score |
| :--- | :--- | :--- |
| **HellaSwag** | Zero-shot | **33%** |
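
For context, HellaSwag offers four candidate endings per example, so random guessing scores 25%. Zero-shot evaluation is usually done by scoring each ending with the model and picking the most likely one. The sketch below assumes length-normalized log-likelihood scoring, which is one common choice; `eval.py` may score differently, and the numbers in the demo are made up.

```python
def pick_ending(ending_logprobs):
    """Index of the ending with the highest mean per-token log-probability."""
    scores = [sum(lps) / len(lps) for lps in ending_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples):
    """examples: iterable of (ending_logprobs, gold_index) pairs."""
    examples = list(examples)
    correct = sum(pick_ending(lps) == gold for lps, gold in examples)
    return correct / len(examples)


if __name__ == "__main__":
    # Made-up per-token log-probabilities for the four endings of two examples.
    toy = [
        ([[-2.0, -2.0], [-0.5, -0.7, -0.6], [-3.0], [-1.0, -1.5]], 1),
        ([[-0.2, -0.4], [-1.0, -1.0], [-2.5, -0.1], [-0.9]], 2),
    ]
    print(f"accuracy: {accuracy(toy):.0%}")
```

Normalizing by length keeps longer endings from being penalized simply for containing more tokens.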

### Fine-tuning

Fine-tuned on the `alpaca-cleaned` dataset using an Instruction-Input-Response format.

  * **Strengths**: Strong general reasoning, factual consistency, and instruction adherence.
  * **Known Limitations**: The model currently struggles with complex arithmetic. Additionally, an initialization anomaly in the final 2 layers resulted in a signal spike at the end of the network; while the model remains functional and capable, this is a known area for future refinement.

-----

## 🖼️ Visualizations

### Pre-Training Curves
![Pre-Training](training_curves_with_eval.png)

*50k steps on a 50B token corpus with 1M token effective batch size.*

### Diagnostics & Utilization
![Model-analysis](full_diagnostics.png)
![Model-analysis](weight_histograms.png)

*Weight distributions and expert utilization. Current routing shows a healthy balance, with utilization under 33%.*

-----

## 🛠️ Usage

### Inference

Interact with the model using the `test.py` script, which includes Top-K, Top-P, and repetition penalty sampling.

```bash
python test.py
```
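
For readers unfamiliar with the sampling options mentioned above, here is a minimal pure-Python sketch of how repetition penalty, Top-K, and Top-P are typically combined for one decoding step. The function name, parameter names, and defaults are hypothetical and are not taken from `test.py`.

```python
import math
import random


def sample_next(logits, generated, top_k=50, top_p=0.9, rep_penalty=1.2,
                temperature=1.0, rng=random):
    """Sample one token id after repetition penalty, top-k and top-p filtering."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for t in set(generated):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]
    # Softmax over the survivors, with temperature.
    peak = logits[order[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for i, p in zip(order, probs):
        kept_ids.append(i)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.5, -1.0, 3.0]  # made-up scores for a 5-token vocabulary
    print([sample_next(toy_logits, generated=[4], rng=rng) for _ in range(10)])
```

Setting `top_k=1` reduces this to greedy decoding, which is a convenient way to check that the filtering logic behaves as expected.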

### Fine-tuning

To replicate the instruction tuning on your own dataset:

1.  Format your data following the Alpaca template in `fine_tune.py`.
2.  Execute:

```bash
python fine_tune.py
```
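
For step 1, the sketch below shows the prompt template widely used with Alpaca-style datasets, applied to records with `instruction`, `input`, and `output` fields. It is an assumption that `fine_tune.py` uses exactly this wording; check the template in that file before formatting your data. The helper name is hypothetical.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_example(record):
    """Build (prompt, target) strings from an Alpaca-style record."""
    has_input = bool(record.get("input", "").strip())
    template = PROMPT_WITH_INPUT if has_input else PROMPT_NO_INPUT
    prompt = template.format(instruction=record["instruction"],
                             input=record.get("input", ""))
    return prompt, record["output"]


if __name__ == "__main__":
    prompt, target = format_example({
        "instruction": "Translate the sentence to French.",
        "input": "Good morning.",
        "output": "Bonjour.",
    })
    print(prompt + target)
```

Records with an empty `input` field fall back to the shorter template, so the model is not shown an empty `### Input:` section.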

-----

## 📂 Repository Structure

  * `LLM_2.py`: Core architecture (MLA, MoE, RoPE).
  * `train.py`: Pre-training logic and WSD scheduler.
  * `fine_tune.py`: Instruction tuning implementation.
  * `manager.py`: MoE auxiliary loss tracking.
  * `check_params.py`: Active vs. total parameter counter.
  * `eval.py`: HellaSwag evaluation suite.
  * `analysis.py` / `show.py`: Diagnostic and visualization tools.

-----

*Note: This model was developed as a research exploration into efficient sparse architectures. Verify all mathematical outputs manually.*

### References

  * [nanoMoE Implementation](https://github.com/avm-avm/nanoMoE)
  * [MLA Implementation Guide](https://medium.com/@atulit23/implementing-multi-head-latent-attention-from-scratch-in-python-1e14d03fbc91)
  * [DeepSeek-V3 Research (MoE/MLA Foundations)](https://arxiv.org/abs/2412.19437)