--- license: mit --- # LLM\_D3: A Sparse 350M Architecture Trained on 50B Tokens This repository contains the implementation of **LLM\_D3**, a decoder-only Large Language Model trained from scratch on 50 billion tokens of the C4 English-only dataset. It features a modern, high-performance architecture optimized for efficiency, combining **Mixture of Experts (MoE)**, **Multi-head Latent Attention (MLA)**, and **Rotary Positional Embeddings (RoPE)**. Designed for genuine generalization over rote memorization, the model was trained using a single-epoch pass, achieving a **33% zero-shot HellaSwag** score. Following instruction fine-tuning, it serves as a capable assistant with strong general reasoning and factual recall. ----- ## 📊 Model Statistics | Metric | Value | | :--- | :--- | | **Total Parameters** | 358.74M | | **Active Parameters** | 171.96M | | **Sparsity Ratio** | 52.06% | | **Training Data** | 50B Tokens (C4 English) | | **Architecture** | MLA + Sparse MoE + RoPE | ----- ## for the scipt **github: firdavsus/LLM_D3** ## 🏗️ Architecture Details The model utilizes a custom GPT implementation (`LLM_2.py`) with several key architectural innovations focused on compute efficiency and memory optimization. ### Multi-head Latent Attention (MLA) To solve the memory bottleneck of the KV cache, LLM\_D3 implements **Multi-head Latent Attention**. * **Latent Compression**: Query and KV states are compressed into a lower-dimensional latent space before being up-projected for attention calculations. * **Throughput**: This reduces the memory footprint of the KV cache during inference while maintaining the performance of standard Multi-Head Attention. ### Sparse Mixture of Experts (MoE) LLM\_D3 uses a sparse MoE architecture for 19 out of its 24 layers. * **Expert Configuration**: Each MoE layer contains **6 experts**, with a **Top-2** routing mechanism active for every token. * **Hybrid Stability Sandwich**: For improved training stability, the **first 3 layers** and **last 2 layers** are initialized as standard dense MLP blocks rather than MoE layers. * **Routing**: Uses a noisy Top-K router with auxiliary load-balancing and router z-loss to prevent expert collapse and ensure balanced utilization across the 19 MoE blocks. ### Positional Encoding * **RoPE**: Rotary Positional Embeddings are applied to ensure better handling of long-range dependencies and superior sequence positioning compared to traditional learned embeddings. ----- ## 📈 Training & Evaluation ### Pre-training Setup * **Policy**: Single-epoch pass on 50B tokens (no repetition) to prioritize feature extraction and generalization. * **Batch Size**: 1M tokens effective batch size for high gradient stability. * **Schedule**: Warmup-Stable-Decay (WSD) / Stepped Cosine Decay with a 1,000-step warmup. * **Optimizer**: AdamW with hardware-optimized settings. ### Benchmarks | Benchmark | Setting | Score | | :--- | :--- | :--- | | **HellaSwag** | Zero-shot | **33%** | ### Fine-tuning Fine-tuned on the `alpaca-cleaned` dataset using an Instruction-Input-Response format. * **Strengths**: Strong general reasoning, factual consistency, and instruction adherence. * **Known Limitations**: The model currently struggles with complex arithmetic. Additionally, an initialization anomaly in the final 2 layers resulted in a signal spike at the end of the network; while the model remains functional and capable, this is a known area for future refinement. ----- ## 🖼️ Visualizations ### Pre-Training Curves ![Pre-Training](training_curves_with_eval.png) *50k steps on a 50B token corpus with 1M token effective batch size.* ### Diagnostics & Utilization ![Model-analysis](full_diagnostics.png) ![Model-analysis](weight_histograms.png) *Visualizing weight distribution and expert utilization. Current routing shows healthy balance with utilization under 33%.* ----- ## 🛠️ Usage ### Inference Interact with the model using the `test.py` script, which includes Top-K, Top-P, and repetition penalty sampling. ```bash python test.py ``` ### Fine-tuning To replicate the instruction tuning on your own dataset: 1. Format your data following the Alpaca template in `fine_tune.py`. 2. Execute: ```bash python fine_tune.py ``` ----- ## 📂 Repository Structure * `LLM_2.py`: Core architecture (MLA, MoE, RoPE). * `train.py`: Pre-training logic and WSD scheduler. * `fine_tune.py`: Instruction tuning implementation. * `manager.py`: MoE auxiliary loss tracking. * `check_params.py`: Active vs. total parameter counter. * `eval.py`: HellaSwag evaluation suite. * `analysis.py` / `show.py`: Diagnostic and visualization tools. ----- *Note: This model was developed as a research exploration into efficient sparse architectures. Verify all mathematical outputs manually.* ### References * [nanoMoE Implementation](https://www.google.com/search?q=https://github.com/avm-avm/nanoMoE) * [MLA Implementation Guide](https://medium.com/@atulit23/implementing-multi-head-latent-attention-from-scratch-in-python-1e14d03fbc91) * [DeepSeek-V3 Research (MoE/MLA Foundations)](https://arxiv.org/abs/2412.19437)