---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---

# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final, refined instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.

The objective of this phase is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability

## Model Summary

| Property | Value |
|--------|------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |

## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf

### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduces **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf

### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

Used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf

### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

Focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf

## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture** (a toy illustration of the two strategies is sketched below).
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.
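As a reference point for the NoPE / RoPE comparison above, the snippet below is a minimal, self-contained sketch of the two positional strategies: rotating queries and keys with rotary embeddings (RoPE) versus leaving them untouched (NoPE). It is an illustration only; the function names, tensor shapes, and hyperparameters are illustrative and do not reflect the actual SlimMoE implementation.

```python
# Toy sketch (not the SlimMoE implementation): RoPE rotates query/key pairs by
# position-dependent angles before attention; NoPE simply skips that step.
import torch


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, num_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def attention_scores(q: torch.Tensor, k: torch.Tensor, use_rope: bool) -> torch.Tensor:
    """Scaled dot-product scores with either RoPE (rotated q/k) or NoPE (unmodified q/k)."""
    if use_rope:
        q, k = rope(q), rope(k)
    # (seq, heads, dim) x (seq, heads, dim) -> (heads, q_seq, k_seq)
    return torch.einsum("qhd,khd->hqk", q, k) / (q.shape[-1] ** 0.5)


q = torch.randn(8, 12, 64)  # toy shapes: 8 tokens, 12 heads, head_dim 64
k = torch.randn(8, 12, 64)
print(attention_scores(q, k, use_rope=True).shape)   # RoPE variant
print(attention_scores(q, k, use_rope=False).shape)  # NoPE variant
```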
**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**

## Known Issues & Constraints

- **Dataset limitations**: limited diversity and scale compared to large foundation models
- **GPU constraints**: training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations** during training
- **No RLHF applied**
- **English-centric data distribution**

These factors directly influenced training duration and final model behavior.

## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for the logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **The Smol Training Playbook: The Secrets to Building World-Class LLMs**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow. Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.

## Contact

Please use the Hugging Face **Discussions** tab to connect.
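## Example Usage (Untested Sketch)

For exploratory evaluation along the lines described under Intended Use, the following is a minimal, untested loading sketch. It assumes the repository ships the custom `SlimMoEForCausalLM` code and can be loaded via `trust_remote_code=True`; the repository id and plain-text prompt shown are assumptions, as this card does not specify a chat template.

```python
# Untested sketch: assumes the repo exposes the custom SlimMoEForCausalLM
# architecture through trust_remote_code and that the repo id below is correct.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SlimFactoryHub/SlimMoE-250M-SFT-instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Explain mixture-of-experts routing in two sentences."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```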