---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---

# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final refined instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.
The objective of this phase is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability

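The routing behavior studied here builds on standard top-k gating, which can be sketched in plain Python. This illustrates generic top-k softmax routing over 4 experts; the specifics of SlimMoE's "Adaptive MoE Routing" are not documented in this card, so the gate logits and top-k choice below are illustrative assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts,
    with weights renormalized over the selected experts."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in ranked)
    return [(i, probs[i] / total) for i in ranked]

# One token's router logits over 4 experts (hypothetical values):
choices = route_token([2.0, 0.5, 1.0, -1.0], top_k=2)
```

Each token's output is then the weighted sum of the selected experts' outputs, so only `top_k` of the 4 expert FFNs run per token.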
## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |

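A back-of-the-envelope estimate from the table above lands in the same ballpark as the reported 250M. This sketch assumes standard transformer shapes (Q/K/V/output projections, two weight matrices per expert FFN, tied embeddings); the real SlimMoE layout may differ, and biases, norms, or an untied LM head would shift the total:

```python
# Rough parameter estimate from the Model Summary table (assumed shapes).
vocab, d, layers, ffn, experts = 50257, 768, 16, 1536, 4

embedding = vocab * d                  # token embeddings (assumed tied with LM head)
attn_per_layer = 4 * d * d             # Q, K, V, and output projections
moe_per_layer = experts * 2 * d * ffn  # up + down projection per expert
router_per_layer = d * experts         # gating network
total = embedding + layers * (attn_per_layer + moe_per_layer + router_per_layer)

print(f"~{total / 1e6:.0f}M parameters")
```

The estimate comes out near 227M, consistent with the headline 250M figure once implementation details (weight tying, biases, normalization layers) are accounted for.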
## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf

### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduced **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf

### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

This phase was used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf

### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

This phase focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf

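As a sanity check on the pretraining figures above, the implied average throughput on the single GPU follows directly from the token count and wall-clock duration (an approximation that ignores checkpointing, evaluation, and any restarts):

```python
# Implied average pretraining throughput from the reported figures.
tokens = 5.2e9                   # tokens seen during pretraining
seconds = (7 * 24 + 16) * 3600   # 7 days 16 hours of wall-clock time
throughput = tokens / seconds    # average tokens processed per second

print(f"~{throughput:,.0f} tokens/s")
```

That works out to roughly 7,850 tokens/s on average, a plausible rate for a 250M-parameter model in float32 on a single A100.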
## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**

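For reference, the two positional strategies compared above differ only in whether query/key vectors are transformed before attention: NoPE applies no transformation at all, while RoPE rotates consecutive dimension pairs by position-dependent angles. A minimal sketch of the generic RoPE rotation (not SlimMoE's exact implementation):

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to one head vector.
    Consecutive dimension pairs (2i, 2i+1) are rotated by
    angle = position / base**(i / len(vec)). Under NoPE the
    vector would simply be used unchanged."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

q = [1.0, 0.0, 1.0, 0.0]
q0 = rope_rotate(q, position=0)  # position 0 leaves the vector unchanged
q5 = rope_rotate(q, position=5)
```

Because each pair is rotated (a norm-preserving operation), query-key dot products depend only on the *relative* distance between positions, which is the property that makes RoPE attractive to compare against having no positional signal at all.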
## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations**: Training loss fluctuated across phases rather than decreasing smoothly
- **No RLHF applied**: The model has not undergone preference-based alignment
- **English-centric data distribution**: Performance in other languages is not guaranteed

These factors directly influenced training duration and final model behavior.

## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for the logging and visualization tools used to monitor training progress.

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.


## Contact

Please use the Hugging Face **Discussions** tab to connect.