Update README.md
README.md (CHANGED)
Previous front matter (removed): `license: apache-2.0`; `datasets: [HuggingFaceFW/finewiki]`; `language: [en]`; `pipeline_tag: text-generation`. The updated README.md follows.
---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# SlimMoE-250M

**SlimMoE-250M** is a 250M-parameter Mixture-of-Experts (MoE) language model developed by the **SlimFactory team**. The model was trained to **experiment with VGQA-style attention mechanisms and NoPE/RoPE positional strategies in a small-parameter MoE setting**, focusing on architectural feasibility and training stability rather than scale or benchmark maximization.
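As a quick orientation before the details below, here is a minimal sketch of how a checkpoint like this is typically loaded with the Transformers library. The repository id `SlimFactory/SlimMoE-250M` and the need for `trust_remote_code=True` (because `SlimMoEForCausalLM` is a custom architecture) are assumptions for illustration, not confirmed details of this release.

```python
# Illustrative only: the repo id and the need for trust_remote_code are assumptions,
# not confirmed details of this release.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SlimFactory/SlimMoE-250M"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Explain what a Mixture-of-Experts layer does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding keeps the example deterministic; sampling settings are up to the user.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```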
## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support VGQA-style attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
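The card does not spell out the rule behind the **Adaptive MoE Routing** entry above, so the sketch below only illustrates the generic shape of a learned top-k token router over 4 experts with the hidden size from the table. The class name, `top_k` value, and the absence of any auxiliary load-balancing loss are assumptions, not the model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Generic learned token router (illustrative; not SlimMoE's actual routing code)."""

    def __init__(self, hidden_size: int = 768, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.gate(hidden_states)                       # (batch, seq, num_experts)
        probs = F.softmax(logits, dim=-1)
        weights, expert_ids = probs.topk(self.top_k, dim=-1)    # pick the k most likely experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the chosen experts
        return weights, expert_ids

router = TopKRouter()
tokens = torch.randn(1, 8, 768)
weights, expert_ids = router(tokens)
print(weights.shape, expert_ids.shape)  # torch.Size([1, 8, 2]) torch.Size([1, 8, 2])
```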
## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
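For reference, the pretraining corpus named above can be pulled with the `datasets` library; the sketch below streams it rather than downloading it in full. Note that `sample-10BT` is exposed as a dataset configuration (with a `train` split) in the `datasets` API, and the tokenization/filtering actually used for pretraining is not reproduced here.

```python
# Minimal sketch: stream the FineWeb-Edu "sample-10BT" subset instead of downloading it.
# Any preprocessing beyond this is an assumption, not the exact pretraining pipeline.
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",   # exposed as a configuration name in the datasets library
    split="train",
    streaming=True,
)

for i, example in enumerate(stream):
    print(example["text"][:200])  # each record carries the document text in the "text" field
    if i == 2:
        break
```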
### Fine-Tuning Phase-1 (SFT – VGQA / Instruction)

This stage introduces **VGQA-style instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
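The `train_sft` split of UltraChat 200K stores each multi-turn conversation as a `messages` list of role/content dicts. The sketch below shows one plausible way to flatten a conversation into training text; the simple role-tag format is an assumption, not the chat template actually used for SlimMoE-250M.

```python
from datasets import load_dataset

sft = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def to_training_text(example):
    # Flatten the role/content turns into plain text with simple role tags.
    # The "<|user|>" / "<|assistant|>" markers are illustrative, not SlimMoE's real template.
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in example["messages"]]
    return {"text": "\n".join(parts)}

sft = sft.map(to_training_text, remove_columns=sft.column_names)
print(sft[0]["text"][:300])
```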
### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

This stage was used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
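Each record in MMLU's `auxiliary_train` split is a multiple-choice question with a `choices` list and an integer `answer` index. Below is a minimal sketch of turning one record into a prompt/target pair; the lettered prompt format is an assumption, and the `all` configuration is used here because that is where the split is exposed in the `datasets` library.

```python
from datasets import load_dataset

aux = load_dataset("cais/mmlu", "all", split="auxiliary_train")

def to_prompt_and_target(example):
    # Render the question with lettered choices; the exact prompt wording is illustrative.
    letters = ["A", "B", "C", "D"]
    lines = [example["question"]] + [f"{l}. {c}" for l, c in zip(letters, example["choices"])]
    return {
        "prompt": "\n".join(lines) + "\nAnswer:",
        "target": letters[example["answer"]],
    }

print(to_prompt_and_target(aux[0]))
```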
### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

This stage focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**
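Since the card does not define the VGQA mechanism in detail, the sketch below only illustrates the NoPE/RoPE half of the comparison: a single grouped-query-style attention step where rotary embeddings can be toggled on (RoPE) or off (NoPE). Every name, shape, and the number of key/value heads is an assumption for illustration, not the SlimMoE implementation.

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim). Simplified rotary embedding for illustration only.
    b, h, s, d = x.shape
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def grouped_query_attention(q, k, v, use_rope: bool = True):
    # q: (batch, q_heads, seq, head_dim); k, v: (batch, kv_heads, seq, head_dim),
    # with q_heads divisible by kv_heads.
    if use_rope:                       # RoPE branch; skipping it gives the "NoPE" setting
        q, k = apply_rope(q), apply_rope(k)
    repeat = q.shape[1] // k.shape[1]  # share each key/value head across a group of query heads
    k = k.repeat_interleave(repeat, dim=1)
    v = v.repeat_interleave(repeat, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 12, 16, 64)   # 12 query heads, matching the Model Summary table
k = torch.randn(1, 4, 16, 64)    # 4 key/value heads is an assumption for the example
v = torch.randn(1, 4, 16, 64)
out_rope = grouped_query_attention(q, k, v, use_rope=True)
out_nope = grouped_query_attention(q, k, v, use_rope=False)
print(out_rope.shape, out_nope.shape)  # torch.Size([1, 12, 16, 64]) in both cases
```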
## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **No RLHF applied**
- **English-centric data distribution**

These factors directly influenced training duration and final model behavior.
## Intended Use

This model is released **strictly for research and experimental purposes**, such as:

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

**Not intended for production use.**
## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
## Contact

Please use the Hugging Face **Discussions** tab of this repository to get in touch.