Update README.md

README.md CHANGED

@@ -6,6 +6,8 @@ tags:
 - Text-Generation
 - Instruction Following
 - VGQA
+- Research
+- SLM
 datasets:
 - HuggingFaceFW/fineweb-edu
 - HuggingFaceH4/ultrachat_200k

@@ -25,7 +27,7 @@ library_name: transformers

 This work explores the following research question:

-> **Can a small (<500M) MoE model effectively support
+> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

 SlimMoE-250M was designed to study:

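For orientation, the mechanical difference between the two positional-encoding regimes named in the card is small: RoPE rotates query/key channels by a position-dependent angle before attention, while NoPE skips that step entirely. A minimal PyTorch sketch, illustrative only and not the SlimMoE implementation:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x: (batch, seq, heads, head_dim)."""
    b, t, h, d = x.shape
    # One rotation frequency per channel pair, as in the RoPE paper.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 16, 4, 64)  # (batch, seq, heads, head_dim)
q_rope = rope(q)               # RoPE: positions injected by rotation
q_nope = q                     # NoPE: no positional signal beyond the causal mask
```

Under NoPE the model has to infer token order from the causal mask alone, which is exactly the behavior the experiments below probe.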

@@ -65,17 +67,20 @@ This phase focused on **general language modeling** using high-quality education
 - **Split**: `sample-10BT`
 - **Tokens Used**: **5.2B**
 - **Duration**: **7 days 16 hours**
-- **GPU**: **48GB NVIDIA A100**
+- **GPU**: **48GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf

-### Fine-Tuning Phase-1 (SFT –
+### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

-This stage introduces **
+This stage introduces **instruction supervision** and conversational alignment.

 - **Dataset**: HuggingFaceH4/ultrachat_200k
 - **Split**: `train_sft`
 - **Duration**: **8 days 8 hours**
-- **GPU**: **80GB NVIDIA A100**
+- **GPU**: **80GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
+

 ### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

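Both corpora named in these two phases are public, so the data side of the recipe can be reproduced with the `datasets` library using the subset and split names listed above. A sketch, with one assumption: streaming is used here only to avoid materializing the full `sample-10BT` subset on disk, and the card does not say how the authors actually loaded it.

```python
from datasets import load_dataset

# Pre-training corpus: FineWeb-Edu, `sample-10BT` subset (streamed).
pretrain = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# SFT Phase-1 corpus: UltraChat 200K, `train_sft` split.
sft_v1 = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

print(next(iter(pretrain))["text"][:200])  # raw educational web text
print(sft_v1[0]["messages"][:2])           # chat turns: {"role", "content"}
```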

@@ -84,7 +89,9 @@ Used to improve **domain knowledge and reasoning performance**.
 - **Dataset**: cais/mmlu
 - **Split**: `auxiliary_train`
 - **Duration**: **8 days 11 hours**
-- **GPU**: **48GB NVIDIA A100**
+- **GPU**: **48GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
+

 ### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

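MMLU's `auxiliary_train` split is multiple-choice rather than conversational, so each row has to be rendered into text before it can serve as SFT data. The card does not say how this was done; the formatter below is a hypothetical example, and passing the `all` config to the `cais/mmlu` loader is an assumption.

```python
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="auxiliary_train")
LETTERS = "ABCD"

def to_sft_text(row: dict) -> str:
    """Render one multiple-choice row as a plain question/answer pair."""
    choices = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"]))
    return f"Question: {row['question']}\n{choices}\nAnswer: {LETTERS[row['answer']]}"

print(to_sft_text(mmlu[0]))
```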

@@ -92,7 +99,8 @@ Focused on **response quality, instruction clarity, and consistency**.

 - **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
 - **Duration**: **5 days 1 hour**
-- **GPU**: **48GB NVIDIA A100**
+- **GPU**: **48GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf


 ## VGQA & Positional Encoding Experiments

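The card never defines VGQA. If, as the name suggests, it is a variant of grouped-query attention (GQA), the baseline mechanism it would modify looks like the sketch below; this is plain GQA for reference, not the model's actual attention code.

```python
import torch
import torch.nn.functional as F

def gqa(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Grouped-query attention: many query heads share a few K/V heads.

    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    """
    group = q.shape[1] // k.shape[1]
    # Duplicate each K/V head across the query heads in its group.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 8, 16, 32)   # 8 query heads
k = torch.randn(1, 2, 16, 32)   # 2 shared key/value heads
v = torch.randn(1, 2, 16, 32)
out = gqa(q, k, v)              # (1, 8, 16, 32)
```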

@@ -106,7 +114,8 @@
 ## Known Issues & Constraints

 - **Dataset limitations**: Limited diversity and scale compared to large foundation models
-- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
+- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
+- **Loss fluctuations**
 - **No RLHF applied**
 - **English-centric data distribution**


@@ -115,17 +124,12 @@ These factors directly influenced training duration and final model behavior.

 ## Intended Use

-This model is released **strictly for research and experimental purposes**.

 - Studying **small-scale MoE architectures**
 - Exploring **VGQA-style attention mechanisms**
 - Evaluating **NoPE / RoPE behavior in MoE models**
 - Educational and exploratory research

-**Not intended for production use.**
-
-
-

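In that research-only spirit, a hedged loading sketch. The repo id is taken from the Phase-3 Training Logs link above; whether the custom MoE/VGQA architecture requires `trust_remote_code=True` is an assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "SlimFactoryHub/SlimMoE-250M-instruct"  # id inferred from the logs URL
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code is assumed necessary for the custom MoE blocks.
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tokenizer("Explain mixture-of-experts routing in one sentence.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```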

 ## Acknowledgements

@@ -136,10 +140,11 @@ We would like to thank the dataset providers and the open-source community whose
 - **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
 - **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
 - **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
+- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
+

 We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.


-
 ## Contact
 Please use the Hugging Face **Discussions** tab to connect.
|
| 6 |
- Text-Generation
|
| 7 |
- Instruction Following
|
| 8 |
- VGQA
|
| 9 |
+
- Research
|
| 10 |
+
- SLM
|
| 11 |
datasets:
|
| 12 |
- HuggingFaceFW/fineweb-edu
|
| 13 |
- HuggingFaceH4/ultrachat_200k
|
|
|
|
| 27 |
|
| 28 |
This work explores the following research question:
|
| 29 |
|
| 30 |
+
> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**
|
| 31 |
|
| 32 |
SlimMoE-250M was designed to study:
|
| 33 |
|
|
|
|
| 67 |
- **Split**: `sample-10BT`
|
| 68 |
- **Tokens Used**: **5.2B**
|
| 69 |
- **Duration**: **7 days 16 hours**
|
| 70 |
+
- **GPU**: **48GB NVIDIA A100**
|
| 71 |
+
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
|
| 72 |
|
| 73 |
|
| 74 |
+
### Fine-Tuning Phase-1 (SFT – Instruction Tuning)
|
| 75 |
|
| 76 |
+
This stage introduces **instruction supervision** and conversational alignment.
|
| 77 |
|
| 78 |
- **Dataset**: HuggingFaceH4/ultrachat_200k
|
| 79 |
- **Split**: `train_sft`
|
| 80 |
- **Duration**: **8 days 8 hours**
|
| 81 |
+
- **GPU**: **80GB NVIDIA A100**
|
| 82 |
+
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
|
| 83 |
+
|
| 84 |
|
| 85 |
### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)
|
| 86 |
|
|
|
|
| 89 |
- **Dataset**: cais/mmlu
|
| 90 |
- **Split**: `auxiliary_train`
|
| 91 |
- **Duration**: **8 days 11 hours**
|
| 92 |
+
- **GPU**: **48GB NVIDIA A100**
|
| 93 |
+
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
|
| 94 |
+
|
| 95 |
|
| 96 |
### Fine-Tuning Phase-3 (SFT – Instruction Refinement)
|
| 97 |
|
|
|
|
| 99 |
|
| 100 |
- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
|
| 101 |
- **Duration**: **5 days 1 hour**
|
| 102 |
+
- **GPU**: **48GB NVIDIA A100**
|
| 103 |
+
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
|
| 104 |
|
| 105 |
|
| 106 |
## VGQA & Positional Encoding Experiments
|
|
|
|
| 114 |
## Known Issues & Constraints
|
| 115 |
|
| 116 |
- **Dataset limitations**: Limited diversity and scale compared to large foundation models
|
| 117 |
+
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
|
| 118 |
+
- **Loss fluctuations**
|
| 119 |
- **No RLHF applied**
|
| 120 |
- **English-centric data distribution**
|
| 121 |
|
|
|
|
| 124 |
|
| 125 |
## Intended Use
|
| 126 |
|
|
|
|
| 127 |
|
| 128 |
- Studying **small-scale MoE architectures**
|
| 129 |
- Exploring **VGQA-style attention mechanisms**
|
| 130 |
- Evaluating **NoPE / RoPE behavior in MoE models**
|
| 131 |
- Educational and exploratory research
|
| 132 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 133 |
|
| 134 |
## Acknowledgements
|
| 135 |
|
|
|
|
| 140 |
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
|
| 141 |
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
|
| 142 |
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
|
| 143 |
+
- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
|
| 144 |
+
|
| 145 |
|
| 146 |
We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
|
| 147 |
|
| 148 |
|
|
|
|
| 149 |
## Contact
|
| 150 |
Please use the Hugging Face **Discussions** tab to connect.
|