---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---

# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final, refined instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.

The objective of this phase is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability

## Model Summary

| Property | Value |
|--------|------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |

## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf

### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduces **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf

### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

Used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf

### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

Focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf

## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture** (a toy illustration of the two strategies is sketched below).
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.
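As a reference point for the NoPE / RoPE comparison above, the snippet below is a minimal, self-contained sketch of the two positional strategies: rotating queries and keys with rotary embeddings (RoPE) versus leaving them untouched (NoPE). It is an illustration only; the function names, tensor shapes, and hyperparameters are illustrative and do not reflect the actual SlimMoE implementation.

```python
# Toy sketch (not the SlimMoE implementation): RoPE rotates query/key pairs by
# position-dependent angles before attention; NoPE simply skips that step.
import torch


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, num_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def attention_scores(q: torch.Tensor, k: torch.Tensor, use_rope: bool) -> torch.Tensor:
    """Scaled dot-product scores with either RoPE (rotated q/k) or NoPE (unmodified q/k)."""
    if use_rope:
        q, k = rope(q), rope(k)
    # (seq, heads, dim) x (seq, heads, dim) -> (heads, q_seq, k_seq)
    return torch.einsum("qhd,khd->hqk", q, k) / (q.shape[-1] ** 0.5)


q = torch.randn(8, 12, 64)  # toy shapes: 8 tokens, 12 heads, head_dim 64
k = torch.randn(8, 12, 64)
print(attention_scores(q, k, use_rope=True).shape)   # RoPE variant
print(attention_scores(q, k, use_rope=False).shape)  # NoPE variant
```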
**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**

## Known Issues & Constraints

- **Dataset limitations**: limited diversity and scale compared to large foundation models
- **GPU constraints**: training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations** during training
- **No RLHF applied**
- **English-centric data distribution**

These factors directly influenced training duration and final model behavior.

## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for the logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **The Smol Training Playbook: The Secrets to Building World-Class LLMs**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow. Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.

## Contact

Please use the Hugging Face **Discussions** tab to connect.
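## Example Usage (Untested Sketch)

For exploratory evaluation along the lines described under Intended Use, the following is a minimal, untested loading sketch. It assumes the repository ships the custom `SlimMoEForCausalLM` code and can be loaded via `trust_remote_code=True`; the repository id and plain-text prompt shown are assumptions, as this card does not specify a chat template.

```python
# Untested sketch: assumes the repo exposes the custom SlimMoEForCausalLM
# architecture through trust_remote_code and that the repo id below is correct.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SlimFactoryHub/SlimMoE-250M-SFT-instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Explain mixture-of-experts routing in two sentences."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```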