---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction-Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---
# SlimMoE-250M-SFT-instruct
**SlimMoE-250M-SFT-instruct** is the final, instruction-refined version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in the earlier training phases.
The objective of this phase is to produce a stable and well-aligned small MoE instruction model, suitable for research and experimental evaluation under limited data and compute constraints.
## Motivation
This work explores the following research question:
> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**
SlimMoE-250M was designed to study:
- MoE routing behavior at small scales (see the routing sketch after this list)
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
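The routing mechanism is listed only as **Adaptive MoE Routing** in the summary table below, so its exact algorithm is not documented in this card. As a reference point for the routing bullet above, the following is a minimal sketch of conventional top-k softmax routing over a small expert pool; the class name, `top_k=2`, and the expert MLP layout are illustrative assumptions, while the hidden size (768), FFN size (1536), and expert count (4) match the model summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k softmax routing over a small pool of FFN experts.

    Hypothetical sketch: SlimMoE's actual "Adaptive MoE Routing" is not
    documented in this card and may differ from this reference routing.
    """

    def __init__(self, hidden_size=768, ffn_size=1536, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (batch, seq, hidden)
        weights = F.softmax(self.router(x), dim=-1)        # (B, S, num_experts)
        top_w, top_i = weights.topk(self.top_k, dim=-1)    # keep the top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize kept weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[..., k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```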
## Model Summary
| Property | Value |
|--------|------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
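A minimal loading sketch with `transformers` is shown below. The repo id is inferred from the training-log links in this card, and `trust_remote_code=True` is assumed to be required for the custom `SlimMoEForCausalLM` class; adjust both if the hub layout differs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id inferred from the training-log links in this card (assumption).
model_id = "SlimFactoryHub/SlimMoE-250M-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # the model was trained in float32 (see table)
    trust_remote_code=True,     # assumed necessary for SlimMoEForCausalLM
)

prompt = "Explain mixture-of-experts routing in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```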
## Training Details
### Pretraining
This phase focused on **general language modeling** using high-quality educational data.
- **Dataset**: HuggingFaceFW/fineweb-edu
- **Subset**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
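For reference, the subset can be streamed with the `datasets` library as sketched below; the tokenization, packing, and filtering actually used during pretraining are not documented in this card.

```python
from datasets import load_dataset

# "sample-10BT" is a subset (config) of fineweb-edu on the Hub.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,  # avoids downloading the full subset up front
)

for example in stream.take(3):
    print(example["text"][:200])
```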
### Fine-Tuning Phase-1 (SFT – Instruction Tuning)
This stage introduced **instruction supervision** and conversational alignment.
- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
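Each UltraChat record carries a `messages` list of `{"role", "content"}` turns. A hedged formatting sketch follows; rendering with the tokenizer's chat template is an assumption, since the exact template used during this SFT phase is not documented in this card.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
tokenizer = AutoTokenizer.from_pretrained("SlimFactoryHub/SlimMoE-250M-instruct")

# Assumption: the tokenizer ships a chat template; the template actually
# used for this SFT phase is not documented in this card.
text = tokenizer.apply_chat_template(ds[0]["messages"], tokenize=False)
print(text[:300])
```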
### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)
This stage improved **domain knowledge and reasoning performance**.
- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
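MMLU records are multiple-choice (`question`, `choices`, integer `answer`), so they must be converted into instruction/response pairs for SFT. The prompt template below is illustrative; the conversion actually used in this phase is not documented in this card.

```python
from datasets import load_dataset

# The "all" config of cais/mmlu exposes the auxiliary_train split.
ds = load_dataset("cais/mmlu", "all", split="auxiliary_train")

def to_instruction(example):
    # Hypothetical prompt format; the template used for SlimMoE is undocumented.
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, example["choices"]))
    prompt = f"{example['question']}\n{options}\nAnswer with the correct letter."
    return {"prompt": prompt, "response": letters[example["answer"]]}

pairs = ds.map(to_instruction)
print(pairs[0]["prompt"], "->", pairs[0]["response"])
```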
### Fine-Tuning Phase-3 (SFT – Instruction Refinement)
This stage focused on **response quality, instruction clarity, and consistency**.
- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
## VGQA & Positional Encoding Experiments
- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.
**Given the dataset scale, GPU availability, and training time, training remained stable and output quality is reasonable for a model of this size.**
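The exact VGQA attention variant is not specified in this card. As a rough illustration of the design space explored, the sketch below implements plain grouped-query attention with a RoPE on/off switch (a NoPE run simply skips the rotation). The class name and the KV-head count are assumptions; only the hidden size (768) and head count (12) come from the summary table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    """Rotary position embeddings for (batch, heads, seq, head_dim) tensors."""
    _, _, S, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(S, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GroupedQueryAttention(nn.Module):
    """GQA with an optional RoPE toggle; use_rope=False corresponds to NoPE.

    Hypothetical sketch: the actual VGQA mechanism used by SlimMoE is not
    documented in this card. num_kv_heads=4 is an assumption.
    """

    def __init__(self, hidden_size=768, num_heads=12, num_kv_heads=4, use_rope=True):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.h, self.kv, self.d = num_heads, num_kv_heads, hidden_size // num_heads
        self.use_rope = use_rope
        self.q_proj = nn.Linear(hidden_size, num_heads * self.d, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.d, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.d, bias=False)
        self.o_proj = nn.Linear(num_heads * self.d, hidden_size, bias=False)

    def forward(self, x):                                   # x: (B, S, hidden)
        B, S, _ = x.shape
        q = self.q_proj(x).view(B, S, self.h, self.d).transpose(1, 2)
        k = self.k_proj(x).view(B, S, self.kv, self.d).transpose(1, 2)
        v = self.v_proj(x).view(B, S, self.kv, self.d).transpose(1, 2)
        if self.use_rope:                                   # NoPE: skip the rotation
            q, k = apply_rope(q), apply_rope(k)
        k = k.repeat_interleave(self.h // self.kv, dim=1)   # share KV across groups
        v = v.repeat_interleave(self.h // self.kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, S, -1))
```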
## Known Issues & Constraints
- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations**: training loss showed occasional fluctuations across phases
- **No RLHF applied**: alignment relies on supervised fine-tuning only
- **English-centric data distribution**: the training corpora are predominantly English
These factors directly influenced training duration and final model behavior.
## Intended Use
- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research
## Acknowledgements
We would like to thank the dataset providers and the open-source community whose contributions made this work possible.
- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **[The Smol Training Playbook: The Secrets to Building World-Class LLMs](https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf)**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow.
We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
## Contact
Please use the Hugging Face **Discussions** tab to connect.