---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---
# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final, refined, instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.

The objective of this phase is to produce a stable and well-aligned small MoE instruction model, suitable for research and experimental evaluation under limited data and compute constraints.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
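
For quick experimentation, the snippet below is a minimal loading and generation sketch using `transformers`. Because `SlimMoEForCausalLM` is a custom architecture, `trust_remote_code=True` and the presence of a chat template are assumptions, and the repository id is taken from the Phase-3 training-log link; adjust these to match the actual repository contents.

```python
# Minimal, hedged usage sketch. Assumes the repository ships custom modeling
# code (trust_remote_code=True) and a chat template; both are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SlimFactoryHub/SlimMoE-250M-instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```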
## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
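
For reference, a minimal sketch of how the `sample-10BT` split can be streamed with the `datasets` library. The GPT-2 tokenizer and 2048-token truncation are assumptions inferred from the vocabulary size and context length above, not the exact pretraining pipeline.

```python
# Hedged sketch: stream FineWeb-Edu (sample-10BT) and tokenize documents.
# Tokenizer choice and packing strategy are assumptions, not the actual setup.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # 50,257-token vocabulary
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)

for i, doc in enumerate(stream):
    ids = tokenizer(doc["text"], truncation=True, max_length=2048)["input_ids"]
    # ...pack `ids` into fixed-length training sequences here...
    if i >= 2:  # preview only
        break
```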
### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduces **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
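
A hedged sketch of loading the `train_sft` split and flattening its multi-turn `messages` into plain training text; the role-tag format below is illustrative, not the exact template used during fine-tuning.

```python
# Hedged sketch: render UltraChat 200K conversations into single strings.
# The <|role|> tag format is illustrative only.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def render(example):
    # Each example carries a list of {"role": ..., "content": ...} turns.
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in example["messages"]]
    return {"text": "\n".join(parts)}

ds = ds.map(render, remove_columns=ds.column_names)
print(ds[0]["text"][:300])
```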
### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

This phase was used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
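
For reference, a minimal sketch of turning `auxiliary_train` items into instruction-style question-answer text; the letter labels and prompt layout are assumptions, not the exact format used.

```python
# Hedged sketch: format MMLU auxiliary_train multiple-choice items as text.
# Prompt layout and answer-letter convention are assumptions.
from datasets import load_dataset

ds = load_dataset("cais/mmlu", "auxiliary_train", split="train")
letters = ["A", "B", "C", "D"]

def to_prompt(ex):
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, ex["choices"]))
    return {"text": f"Question: {ex['question']}\n{options}\nAnswer: {letters[ex['answer']]}"}

ds = ds.map(to_prompt, remove_columns=ds.column_names)
print(ds[0]["text"])
```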
### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

This final phase focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**
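
For readers unfamiliar with the two positional strategies, the snippet below is a generic illustration (not taken from the SlimMoE codebase): RoPE rotates query/key feature pairs by position, while NoPE simply leaves the projections untouched.

```python
# Generic RoPE vs. NoPE illustration; not the model's actual attention code.
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, n_heads, head_dim), head_dim must be even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 12, 64)  # e.g. 12 heads, head_dim = 768 / 12 = 64
q_rope = apply_rope(q)       # RoPE: positions encoded by rotating feature pairs
q_nope = q                   # NoPE: no positional transform at all
```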
## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations**: Training loss exhibited noticeable fluctuations rather than a perfectly smooth curve
- **No RLHF applied**: Alignment relies on supervised fine-tuning only
- **English-centric data distribution**: Training data is predominantly English

These factors directly influenced training duration and final model behavior.
## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research
## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **The Smol Training Playbook: The Secrets to Building World-Class LLMs**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow. Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
## Contact

Please use the Hugging Face **Discussions** tab to connect.