---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---

# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final, instruction-refined version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.
The objective of this phase is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.


## Motivation

This work explores the following research question:

> **Can a small (<500M-parameter) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales  
- VGQA-style attention mechanisms  
- NoPE / RoPE compatibility in MoE architectures  
- Quality vs. efficiency trade-offs under limited data and GPU availability


## Model Summary

| Property | Value |
|--------|------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
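
For convenience, the sketch below shows one plausible way to load and query the model with `transformers`. The repository id and the need for `trust_remote_code=True` are assumptions (the `SlimMoEForCausalLM` architecture is custom rather than a stock `transformers` class); adjust them to match the published checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub path for this checkpoint; change it if the repo id differs.
model_id = "SlimFactoryHub/SlimMoE-250M-instruct"

# trust_remote_code=True is assumed to be required because SlimMoEForCausalLM
# is a custom architecture shipped with the repository.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Explain what a mixture-of-experts layer does in two sentences."
inputs = tokenizer(prompt, return_tensors="pt")

# Keep generation within the 2048-token context window listed above.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```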


## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu  
- **Split**: `sample-10BT`  
- **Tokens Used**: **5.2B**  
- **Duration**: **7 days 16 hours**  
- **GPU**: **48GB NVIDIA A100** 
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
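
As a rough illustration (not the actual training script), the pretraining corpus can be streamed with the `datasets` library; `sample-10BT` is exposed on the Hub as a subset/config name:

```python
from datasets import load_dataset

# Stream the FineWeb-Edu "sample-10BT" subset lazily instead of downloading it all.
stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at a few documents; FineWeb-Edu records expose their content in "text".
for i, example in enumerate(stream):
    print(example["text"][:200])
    if i >= 2:
        break
```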


### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduces **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k  
- **Split**: `train_sft`  
- **Duration**: **8 days 8 hours**  
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
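
For readers reproducing a similar setup, a minimal preprocessing sketch is shown below; the tokenizer path is hypothetical and a chat template is assumed to be available, so the exact training pipeline may differ. UltraChat 200K stores each dialogue as a `messages` list of `{"role", "content"}` turns.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the SFT split used in this phase.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Hypothetical tokenizer path; any tokenizer with a chat template would work here.
tokenizer = AutoTokenizer.from_pretrained("SlimFactoryHub/SlimMoE-250M-instruct")

# Flatten one multi-turn conversation into a single training string.
messages = dataset[0]["messages"]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text[:500])
```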


### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

Used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu  
- **Split**: `auxiliary_train`  
- **Duration**: **8 days 11 hours**  
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf


### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

Focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4  
- **Duration**: **5 days 1 hour**  
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf  


## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**
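
To make the comparison concrete, the sketch below shows the standard rotary-embedding (RoPE) transform applied to a query tensor; the NoPE configuration simply skips this step and relies on causal masking alone. This is a generic illustration, not the model's actual attention code.

```python
import torch

def rotary_angles(head_dim: int, seq_len: int, base: float = 10000.0):
    # Standard RoPE frequency table: one angle per (position, dimension pair).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, heads, seq_len, head_dim); rotate each consecutive pair of dims.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Toy shapes loosely matching the summary table (12 heads, 768 hidden -> 64-dim heads).
q = torch.randn(1, 12, 8, 64)
cos, sin = rotary_angles(head_dim=64, seq_len=8)

q_rope = apply_rope(q, cos, sin)   # RoPE variant: position-dependent rotation
q_nope = q                         # NoPE variant: queries/keys left untouched
```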

## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models  
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations**: Training loss showed intermittent fluctuations across phases  
- **No RLHF applied**: Alignment relies on supervised fine-tuning only  
- **English-centric data distribution**: Training data is predominantly English

These factors directly influenced training duration and final model behavior.


## Intended Use


- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research
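
For readers studying small-scale MoE behavior, the snippet below is a generic top-k token-routing layer sized like the summary table (hidden 768, FFN 1536, 4 experts). It is an illustration of the general technique only, not the model's actual SlimMoE routing implementation, and the top-k value is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTopKMoE(nn.Module):
    """Generic top-k token router over a few expert FFNs (illustration only)."""

    def __init__(self, hidden: int = 768, ffn: int = 1536, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden). Pick the k best experts per token and mix their outputs.
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)       # per-token expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 768)
print(TinyTopKMoE()(tokens).shape)   # torch.Size([10, 768])
```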


## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **The Smol Training Playbook: The Secrets to Building World-Class LLMs**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow.  
Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf 

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.


## Contact
Please use the Hugging Face **Discussions** tab of this repository to get in touch.