---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---

# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final refined instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.
The objective of this phase is to produce a stable, well-aligned small MoE instruction model suitable for research and experimental evaluation under limited data and compute constraints.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability

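The routing behavior studied here builds on standard top-k gating, which can be sketched in plain Python. This illustrates generic top-k softmax routing over 4 experts; the specifics of SlimMoE's "Adaptive MoE Routing" are not documented in this card, so the gate logits and top-k choice below are illustrative assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts,
    with weights renormalized over the selected experts."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in ranked)
    return [(i, probs[i] / total) for i in ranked]

# One token's router logits over 4 experts (hypothetical values):
choices = route_token([2.0, 0.5, 1.0, -1.0], top_k=2)
```

Each token's output is then the weighted sum of the selected experts' outputs, so only `top_k` of the 4 expert FFNs run per token.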
## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |

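A back-of-the-envelope estimate from the table above lands in the same ballpark as the reported 250M. This sketch assumes standard transformer shapes (Q/K/V/output projections, two weight matrices per expert FFN, tied embeddings); the real SlimMoE layout may differ, and biases, norms, or an untied LM head would shift the total:

```python
# Rough parameter estimate from the Model Summary table (assumed shapes).
vocab, d, layers, ffn, experts = 50257, 768, 16, 1536, 4

embedding = vocab * d                  # token embeddings (assumed tied with LM head)
attn_per_layer = 4 * d * d             # Q, K, V, and output projections
moe_per_layer = experts * 2 * d * ffn  # up + down projection per expert
router_per_layer = d * experts         # gating network
total = embedding + layers * (attn_per_layer + moe_per_layer + router_per_layer)

print(f"~{total / 1e6:.0f}M parameters")
```

The estimate comes out near 227M, consistent with the headline 250M figure once implementation details (weight tying, biases, normalization layers) are accounted for.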
## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf

### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduced **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf

### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

This phase was used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf

### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

This phase focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf

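As a sanity check on the pretraining figures above, the implied average throughput on the single GPU follows directly from the token count and wall-clock duration (an approximation that ignores checkpointing, evaluation, and any restarts):

```python
# Implied average pretraining throughput from the reported figures.
tokens = 5.2e9                   # tokens seen during pretraining
seconds = (7 * 24 + 16) * 3600   # 7 days 16 hours of wall-clock time
throughput = tokens / seconds    # average tokens processed per second

print(f"~{throughput:,.0f} tokens/s")
```

That works out to roughly 7,850 tokens/s on average, a plausible rate for a 250M-parameter model in float32 on a single A100.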
## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**

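For reference, the two positional strategies compared above differ only in whether query/key vectors are transformed before attention: NoPE applies no transformation at all, while RoPE rotates consecutive dimension pairs by position-dependent angles. A minimal sketch of the generic RoPE rotation (not SlimMoE's exact implementation):

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to one head vector.
    Consecutive dimension pairs (2i, 2i+1) are rotated by
    angle = position / base**(i / len(vec)). Under NoPE the
    vector would simply be used unchanged."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

q = [1.0, 0.0, 1.0, 0.0]
q0 = rope_rotate(q, position=0)  # position 0 leaves the vector unchanged
q5 = rope_rotate(q, position=5)
```

Because each pair is rotated (a norm-preserving operation), query-key dot products depend only on the *relative* distance between positions, which is the property that makes RoPE attractive to compare against having no positional signal at all.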
## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations**: Training loss fluctuated across phases rather than decreasing smoothly
- **No RLHF applied**: The model has not undergone preference-based alignment
- **English-centric data distribution**: Performance in other languages is not guaranteed

These factors directly influenced training duration and final model behavior.

## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for the logging and visualization tools used to monitor training progress.

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.


## Contact

Please use the Hugging Face **Discussions** tab to connect.