Aispace2001 committed · verified
Commit 4d8ceb8 · 1 Parent(s): 6e20f97

Update README.md

Files changed (1): README.md (+141 −4)
README.md CHANGED
---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# SlimMoE-250M

**SlimMoE-250M** is a 250M-parameter Mixture-of-Experts (MoE) language model developed by the **SlimFactory team**. It was trained to **experiment with VGQA-style attention mechanisms and NoPE/RoPE positional strategies in a small-parameter MoE setting**, focusing on architectural feasibility and training stability rather than scale or benchmark maximization.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support VGQA-style attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
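
The first point above, routing behavior at small scales, can be sketched with a minimal top-1 softmax router. This is a hypothetical NumPy sketch for illustration only; the card does not specify how the model's adaptive routing actually works.

```python
import numpy as np

def top1_route(x, w_router, experts):
    """Route each token to its highest-scoring expert (top-1 MoE).

    x        : (tokens, hidden) token representations
    w_router : (hidden, n_experts) router weights
    experts  : list of callables, one per expert FFN
    """
    logits = x @ w_router                       # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)  # softmax over experts
    choice = probs.argmax(axis=-1)              # top-1 expert per token
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Scale by the router probability so gradients reach the router.
            out[mask] = expert(x[mask]) * probs[mask, e:e + 1]
    return out, choice

rng = np.random.default_rng(0)
hidden, n_experts, tokens = 768, 4, 8
x = rng.standard_normal((tokens, hidden))
w_router = rng.standard_normal((hidden, n_experts))
# Toy linear "experts"; each lambda captures its own weight matrix.
experts = [lambda h, w=rng.standard_normal((hidden, hidden)) * 0.02: h @ w
           for _ in range(n_experts)]
out, choice = top1_route(x, w_router, experts)
print(out.shape, choice.shape)  # (8, 768) (8,)
```

At this scale only a handful of tokens reach each expert per step, which is why routing stability is a research question in its own right.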

## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
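
As a sanity check, the table's figures can be turned into a rough parameter estimate. This is a back-of-the-envelope sketch that ignores router weights, layer norms, and biases, and assumes tied input/output embeddings.

```python
# Figures taken from the table above.
vocab, hidden, layers = 50_257, 768, 16
ffn_size, n_experts = 1_536, 4

embeddings = vocab * hidden                        # tied with the LM head
attn_per_layer = 4 * hidden * hidden               # Q, K, V, O projections
moe_per_layer = n_experts * 2 * hidden * ffn_size  # up + down proj per expert

total = embeddings + layers * (attn_per_layer + moe_per_layer)
print(f"{total / 1e6:.0f}M")  # 227M; routers, norms, and an untied head
                              # would account for the remainder of 250M
```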

## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**

### Fine-Tuning Phase-1 (SFT – VGQA / Instruction)

This stage introduces **VGQA-style instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
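
Rows in `ultrachat_200k` store each conversation as a `messages` list of `{"role", "content"}` dicts; flattening them into training strings might look like the sketch below. The `<|user|>`/`<|assistant|>` tags are a hypothetical template — the card does not specify the template SlimMoE actually used.

```python
def format_chat(messages, eos="</s>"):
    """Flatten a chat transcript into one training string.

    `messages` follows the ultrachat_200k schema: a list of
    {"role": ..., "content": ...} dicts. The role tags below are
    a made-up template, not the one used to train this model.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + eos

example = [
    {"role": "user", "content": "What is a Mixture of Experts?"},
    {"role": "assistant", "content": "A model whose FFN is split into routed experts."},
]
print(format_chat(example))
```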

### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

Used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**

### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

Focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**

## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**
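
For context, the RoPE half of these experiments can be illustrated with a generic rotary-embedding sketch (standard RoPE in NumPy, not this model's code): each even/odd pair of dimensions is rotated by a position-dependent angle, which preserves vector norms and makes query–key dot products depend only on the relative offset.

```python
import numpy as np

def rope(x, pos, base=10_000.0):
    """Apply rotary position embedding to x of shape (dim,), dim even."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one angle per dim pair
    angle = pos * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                   # 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Rotation is orthogonal, so the norm is unchanged.
print(np.linalg.norm(q), np.linalg.norm(rope(q, 5)))

# Scores depend only on the relative offset (7 - 3 == 17 - 13).
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 13) @ rope(k, 17)
print(s1, s2)
```

NoPE, by contrast, simply omits this step and relies on the causal mask alone to convey order.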

## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **No RLHF applied**
- **English-centric data distribution**

These factors directly influenced training duration and final model behavior.

## Intended Use

This model is released **strictly for research and experimental purposes**.

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research

**Not intended for production use.**

## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.

## Contact

Please use the Hugging Face **Discussions** tab on this repository to get in touch.