Aispace2001 committed · verified
Commit 831ecdf · 1 Parent(s): a0e6dc6

Update README.md

Files changed (1)
  1. README.md +19 -14
README.md CHANGED
@@ -6,6 +6,8 @@ tags:
 - Text-Generation
 - Instruction Following
 - VGQA
+- Research
+- SLM
 datasets:
 - HuggingFaceFW/fineweb-edu
 - HuggingFaceH4/ultrachat_200k
@@ -25,7 +27,7 @@ library_name: transformers
 
 This work explores the following research question:
 
-> **Can a small (<500M) MoE model effectively support VGQA-style attention mechanisms and alternative positional encodings under constrained compute?**
+> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**
 
 SlimMoE-250M was designed to study:
 
@@ -65,17 +67,20 @@ This phase focused on **general language modeling** using high-quality education
 - **Split**: `sample-10BT`
 - **Tokens Used**: **5.2B**
 - **Duration**: **7 days 16 hours**
-- **GPU**: **48GB NVIDIA A100**
+- **GPU**: **48GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
 
 
-### Fine-Tuning Phase-1 (SFT – VGQA / Instruction)
+### Fine-Tuning Phase-1 (SFT – Instruction Tuning)
 
-This stage introduces **VGQA-style instruction supervision** and conversational alignment.
+This stage introduces **instruction supervision** and conversational alignment.
 
 - **Dataset**: HuggingFaceH4/ultrachat_200k
 - **Split**: `train_sft`
 - **Duration**: **8 days 8 hours**
-- **GPU**: **80GB NVIDIA A100**
+- **GPU**: **80GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
+
 
 ### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)
 
@@ -84,7 +89,9 @@ Used to improve **domain knowledge and reasoning performance**.
 - **Dataset**: cais/mmlu
 - **Split**: `auxiliary_train`
 - **Duration**: **8 days 11 hours**
-- **GPU**: **48GB NVIDIA A100**
+- **GPU**: **48GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
+
 
 ### Fine-Tuning Phase-3 (SFT – Instruction Refinement)
 
@@ -92,7 +99,8 @@ Focused on **response quality, instruction clarity, and consistency**.
 
 - **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
 - **Duration**: **5 days 1 hour**
-- **GPU**: **48GB NVIDIA A100**
+- **GPU**: **48GB NVIDIA A100**
+- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
 
 
 ## VGQA & Positional Encoding Experiments
@@ -106,7 +114,8 @@ Focused on **response quality, instruction clarity, and consistency**.
 ## Known Issues & Constraints
 
 - **Dataset limitations**: Limited diversity and scale compared to large foundation models
-- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
+- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
+- **Loss fluctuations**
 - **No RLHF applied**
 - **English-centric data distribution**
 
@@ -115,17 +124,12 @@ These factors directly influenced training duration and final model behavior.
 
 ## Intended Use
 
-This model is released **strictly for research and experimental purposes**.
 
 - Studying **small-scale MoE architectures**
 - Exploring **VGQA-style attention mechanisms**
 - Evaluating **NoPE / RoPE behavior in MoE models**
 - Educational and exploratory research
 
-**Not intended for production use.**
-
-
-
 
 ## Acknowledgements
 
@@ -136,10 +140,11 @@ We would like to thank the dataset providers and the open-source community whose
 - **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
 - **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
 - **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
+- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
+
 
 We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
 
 
-
 ## Contact
 Please use the Hugging Face **Discussions** tab to connect.
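The card above evaluates NoPE (no positional encoding) against RoPE. As background for readers of the diff, here is a minimal pure-Python sketch of the RoPE rotation and the relative-position property it provides; this is illustrative only and not the model's actual implementation (`rope_rotate` is a hypothetical helper name).

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to one head vector.

    Consecutive pairs (vec[2i], vec[2i+1]) are rotated by an angle
    pos * base**(-2i/d), so attention scores between rotated vectors
    depend only on the relative offset between their positions.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE expects an even head dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Key property: the score of rotated q and k depends only on the
# offset (pos_q - pos_k), not on the absolute positions themselves.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.2, 0.9]
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))    # offset 2
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))  # offset 2 again
print(abs(s1 - s2) < 1e-9)  # True: equal offsets give equal scores
```

Dropping the rotation entirely (NoPE) is the other arm of the card's comparison: attention then sees no positional signal at all and must infer order from content.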
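The card also lists studying small-scale MoE architectures as an intended use. For orientation, the following is a toy sketch of top-k expert routing, the mechanism a MoE layer uses to activate only a few experts per token; the function name, gate layout, and toy experts are all hypothetical, not SlimMoE-250M's actual router.

```python
import math

def moe_forward(x, experts, gate_weights, k=2):
    """Toy top-k mixture-of-experts routing for one token.

    x: token feature vector; experts: list of callables; gate_weights:
    one router weight vector per expert. The router scores each expert,
    keeps the top-k, renormalizes their softmax probabilities, and
    returns the weighted sum of the selected experts' outputs.
    """
    # Router logits: dot product of the token with each gate vector.
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in gate_weights]
    # Indices of the k highest-scoring experts.
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (renormalized gating).
    m = max(logits[i] for i in topk)
    exps = {i: math.exp(logits[i] - m) for i in topk}
    z = sum(exps.values())
    # Weighted combination of the chosen experts' outputs.
    out = [0.0] * len(experts[topk[0]](x))
    for i in topk:
        y = experts[i](x)
        out = [o + (exps[i] / z) * yi for o, yi in zip(out, y)]
    return out, topk

# Toy example: three "experts" that scale the input differently.
experts = [lambda v: [2 * t for t in v],
           lambda v: [-1 * t for t in v],
           lambda v: [0.5 * t for t in v]]
gates = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y, chosen = moe_forward([1.0, 0.5], experts, gates, k=2)
```

Only the `k` selected experts run per token, which is why a ~250M-parameter MoE can train under the single-GPU budgets listed in the diff.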