---
language:
- en
tags:
- MoE
- Text-Generation
- Instruction Following
- VGQA
- Research
- SLM
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
- cais/mmlu
- HuggingFaceTB/OpenHermes-2.5-H4
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
base_model:
- SlimFactoryHub/SlimMoE-250M-SFT-v2
---
# SlimMoE-250M-SFT-instruct

**SlimMoE-250M-instruct** is the final, refined, instruction-tuned version of the model. This stage emphasizes response quality, instruction clarity, consistency, and conversational coherence, building on the instruction-following and reasoning capabilities developed in earlier phases.

The objective of this phase is to produce a stable and well-aligned small MoE instruction model, suitable for research and experimental evaluation under limited data and compute constraints.

## Motivation

This work explores the following research question:

> **Can a small (<500M) MoE model effectively support different attention mechanisms and alternative positional encodings under constrained compute?**

SlimMoE-250M was designed to study:

- MoE routing behavior at small scales
- VGQA-style attention mechanisms
- NoPE / RoPE compatibility in MoE architectures
- Quality vs. efficiency trade-offs under limited data and GPU availability
## Model Summary

| Property | Value |
|----------|-------|
| Parameters | **250M** |
| Architecture | **SlimMoEForCausalLM** |
| Experts | **4** |
| Layers | **16** |
| Hidden Size | **768** |
| FFN Size | **1536** |
| Attention Heads | **12** |
| Max Context Length | **2048** |
| Routing | **Adaptive MoE Routing** |
| Dropout | **0.1** |
| Precision | **float32** |
| Vocabulary Size | **50,257** |
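
For quick experimentation, the snippet below is a minimal loading and generation sketch using `transformers`. Because `SlimMoEForCausalLM` is a custom architecture, `trust_remote_code=True` and the presence of a chat template are assumptions, and the repository id is taken from the Phase-3 training-log link; adjust these to match the actual repository contents.

```python
# Minimal, hedged usage sketch. Assumes the repository ships custom modeling
# code (trust_remote_code=True) and a chat template; both are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SlimFactoryHub/SlimMoE-250M-instruct"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain mixture-of-experts routing in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```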
## Training Details

### Pretraining

This phase focused on **general language modeling** using high-quality educational data.

- **Dataset**: HuggingFaceFW/fineweb-edu
- **Split**: `sample-10BT`
- **Tokens Used**: **5.2B**
- **Duration**: **7 days 16 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-base/blob/main/PreTraining.pdf
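
For reference, a minimal sketch of how the `sample-10BT` split can be streamed with the `datasets` library. The GPT-2 tokenizer and 2048-token truncation are assumptions inferred from the vocabulary size and context length above, not the exact pretraining pipeline.

```python
# Hedged sketch: stream FineWeb-Edu (sample-10BT) and tokenize documents.
# Tokenizer choice and packing strategy are assumptions, not the actual setup.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # 50,257-token vocabulary
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)

for i, doc in enumerate(stream):
    ids = tokenizer(doc["text"], truncation=True, max_length=2048)["input_ids"]
    # ...pack `ids` into fixed-length training sequences here...
    if i >= 2:  # preview only
        break
```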
### Fine-Tuning Phase-1 (SFT – Instruction Tuning)

This stage introduces **instruction supervision** and conversational alignment.

- **Dataset**: HuggingFaceH4/ultrachat_200k
- **Split**: `train_sft`
- **Duration**: **8 days 8 hours**
- **GPU**: **80GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v1/blob/main/SFT_v1.pdf
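
A hedged sketch of loading the `train_sft` split and flattening its multi-turn `messages` into plain training text; the role-tag format below is illustrative, not the exact template used during fine-tuning.

```python
# Hedged sketch: render UltraChat 200K conversations into single strings.
# The <|role|> tag format is illustrative only.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def render(example):
    # Each example carries a list of {"role": ..., "content": ...} turns.
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in example["messages"]]
    return {"text": "\n".join(parts)}

ds = ds.map(render, remove_columns=ds.column_names)
print(ds[0]["text"][:300])
```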
### Fine-Tuning Phase-2 (SFT – Knowledge & Reasoning)

This phase was used to improve **domain knowledge and reasoning performance**.

- **Dataset**: cais/mmlu
- **Split**: `auxiliary_train`
- **Duration**: **8 days 11 hours**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-SFT-v2/blob/main/SFT_v2.pdf
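
For reference, a minimal sketch of turning `auxiliary_train` items into instruction-style question-answer text; the letter labels and prompt layout are assumptions, not the exact format used.

```python
# Hedged sketch: format MMLU auxiliary_train multiple-choice items as text.
# Prompt layout and answer-letter convention are assumptions.
from datasets import load_dataset

ds = load_dataset("cais/mmlu", "auxiliary_train", split="train")
letters = ["A", "B", "C", "D"]

def to_prompt(ex):
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, ex["choices"]))
    return {"text": f"Question: {ex['question']}\n{options}\nAnswer: {letters[ex['answer']]}"}

ds = ds.map(to_prompt, remove_columns=ds.column_names)
print(ds[0]["text"])
```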
### Fine-Tuning Phase-3 (SFT – Instruction Refinement)

This final phase focused on **response quality, instruction clarity, and consistency**.

- **Dataset**: HuggingFaceTB/OpenHermes-2.5-H4
- **Duration**: **5 days 1 hour**
- **GPU**: **48GB NVIDIA A100**
- **Training Logs**: https://huggingface.co/SlimFactoryHub/SlimMoE-250M-instruct/blob/main/SFT_v3.pdf
## VGQA & Positional Encoding Experiments

- The model was trained using a **VGQA-style attention mechanism**.
- Experiments were conducted with **NoPE / RoPE positional strategies** within a **small MoE architecture**.
- The objective was to evaluate **training stability and output quality**, not to optimize benchmark performance.

**Given the dataset scale, GPU availability, and training time, the observed performance is reasonable and stable for this model size.**
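
For readers unfamiliar with the two positional strategies, the snippet below is a generic illustration (not taken from the SlimMoE codebase): RoPE rotates query/key feature pairs by position, while NoPE simply leaves the projections untouched.

```python
# Generic RoPE vs. NoPE illustration; not the model's actual attention code.
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, n_heads, head_dim), head_dim must be even
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 12, 64)  # e.g. 12 heads, head_dim = 768 / 12 = 64
q_rope = apply_rope(q)       # RoPE: positions encoded by rotating feature pairs
q_nope = q                   # NoPE: no positional transform at all
```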
## Known Issues & Constraints

- **Dataset limitations**: Limited diversity and scale compared to large foundation models
- **GPU constraints**: Training conducted under restricted GPU availability and memory budgets
- **Loss fluctuations**: Training loss exhibited noticeable fluctuations rather than a perfectly smooth curve
- **No RLHF applied**: Alignment relies on supervised fine-tuning only
- **English-centric data distribution**: Training data is predominantly English

These factors directly influenced training duration and final model behavior.
## Intended Use

- Studying **small-scale MoE architectures**
- Exploring **VGQA-style attention mechanisms**
- Evaluating **NoPE / RoPE behavior in MoE models**
- Educational and exploratory research
## Acknowledgements

We would like to thank the dataset providers and the open-source community whose contributions made this work possible.

- **Hugging Face** for providing the hosting infrastructure, model hub, datasets library, and tools that enabled training, evaluation, and open sharing of this model.
- **HuggingFaceFW** for the **FineWeb-Edu** dataset used during pretraining.
- **HuggingFaceH4** for the **UltraChat 200K** dataset used in supervised fine-tuning.
- **CAIS** for the **MMLU** dataset used for auxiliary knowledge and reasoning supervision.
- **HuggingFaceTB** for the **OpenHermes-2.5-H4** dataset used in the final instruction refinement phase.
- **Weights & Biases (W&B)** for logging and visualization tools used to monitor training progress.
- Additionally, we drew valuable insights from **The Smol Training Playbook: The Secrets to Building World-Class LLMs**, published by Hugging Face, which informed several practical decisions in our training and experimentation workflow. Playbook link: https://huggingfacetb-smol-training-playbook.hf.space/the-smol-training-playbook-the-secrets-to-building-world-class-llms.pdf

We also acknowledge the broader open-source research community for their continuous efforts in advancing efficient model architectures and training methodologies.
## Contact

Please use the Hugging Face **Discussions** tab to connect.