---
license: apache-2.0
language:
- en
tags:
- moe
- sparse-mixture-of-experts
- jax
- flax
- pytorch
- text-generation
- openwebtext
---

# Q-MoE-400

**Q-MoE-400** is a 400-million-parameter Sparse Mixture of Experts (MoE) model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.

This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can provide high model capacity while keeping per-token inference cost low.
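
To make the routing idea concrete, below is a minimal JAX sketch of top-k expert routing. It is illustrative only: the expert count, gating details, and the dense loop over experts are simplifications, not the actual implementation in this repo.

```python
import jax
import jax.numpy as jnp

def topk_moe_layer(x, router_w, expert_ws, k=2):
    """Minimal top-k MoE layer sketch (illustrative, not the repo's code).

    x:         (tokens, d_model) activations
    router_w:  (d_model, n_experts) router projection
    expert_ws: list of (w_in, w_out) weight pairs, one per expert
    """
    probs = jax.nn.softmax(x @ router_w, axis=-1)   # (tokens, n_experts)
    topk_p, topk_idx = jax.lax.top_k(probs, k)      # each (tokens, k)

    out = jnp.zeros_like(x)
    for e, (w_in, w_out) in enumerate(expert_ws):
        # Gate = router weight for expert e, or 0 if e is outside the
        # token's top-k. Real implementations dispatch only the selected
        # tokens to each expert; this dense loop is for readability.
        gate = jnp.where(topk_idx == e, topk_p, 0.0).sum(-1)  # (tokens,)
        out = out + gate[:, None] * (jax.nn.gelu(x @ w_in) @ w_out)
    return out
```

Since only k experts receive a nonzero gate per token, per-token compute tracks k rather than the total expert count, which is where the lower inference cost comes from.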

## 🎯 Project Goal

The primary goal of the Q-MoE project is to investigate:
1. **Compute Efficiency:** how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
2. **Routing Dynamics:** load balancing and expert specialization during pre-training (see the sketch after this list).
3. **Interoperability:** bridging research frameworks (JAX/Flax) and accessible inference (PyTorch).
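
The "Router Loss" reported in the metrics below is the auxiliary load-balancing term from point 2. The exact formulation lives in the GitHub repo; as an assumption, the sketch below shows the common Switch Transformer style loss, which penalizes imbalance between the fraction of tokens each expert receives and the probability mass the router assigns to it.

```python
import jax
import jax.numpy as jnp

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Switch-style auxiliary loss (an assumed form, not confirmed as
    the exact loss used for Q-MoE-400).

    router_probs: (tokens, n_experts) softmax outputs of the router
    expert_idx:   (tokens,) index of the expert chosen for each token
    """
    # f[e]: fraction of tokens dispatched to expert e.
    f = jnp.mean(jax.nn.one_hot(expert_idx, n_experts), axis=0)
    # p[e]: mean router probability assigned to expert e.
    p = jnp.mean(router_probs, axis=0)
    # Minimized (value 1) when both distributions are uniform.
    return n_experts * jnp.sum(f * p)
```

In practice this term is scaled by a small coefficient before being added to the cross-entropy loss, which is consistent with the "CE + Aux" totals in the metrics table below.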
27
+
28
+ ## 📊 Training Metrics
29
+
30
+ The model was evaluated at step **79,100**. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.
31
+
32
+ | Metric | Value | Description |
33
+ | :--- | :--- | :--- |
34
+ | **Step** | 79,100 | Total training steps |
35
+ | **Train Loss** | 3.2190 | Total training loss (CE + Aux) |
36
+ | **Train CE** | 3.0987 | Cross-Entropy loss on training data |
37
+ | **Val Loss** | 3.2028 | Total validation loss |
38
+ | **Val CE** | 3.0825 | Cross-Entropy loss on validation data |
39
+ | **Router Loss** | 0.1202 | Auxiliary load-balancing loss |
40
+ | **Dropped Tokens** | 0.0 | No tokens dropped (perfect capacity utilization) |
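
As a consistency check, the reported totals match the decomposition *Total = CE + Aux*: 3.0987 + 0.1202 ≈ 3.2190 (train) and 3.0825 + 0.1202 ≈ 3.2028 (validation), up to rounding.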

### Training Progress

![Validation loss curve](https://cdn-uploads.huggingface.co/production/uploads/64054e5e0ab5e22719fc179f/CALqiEjv1HahbLnZrbLPi.png)

## 🛠️ Repository Contents

This repository contains checkpoints compatible with both major frameworks:
- **JAX/Flax:** the original training checkpoints (Orbax format).
- **PyTorch:** converted weights in Safetensors format, for easier integration with the Hugging Face ecosystem (see the loading sketch after this list).
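
A minimal loading sketch for each format, assuming standard file layouts (`model.safetensors` and the checkpoint directory name are assumptions; check the repositories for the actual paths):

```python
# PyTorch: read the converted Safetensors weights into a state dict.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # assumed filename
print(f"loaded {len(state_dict)} tensors")

# JAX/Flax: restore the original training checkpoint with Orbax.
import orbax.checkpoint as ocp

params = ocp.PyTreeCheckpointer().restore("checkpoints/step_79100")  # assumed path
```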

## 💻 Inference & Usage

For inference code, architectural details, and conversion scripts, please visit the official GitHub repository:

👉 **[https://github.com/sidharth72/Q-MoE-400](https://github.com/sidharth72/Q-MoE-400)**

To run the model, you will likely need the custom modeling code provided in the GitHub repo, as this uses a specialized sparse MoE architecture.
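
As a rough sketch of what inference could look like once that modeling code is on your path (`modeling_qmoe`, `QMoEModel`, and its constructor are hypothetical placeholders, not the repo's confirmed API):

```python
import torch
from safetensors.torch import load_file

# Hypothetical import: the real module and class names are defined in
# the GitHub repo and may differ.
from modeling_qmoe import QMoEModel

model = QMoEModel()  # hypothetical constructor; the real one may take a config
model.load_state_dict(load_file("model.safetensors"))
model.eval()

token_ids = torch.tensor([[1, 2, 3]])  # placeholder prompt ids
with torch.no_grad():
    logits = model(token_ids)          # (1, seq_len, vocab_size)
print(logits[0, -1].argmax().item())   # greedy next-token id
```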

## ⚙️ Training Details

- **Architecture:** Sparse Mixture of Experts (Transformer decoder)
- **Parameters:** ~400M total, with significantly fewer active parameters per token
- **Dataset:** OpenWebText
- **Hardware:** 8 x TPU v3
- **Framework:** JAX / Flax

## 📜 Citation

If you find this model or the associated research useful, please cite:

```bibtex
@misc{q-moe-400,
  author = {Sidharthan},
  title = {Q-MoE-400: A Sparse Mixture of Experts Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/your-username/Q-MoE-400}}
}
```