---
license: apache-2.0
language:
- en
tags:
- moe
- sparse-mixture-of-experts
- jax
- flax
- pytorch
- text-generation
- openwebtext
---

# Q-MoE-400

**Q-MoE-400** is a 400-million-parameter Sparse Mixture of Experts (MoE) model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.

This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can provide high model capacity while keeping per-token inference cost low.
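
To make the routing idea concrete, below is a minimal JAX sketch of top-k expert routing. It is illustrative only: the expert count, gating details, and the dense loop over experts are simplifications, not the actual implementation in this repo.

```python
import jax
import jax.numpy as jnp

def topk_moe_layer(x, router_w, expert_ws, k=2):
    """Minimal top-k MoE layer sketch (illustrative, not the repo's code).

    x:         (tokens, d_model) activations
    router_w:  (d_model, n_experts) router projection
    expert_ws: list of (w_in, w_out) weight pairs, one per expert
    """
    probs = jax.nn.softmax(x @ router_w, axis=-1)   # (tokens, n_experts)
    topk_p, topk_idx = jax.lax.top_k(probs, k)      # each (tokens, k)

    out = jnp.zeros_like(x)
    for e, (w_in, w_out) in enumerate(expert_ws):
        # Gate = router weight for expert e, or 0 if e is outside the
        # token's top-k. Real implementations dispatch only the selected
        # tokens to each expert; this dense loop is for readability.
        gate = jnp.where(topk_idx == e, topk_p, 0.0).sum(-1)  # (tokens,)
        out = out + gate[:, None] * (jax.nn.gelu(x @ w_in) @ w_out)
    return out
```

Since only k experts receive a nonzero gate per token, per-token compute tracks k rather than the total expert count, which is where the lower inference cost comes from.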

## 🎯 Project Goal

The primary goal of the Q-MoE project is to investigate:
1. **Compute Efficiency:** how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
2. **Routing Dynamics:** load balancing and expert specialization during pre-training (see the sketch after this list).
3. **Interoperability:** bridging research frameworks (JAX/Flax) and accessible inference (PyTorch).
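
The "Router Loss" reported in the metrics below is the auxiliary load-balancing term from point 2. The exact formulation lives in the GitHub repo; as an assumption, the sketch below shows the common Switch Transformer style loss, which penalizes imbalance between the fraction of tokens each expert receives and the probability mass the router assigns to it.

```python
import jax
import jax.numpy as jnp

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Switch-style auxiliary loss (an assumed form, not confirmed as
    the exact loss used for Q-MoE-400).

    router_probs: (tokens, n_experts) softmax outputs of the router
    expert_idx:   (tokens,) index of the expert chosen for each token
    """
    # f[e]: fraction of tokens dispatched to expert e.
    f = jnp.mean(jax.nn.one_hot(expert_idx, n_experts), axis=0)
    # p[e]: mean router probability assigned to expert e.
    p = jnp.mean(router_probs, axis=0)
    # Minimized (value 1) when both distributions are uniform.
    return n_experts * jnp.sum(f * p)
```

In practice this term is scaled by a small coefficient before being added to the cross-entropy loss, which is consistent with the "CE + Aux" totals in the metrics table below.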
27
+
28
+ ## 📊 Training Metrics
29
+
30
+ The model was evaluated at step **79,100**. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.
31
+
32
+ | Metric | Value | Description |
33
+ | :--- | :--- | :--- |
34
+ | **Step** | 79,100 | Total training steps |
35
+ | **Train Loss** | 3.2190 | Total training loss (CE + Aux) |
36
+ | **Train CE** | 3.0987 | Cross-Entropy loss on training data |
37
+ | **Val Loss** | 3.2028 | Total validation loss |
38
+ | **Val CE** | 3.0825 | Cross-Entropy loss on validation data |
39
+ | **Router Loss** | 0.1202 | Auxiliary load-balancing loss |
40
+ | **Dropped Tokens** | 0.0 | No tokens dropped (perfect capacity utilization) |
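
As a consistency check, the reported totals match the decomposition *Total = CE + Aux*: 3.0987 + 0.1202 ≈ 3.2190 (train) and 3.0825 + 0.1202 ≈ 3.2028 (validation), up to rounding.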

### Training Progress

![Validation loss curve](https://cdn-uploads.huggingface.co/production/uploads/64054e5e0ab5e22719fc179f/CALqiEjv1HahbLnZrbLPi.png)

## 🛠️ Repository Contents

This repository contains checkpoints compatible with both major frameworks:
- **JAX/Flax:** the original training checkpoints (Orbax format).
- **PyTorch:** converted weights in Safetensors format, for easier integration with the Hugging Face ecosystem (see the loading sketch after this list).
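
A minimal loading sketch for each format, assuming standard file layouts (`model.safetensors` and the checkpoint directory name are assumptions; check the repositories for the actual paths):

```python
# PyTorch: read the converted Safetensors weights into a state dict.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # assumed filename
print(f"loaded {len(state_dict)} tensors")

# JAX/Flax: restore the original training checkpoint with Orbax.
import orbax.checkpoint as ocp

params = ocp.PyTreeCheckpointer().restore("checkpoints/step_79100")  # assumed path
```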

## 💻 Inference & Usage

For inference code, architectural details, and conversion scripts, please visit the official GitHub repository:

👉 **[https://github.com/sidharth72/Q-MoE-400](https://github.com/sidharth72/Q-MoE-400)**

To run the model, you will likely need the custom modeling code provided in the GitHub repo, as this uses a specialized sparse MoE architecture.
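
As a rough sketch of what inference could look like once that modeling code is on your path (`modeling_qmoe`, `QMoEModel`, and its constructor are hypothetical placeholders, not the repo's confirmed API):

```python
import torch
from safetensors.torch import load_file

# Hypothetical import: the real module and class names are defined in
# the GitHub repo and may differ.
from modeling_qmoe import QMoEModel

model = QMoEModel()  # hypothetical constructor; the real one may take a config
model.load_state_dict(load_file("model.safetensors"))
model.eval()

token_ids = torch.tensor([[1, 2, 3]])  # placeholder prompt ids
with torch.no_grad():
    logits = model(token_ids)          # (1, seq_len, vocab_size)
print(logits[0, -1].argmax().item())   # greedy next-token id
```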

## ⚙️ Training Details

- **Architecture:** Sparse Mixture of Experts (Transformer decoder)
- **Parameters:** ~400M total, with significantly fewer active parameters per token
- **Dataset:** OpenWebText
- **Hardware:** 8 x TPU v3
- **Framework:** JAX / Flax

## 📜 Citation

If you find this model or the associated research useful, please cite:

```bibtex
@misc{q-moe-400,
  author = {Sidharthan},
  title = {Q-MoE-400: A Sparse Mixture of Experts Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/your-username/Q-MoE-400}}
}
```