Update README.md
README.md CHANGED
@@ -1,3 +1,4 @@
+
 ---
 license: apache-2.0
 language:
@@ -10,6 +11,7 @@ tags:
 - pytorch
 - text-generation
 - openwebtext
+- custom_code
 ---
 
 # Q-MoE-400
@@ -18,6 +20,39 @@ tags:
 
 This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can enable high-capacity models with lower inference costs.
 
+## 💻 Usage
+
+You can use this model directly with the Hugging Face `transformers` library. Since this model uses a custom architecture, `trust_remote_code=True` is required.
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+path = "QuarkML/Q-MoE-400"
+
+# Load tokenizer and model
+tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    path,
+    trust_remote_code=True,
+    dtype=torch.float16,  # optional but recommended for GPU
+    device_map="auto"     # automatically maps to available device (CUDA/CPU)
+)
+
+# Generate text
+inputs = tok("Artificial neural networks are ", return_tensors="pt")
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+out = model.generate(
+    **inputs,
+    max_new_tokens=50,
+    do_sample=True,
+    temperature=0.8
+)
+
+print(tok.decode(out[0], skip_special_tokens=True))
+```
+
 ## 🎯 Project Goal
 
 The primary goal of the Q-MoE project is to investigate:
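A quick primer on the claim in the model description above — that routing yields high capacity at low inference cost: a sparse MoE layer uses a small learned router that scores every expert for each token and dispatches the token to only the top-k experts. The sketch below is a minimal, hypothetical PyTorch illustration of top-k routing; it is not the repository's actual implementation (that ships as custom modeling code in the GitHub repo), and `top_k_route` with its shapes is an assumption made for the example.

```python
import torch
import torch.nn.functional as F

def top_k_route(hidden, router_weight, k=2):
    # hidden: (tokens, d_model); router_weight: (d_model, n_experts)
    logits = hidden @ router_weight               # score every expert per token
    probs = F.softmax(logits, dim=-1)
    gates, experts = probs.topk(k, dim=-1)        # keep only the k best experts
    gates = gates / gates.sum(-1, keepdim=True)   # renormalize the kept gates
    return experts, gates                         # both (tokens, k)

# Only the k selected experts' FFNs run for a given token, so per-token
# FLOPs stay close to a small dense model while total parameter count
# grows with the number of experts.
```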
@@ -31,7 +66,7 @@ The model was evaluated at step **79,100**. The final validation metrics indicate
 
 | Metric | Value | Description |
 | :--- | :--- | :--- |
-| **Step** |
+| **Step** | 79,100 | Total training steps |
 | **Train Loss** | 3.2190 | Total training loss (CE + Aux) |
 | **Train CE** | 3.0987 | Cross-Entropy loss on training data |
 | **Val Loss** | 3.2028 | Total validation loss |
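About the table's "CE + Aux" breakdown: the roughly 0.12 gap between Train Loss (3.2190) and Train CE (3.0987) is the auxiliary routing loss. MoE training typically adds a load-balancing penalty so the router spreads tokens across experts rather than collapsing onto a few. The snippet below sketches the common Switch-Transformer-style formulation as an assumption — the exact auxiliary loss Q-MoE uses is defined in the project's GitHub code, and `load_balancing_loss` is a hypothetical helper name.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts):
    # router_probs: (tokens, n_experts) softmax outputs of the router
    # expert_index: (tokens,) int64 id of the expert each token was sent to
    ones = torch.ones_like(expert_index, dtype=torch.float)
    # f_i: fraction of tokens actually dispatched to expert i
    dispatch = torch.zeros(n_experts).scatter_add_(0, expert_index, ones)
    dispatch = dispatch / expert_index.numel()
    # P_i: mean router probability assigned to expert i
    importance = router_probs.mean(dim=0)
    # minimized when both distributions are uniform across experts
    return n_experts * torch.sum(dispatch * importance)

# total loss = CE + aux_weight * balance term, matching the table's
# breakdown: 3.2190 (Train Loss) ≈ 3.0987 (CE) + ~0.12 (Aux).
```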
@@ -41,7 +76,7 @@ The model was evaluated at step **79,100**. The final validation metrics indicate
 
 ### Training Progress
 
-![
+
 
 ## 📝 Generation Example
 
@@ -57,24 +92,6 @@ The following example demonstrates the model's generation capabilities after training
 > While this is an obvious disadvantage for software development, the reality is that there are many aspects of software that are highly important to a programmer's day-to-day life. This is why even a moderately experienced programmer should never be concerned about this.
 >
 > The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. This gives your programmer freedom and confidence.
->
-> One of the most popular techniques for coding small functions in your computer is "code reuse." The same technique can be used by programmers in any number of different ways. Some programmers might write code to get the job done, and others develop it to get it to the end. They use the same tools as most programmers to get the job done.
->
-> The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. That technique can help you get the job done. It provides a way to write code that is easy to understand and maintain, and makes debugging easier.
-
-## 🛠️ Repository Contents
-
-This repository contains checkpoints compatible with both major frameworks:
-- **JAX/Flax:** The original training checkpoints (Orbit/Orbax format).
-- **PyTorch:** Converted weights for easier integration with the Hugging Face ecosystem (Safetensors).
-
-## 💻 Inference & Usage
-
-For inference code, architectural details, training pipeline and conversion scripts, please visit the official GitHub repository:
-
-👉 **[https://github.com/sidharth72/Q-MoE-400]**
-
-To run the model, you will likely need the custom modeling code provided in the GitHub repo, as this uses a specialized sparse MoE architecture.
 
 ## ⚙️ Training Details
 
@@ -95,5 +112,6 @@ If you find this model or the associated research useful, please cite:
   year = {2025},
   publisher = {Hugging Face},
   journal = {Hugging Face Repository},
-  howpublished = {\url{
-}
+  howpublished = {\url{https://huggingface.co/QuarkML/Q-MoE-400}}
+}
+```