---
license: apache-2.0
language:
- en
tags:
- moe
- sparse-mixture-of-experts
- jax
- flax
- pytorch
- text-generation
- openwebtext
- custom_code
---
# QMoE-400
**QMoE-400** is a 400M-parameter lightweight sparse Mixture-of-Experts (MoE) language model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.
This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can enable high-capacity models with a lower per-token inference cost, since only a subset of experts is active for each token.
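For readers unfamiliar with sparse routing, the core idea can be sketched as a top-k MoE layer: a small router scores every expert per token, and only the k best-scoring experts run. The snippet below is an illustrative sketch (dimensions, expert count, and the loop-based dispatch are arbitrary choices for clarity), not the actual QMoE-400 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k MoE layer: each token is routed to its k
    highest-scoring experts, and their outputs are combined weighted by
    the renormalized router probabilities."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
        topk_p, topk_i = probs.topk(self.k, dim=-1)      # (tokens, k)
        topk_p = topk_p / topk_p.sum(-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_i == e)                         # where expert e was selected
            rows = mask.any(-1)
            if rows.any():
                w = (topk_p * mask).sum(-1, keepdim=True)[rows]
                out[rows] += w * expert(x[rows])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 64))  # same shape as the input: (10, 64)
```

Production implementations vectorize the dispatch (and enforce a per-expert capacity) rather than looping over experts, but the routing math is the same.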
## 💻 Usage
You can use this model directly with the Hugging Face `transformers` library. Since this model uses a custom architecture, `trust_remote_code=True` is required.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
path = "QuarkML/QMoE-400"
# Load tokenizer and model
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # optional, but recommended on GPU
    device_map="auto",          # places the model on the available device (CUDA/CPU)
)
# Generate text
text = """
Title: Why Simplicity Matters in Software Design
Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile.
Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.
"""
inputs = tok(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=1.0,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
## 🎯 Project Goal
The primary goal of the QMoE project is to investigate:
1. **Compute Efficiency:** Analyzing how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
2. **Routing Dynamics:** Studying load balancing and expert specialization during pre-training.
3. **Interoperability:** Providing a bridge between research frameworks (JAX/Flax) and accessible inference (PyTorch).
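The "Routing Dynamics" goal ties directly to the auxiliary router loss reported in the metrics below. A common formulation (used here only as an illustrative sketch; the card does not specify QMoE-400's exact variant) is the Switch-Transformer-style load-balancing loss, which multiplies the fraction of tokens dispatched to each expert by the mean router probability for that expert:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_index, n_experts):
    """Illustrative Switch-Transformer-style auxiliary loss.
    f_e  = fraction of tokens dispatched to expert e
    P_e  = mean router probability assigned to expert e
    loss = n_experts * sum_e f_e * P_e   (equals 1.0 when routing is uniform)
    """
    probs = F.softmax(router_logits, dim=-1)                # (tokens, n_experts)
    f = F.one_hot(expert_index, n_experts).float().mean(0)  # dispatch fractions
    p = probs.mean(0)                                       # mean router probabilities
    return n_experts * torch.sum(f * p)

logits = torch.randn(32, 8)
top1 = logits.argmax(-1)          # top-1 dispatch decisions
aux = load_balancing_loss(logits, top1, 8)
```

Because both `f` and `p` are pushed toward the uniform distribution when this term is minimized, it discourages the router from collapsing onto a few experts.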
## 📊 Training Metrics
The model was evaluated at step **79,100**. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Step** | 79,100 | Total training steps |
| **Train Loss** | 3.2190 | Total training loss (CE + Aux) |
| **Train CE** | 3.0987 | Cross-Entropy loss on training data |
| **Val Loss** | 3.2028 | Total validation loss |
| **Val CE** | 3.0825 | Cross-Entropy loss on validation data |
| **Router Loss** | 0.1202 | Auxiliary load-balancing loss |
| **Dropped Tokens** | 0.0 | No tokens dropped (perfect capacity utilization) |
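As a quick sanity check, the reported totals are consistent with total loss = cross-entropy + auxiliary router loss (up to rounding in the last decimal place):

```python
train_ce, val_ce, router = 3.0987, 3.0825, 0.1202
assert abs((train_ce + router) - 3.2190) < 1e-3  # matches reported Train Loss
assert abs((val_ce + router) - 3.2028) < 1e-3    # matches reported Val Loss
```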
### Training Progress

## 📝 Generation Example
The following example demonstrates the model's generation capabilities after training.
**Prompt:**
> Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile.
> Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.
**Model Prediction:**
> is a key design feature in many software systems, from the ubiquitous Open Source Unix to the free software system to the proprietary Java APIs that allow a developer to do any other job at work.
>
> While this is an obvious disadvantage for software development, the reality is that there are many aspects of software that are highly important to a programmer's day-to-day life. This is why even a moderately experienced programmer should never be concerned about this.
>
> The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. This gives your programmer freedom and confidence.
## ⚙️ Training Details
- **Architecture:** Sparse Mixture of Experts (Transformer Decoder)
- **Parameters:** ~400M total; only a fraction are active per token due to sparse routing.
- **Dataset:** OpenWebText
- **Hardware:** 8 x TPU v3
- **Framework:** JAX / Flax
## 📜 Citation
If you find this model or the associated research useful, please cite:
```bibtex
@misc{qmoe-400,
  author       = {Quark Machine Learning},
  title        = {QMoE-400: A Sparse Mixture of Experts Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/QuarkML/QMoE-400}}
}
```