# QMoE-400
QMoE-400 is a 400 million parameter Sparse Mixture of Experts (MoE) model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.
This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can enable high-capacity models with lower inference costs.
## 💻 Usage
You can use this model directly with the Hugging Face `transformers` library. Since this model uses a custom architecture, `trust_remote_code=True` is required.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

path = "QuarkML/QMoE-400"

# Load tokenizer and model (custom architecture requires trust_remote_code)
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # optional but recommended on GPU
    device_map="auto",          # automatically places the model (CUDA/CPU)
)

# Generate text
text = """
Title: Why Simplicity Matters in Software Design
Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile.
Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.
"""

inputs = tok(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=1.0,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
## 🎯 Project Goal
The primary goal of the Q-MoE project is to investigate:
- Compute Efficiency: Analyzing how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
- Routing Dynamics: Studying load balancing and expert specialization during pre-training.
- Interoperability: Providing a bridge between research frameworks (JAX/Flax) and accessible inference (PyTorch).
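The routing dynamics mentioned above can be illustrated with a minimal top-1 routing sketch in the style of the Switch Transformer auxiliary load-balancing loss. This is an illustrative sketch only, not QMoE-400's actual router; all shapes and names below are hypothetical:

```python
import numpy as np

def route_top1(x, w_router):
    """Switch-style top-1 routing with an auxiliary load-balancing loss.

    Illustrative sketch; not the actual QMoE-400 implementation.
    """
    num_experts = w_router.shape[1]
    logits = x @ w_router                               # (tokens, experts)
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = z / z.sum(axis=-1, keepdims=True)           # softmax over experts
    expert_idx = probs.argmax(axis=-1)                  # chosen expert per token
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_idx, minlength=num_experts) / len(expert_idx)
    # P_i: mean router probability assigned to expert i
    p = probs.mean(axis=0)
    # Auxiliary loss N * sum_i f_i * P_i encourages balanced expert usage
    aux_loss = num_experts * float(np.sum(f * p))
    return expert_idx, aux_loss

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))    # 16 tokens, hypothetical d_model=32
w = rng.standard_normal((32, 4))     # router weights for 4 experts
idx, aux = route_top1(x, w)
```

The auxiliary loss is minimized when tokens are spread uniformly across experts, which is what the "Router Loss" metric below tracks during training.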
## 📊 Training Metrics
The model was evaluated at step 79,100. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.
| Metric | Value | Description |
|---|---|---|
| Step | 79,100 | Total training steps |
| Train Loss | 3.2190 | Total training loss (CE + Aux) |
| Train CE | 3.0987 | Cross-Entropy loss on training data |
| Val Loss | 3.2028 | Total validation loss |
| Val CE | 3.0825 | Cross-Entropy loss on validation data |
| Router Loss | 0.1202 | Auxiliary load-balancing loss |
| Dropped Tokens | 0.0 | No tokens exceeded expert capacity |
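As a quick sanity check, the reported totals decompose as total loss = cross-entropy + auxiliary router loss, agreeing to within rounding:

```python
# Reported metrics at step 79,100
train_ce, val_ce, router_loss = 3.0987, 3.0825, 0.1202
train_total, val_total = 3.2190, 3.2028

# Total loss = CE + auxiliary router loss (matches to within rounding)
assert abs((train_ce + router_loss) - train_total) < 1e-3
assert abs((val_ce + router_loss) - val_total) < 1e-3
```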
*Training Progress*
## 📝 Generation Example
The following example demonstrates the model's generation capabilities after training.
**Prompt:**

> Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile. Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.
**Model Prediction:**

> is a key design feature in many software systems, from the ubiquitous Open Source Unix to the free software system to the proprietary Java APIs that allow a developer to do any other job at work.
>
> While this is an obvious disadvantage for software development, the reality is that there are many aspects of software that are highly important to a programmer's day-to-day life. This is why even a moderately experienced programmer should never be concerned about this.
>
> The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. This gives your programmer freedom and confidence.
## ⚙️ Training Details
- Architecture: Sparse Mixture of Experts (Transformer Decoder)
- Parameters: ~400M total, with significantly fewer active parameters per token
- Dataset: OpenWebText
- Hardware: 8 x TPU v3
- Framework: JAX / Flax
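The gap between total and active parameters is what gives MoE models their efficiency advantage. A minimal sketch of this accounting for a single MoE feed-forward layer (the configuration numbers below are hypothetical, since the card does not publish QMoE-400's expert count or hidden sizes):

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Parameter counts for one MoE feed-forward layer.

    Two weight matrices per expert (up- and down-projection);
    biases and router weights ignored for simplicity.
    """
    per_expert = 2 * d_model * d_ff
    total = num_experts * per_expert       # parameters stored
    active = top_k * per_expert            # parameters used per token
    return total, active

# Hypothetical configuration -- NOT the actual QMoE-400 config
total, active = moe_ffn_params(d_model=512, d_ff=2048, num_experts=8, top_k=2)
print(total, active)
```

With 8 experts and top-2 routing, each token touches only a quarter of the layer's stored parameters, which is the compute/capacity trade-off the project studies.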
## 📚 Citation
If you find this model or the associated research useful, please cite:
```bibtex
@misc{qmoe-400,
  author       = {Quark Machine Learning},
  title        = {QMoE-400: A Sparse Mixture of Experts Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/QuarkML/QMoE-400}}
}
```