QMoE-400

QMoE-400 is a 400 million parameter Sparse Mixture of Experts (MoE) model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.

This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can enable high-capacity models with lower inference costs.
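The routing mechanism mentioned above can be illustrated with a minimal top-k token router. The exact QMoE-400 router configuration is not published here, so the expert count, model dimension, and k below are illustrative assumptions, not the model's actual settings:

```python
import numpy as np

def top_k_route(hidden, router_weight, k=2):
    """Sketch of standard top-k MoE routing (illustrative, not QMoE-400's code).

    hidden:        (tokens, d_model) token representations
    router_weight: (d_model, n_experts) learned router projection
    Returns the chosen expert ids and renormalized gate weights per token.
    """
    logits = hidden @ router_weight                              # (tokens, n_experts)
    logits = logits - logits.max(axis=-1, keepdims=True)         # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    expert_idx = np.argsort(-probs, axis=-1)[:, :k]              # top-k expert ids per token
    gate_vals = np.take_along_axis(probs, expert_idx, axis=-1)   # their router probabilities
    gate_vals = gate_vals / gate_vals.sum(axis=-1, keepdims=True)  # renormalize over the k experts
    return expert_idx, gate_vals

# toy example: 4 tokens, d_model = 8, 4 experts (all hypothetical sizes)
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 4))
idx, gates = top_k_route(h, w)
print(idx.shape, gates.shape)  # (4, 2) (4, 2)
```

Because only the k selected experts run per token, compute per token scales with k rather than with the total expert count, which is the source of the inference-cost advantage described above.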

πŸ’» Usage

You can use this model directly with the Hugging Face transformers library. Since this model uses a custom architecture, trust_remote_code=True is required.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

path = "QuarkML/QMoE-400"

# Load tokenizer and model
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # half precision; recommended on GPU
    device_map="auto"           # place on the available device (requires accelerate)
)

# Generate text
text = """

Title: Why Simplicity Matters in Software Design

Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile.

Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.

"""
inputs = tok(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=1.0  # sampling temperature; lower values give more deterministic output
)

print(tok.decode(out[0], skip_special_tokens=True))

🎯 Project Goal

The primary goal of the QMoE project is to investigate:

  1. Compute Efficiency: Analyzing how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
  2. Routing Dynamics: Studying load balancing and expert specialization during pre-training.
  3. Interoperability: Providing a bridge between research frameworks (JAX/Flax) and accessible inference (PyTorch).
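Point 1 can be made concrete with a back-of-the-envelope calculation. All sizes below (model dimension, FFN width, expert count, top-k) are illustrative assumptions, not the published QMoE-400 configuration:

```python
# Illustrative active-parameter estimate for one top-k MoE FFN layer.
# Every number here is a demonstration assumption, not QMoE-400's config.
d_model = 1024
d_ff = 4096
n_experts = 8      # experts stored per MoE layer (hypothetical)
top_k = 2          # experts activated per token (hypothetical)

params_per_expert = 2 * d_model * d_ff     # up- and down-projection matrices
total_ffn = n_experts * params_per_expert  # parameters stored in the layer
active_ffn = top_k * params_per_expert     # parameters touched per token

print(f"total FFN params:  {total_ffn:,}")                 # 67,108,864
print(f"active FFN params: {active_ffn:,}")                # 16,777,216
print(f"active fraction:   {active_ffn / total_ffn:.2%}")  # 25.00%
```

Under these assumptions, a dense model matching the sparse model's per-token compute would need roughly top_k/n_experts of its FFN capacity, which is the comparison point for the scaling study.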

πŸ“Š Training Metrics

The model was evaluated at step 79,100. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.

| Metric | Value | Description |
|---|---|---|
| Step | 79,100 | Total training steps |
| Train Loss | 3.2190 | Total training loss (CE + auxiliary) |
| Train CE | 3.0987 | Cross-entropy loss on training data |
| Val Loss | 3.2028 | Total validation loss |
| Val CE | 3.0825 | Cross-entropy loss on validation data |
| Router Loss | 0.1202 | Auxiliary load-balancing loss |
| Dropped Tokens | 0.0 | No tokens dropped (full capacity utilization) |
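The Router Loss row refers to an auxiliary load-balancing term. A common formulation, used here as an assumption about what the model optimizes rather than its exact implementation, is the Switch-Transformer-style loss: the dot product of the fraction of tokens dispatched to each expert with the mean router probability for that expert, scaled by the expert count.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_idx, n_experts):
    """Switch-Transformer-style auxiliary loss (sketch, not the exact QMoE-400 code).

    router_probs: (tokens, n_experts) softmax outputs of the router
    expert_idx:   (tokens,) top-1 expert chosen for each token
    """
    tokens = router_probs.shape[0]
    # f_i: fraction of tokens actually routed to expert i
    f = np.bincount(expert_idx, minlength=n_experts) / tokens
    # p_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    return n_experts * float(np.dot(f, p))

# Perfectly balanced routing over 4 experts attains the minimum value of 1.0.
probs = np.full((8, 4), 0.25)
idx = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balancing_loss(probs, idx, 4))  # 1.0
```

The loss is minimized when tokens are spread evenly across experts, which is consistent with the reported 0.0 dropped-token rate.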

Training Progress

[Training curve figure]

πŸ“ Generation Example

The following example demonstrates the model's generation capabilities after training.

Prompt:

Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile. Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.

Model Prediction:

is a key design feature in many software systems, from the ubiquitous Open Source Unix to the free software system to the proprietary Java APIs that allow a developer to do any other job at work.

While this is an obvious disadvantage for software development, the reality is that there are many aspects of software that are highly important to a programmer's day-to-day life. This is why even a moderately experienced programmer should never be concerned about this.

The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. This gives your programmer freedom and confidence.

βš™οΈ Training Details

  • Architecture: Sparse Mixture of Experts (Transformer Decoder)
  • Parameters: ~400M total; significantly fewer are active per token.
  • Dataset: OpenWebText
  • Hardware: 8 x TPU v3
  • Framework: JAX / Flax

πŸ“œ Citation

If you find this model or the associated research useful, please cite:

@misc{qmoe-400,
  author = {Quark Machine Learning},
  title = {QMoE-400: A Sparse Mixture of Experts Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/QuarkML/QMoE-400}}
}