# QMoE-400
QMoE-400 is a 400 million parameter Sparse Mixture of Experts (MoE) model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.
This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can enable high-capacity models with lower inference costs.
## 💻 Usage
You can use this model directly with the Hugging Face `transformers` library. Since this model uses a custom architecture, `trust_remote_code=True` is required.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

path = "QuarkML/QMoE-400"

# Load tokenizer and model (custom architecture requires trust_remote_code)
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # optional but recommended on GPU
    device_map="auto",          # automatically places the model (CUDA/CPU)
)

# Generate text
text = """
Title: Why Simplicity Matters in Software Design
Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile.
Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.
"""

inputs = tok(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=1.0,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
## 🎯 Project Goal
The primary goal of the Q-MoE project is to investigate:
- Compute Efficiency: Analyzing how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
- Routing Dynamics: Studying load balancing and expert specialization during pre-training.
- Interoperability: Providing a bridge between research frameworks (JAX/Flax) and accessible inference (PyTorch).
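The routing dynamics mentioned above can be illustrated with a minimal top-1 routing sketch in the style of the Switch Transformer auxiliary load-balancing loss. This is an illustrative sketch only, not QMoE-400's actual router; all shapes and names below are hypothetical:

```python
import numpy as np

def route_top1(x, w_router):
    """Switch-style top-1 routing with an auxiliary load-balancing loss.

    Illustrative sketch; not the actual QMoE-400 implementation.
    """
    num_experts = w_router.shape[1]
    logits = x @ w_router                               # (tokens, experts)
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = z / z.sum(axis=-1, keepdims=True)           # softmax over experts
    expert_idx = probs.argmax(axis=-1)                  # chosen expert per token
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_idx, minlength=num_experts) / len(expert_idx)
    # P_i: mean router probability assigned to expert i
    p = probs.mean(axis=0)
    # Auxiliary loss N * sum_i f_i * P_i encourages balanced expert usage
    aux_loss = num_experts * float(np.sum(f * p))
    return expert_idx, aux_loss

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))    # 16 tokens, hypothetical d_model=32
w = rng.standard_normal((32, 4))     # router weights for 4 experts
idx, aux = route_top1(x, w)
```

The auxiliary loss is minimized when tokens are spread uniformly across experts, which is what the "Router Loss" metric below tracks during training.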
## 📊 Training Metrics
The model was evaluated at step 79,100. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.
| Metric | Value | Description |
|---|---|---|
| Step | 79,100 | Total training steps |
| Train Loss | 3.2190 | Total training loss (CE + Aux) |
| Train CE | 3.0987 | Cross-Entropy loss on training data |
| Val Loss | 3.2028 | Total validation loss |
| Val CE | 3.0825 | Cross-Entropy loss on validation data |
| Router Loss | 0.1202 | Auxiliary load-balancing loss |
| Dropped Tokens | 0.0 | No tokens exceeded expert capacity |
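As a quick sanity check, the reported totals decompose as total loss = cross-entropy + auxiliary router loss, agreeing to within rounding:

```python
# Reported metrics at step 79,100
train_ce, val_ce, router_loss = 3.0987, 3.0825, 0.1202
train_total, val_total = 3.2190, 3.2028

# Total loss = CE + auxiliary router loss (matches to within rounding)
assert abs((train_ce + router_loss) - train_total) < 1e-3
assert abs((val_ce + router_loss) - val_total) < 1e-3
```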
*Training Progress*
## 📝 Generation Example
The following example demonstrates the model's generation capabilities after training.
**Prompt:**

> Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile. Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.
**Model Prediction:**

> is a key design feature in many software systems, from the ubiquitous Open Source Unix to the free software system to the proprietary Java APIs that allow a developer to do any other job at work.
>
> While this is an obvious disadvantage for software development, the reality is that there are many aspects of software that are highly important to a programmer's day-to-day life. This is why even a moderately experienced programmer should never be concerned about this.
>
> The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. This gives your programmer freedom and confidence.
## ⚙️ Training Details
- Architecture: Sparse Mixture of Experts (Transformer Decoder)
- Parameters: ~400M total, with significantly fewer active parameters per token
- Dataset: OpenWebText
- Hardware: 8 x TPU v3
- Framework: JAX / Flax
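The gap between total and active parameters is what gives MoE models their efficiency advantage. A minimal sketch of this accounting for a single MoE feed-forward layer (the configuration numbers below are hypothetical, since the card does not publish QMoE-400's expert count or hidden sizes):

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Parameter counts for one MoE feed-forward layer.

    Two weight matrices per expert (up- and down-projection);
    biases and router weights ignored for simplicity.
    """
    per_expert = 2 * d_model * d_ff
    total = num_experts * per_expert       # parameters stored
    active = top_k * per_expert            # parameters used per token
    return total, active

# Hypothetical configuration -- NOT the actual QMoE-400 config
total, active = moe_ffn_params(d_model=512, d_ff=2048, num_experts=8, top_k=2)
print(total, active)
```

With 8 experts and top-2 routing, each token touches only a quarter of the layer's stored parameters, which is the compute/capacity trade-off the project studies.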
## 📚 Citation
If you find this model or the associated research useful, please cite:
```bibtex
@misc{qmoe-400,
  author       = {Quark Machine Learning},
  title        = {QMoE-400: A Sparse Mixture of Experts Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/QuarkML/QMoE-400}}
}
```