---
license: apache-2.0
language:
- en
tags:
- moe
- sparse-mixture-of-experts
- jax
- flax
- pytorch
- text-generation
- openwebtext
- custom_code
---

# QMoE-400

**QMoE-400** is a 400-million-parameter lightweight Sparse Mixture of Experts (MoE) model trained on the OpenWebText dataset using JAX/Flax on 8 TPU v3 chips.

This model serves as a research artifact for studying the compute efficiency of sparse architectures compared to dense transformers. It demonstrates how routing mechanisms can enable high-capacity models with lower inference costs.

## 💻 Usage

You can use this model directly with the Hugging Face `transformers` library. Since this model uses a custom architecture, `trust_remote_code=True` is required.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

path = "QuarkML/QMoE-400"

# Load tokenizer and model
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # optional but recommended on GPU
    device_map="auto"           # automatically maps to the available device (CUDA/CPU)
)

# Generate text
text = """
Title: Why Simplicity Matters in Software Design

Many software systems become difficult to maintain not because the problems are hard,
but because unnecessary complexity accumulates over time. Extra abstractions, premature
optimizations, and unclear design choices often make systems fragile.

Experienced engineers tend to favor simple designs that are easy to understand, test,
and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than
lack of sophistication.
"""

inputs = tok(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

out = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=1.0
)

print(tok.decode(out[0], skip_special_tokens=True))
```

## 🎯 Project Goal

The primary goal of the Q-MoE project is to investigate:

1.
**Compute Efficiency:** Analyzing how sparse MoE models scale compared to dense counterparts with similar active parameter counts.
2. **Routing Dynamics:** Studying load balancing and expert specialization during pre-training.
3. **Interoperability:** Providing a bridge between research frameworks (JAX/Flax) and accessible inference (PyTorch).

## 📊 Training Metrics

The model was evaluated at step **79,100**. The final validation metrics indicate stable routing and convergence on the OpenWebText validation split.

| Metric | Value | Description |
| :--- | :--- | :--- |
| **Step** | 79,100 | Total training steps |
| **Train Loss** | 3.2190 | Total training loss (CE + aux) |
| **Train CE** | 3.0987 | Cross-entropy loss on training data |
| **Val Loss** | 3.2028 | Total validation loss |
| **Val CE** | 3.0825 | Cross-entropy loss on validation data |
| **Router Loss** | 0.1202 | Auxiliary load-balancing loss |
| **Dropped Tokens** | 0.0 | No tokens dropped (perfect capacity utilization) |

### Training Progress

![Training Curve](https://cdn-uploads.huggingface.co/production/uploads/64054e5e0ab5e22719fc179f/CALqiEjv1HahbLnZrbLPi.png)

## 📝 Generation Example

The following example demonstrates the model's generation capabilities after training.

**Prompt:**

> Many software systems become difficult to maintain not because the problems are hard, but because unnecessary complexity accumulates over time. Extra abstractions, premature optimizations, and unclear design choices often make systems fragile.
>
> Experienced engineers tend to favor simple designs that are easy to understand, test, and evolve. Simplicity reflects clarity of thought and strong fundamentals rather than lack of sophistication.

**Model Prediction:**

> is a key design feature in many software systems, from the ubiquitous Open Source Unix to the free software system to the proprietary Java APIs that allow a developer to do any other job at work.
>
> While this is an obvious disadvantage for software development, the reality is that there are many aspects of software that are highly important to a programmer's day-to-day life. This is why even a moderately experienced programmer should never be concerned about this.
>
> The best way to learn about your code is by going through its source code. That way, it's always safe to do something new when writing code. This gives your programmer freedom and confidence.

## ⚙️ Training Details

- **Architecture:** Sparse Mixture of Experts (Transformer decoder)
- **Parameters:** ~400M total, with significantly fewer active parameters per token
- **Dataset:** OpenWebText
- **Hardware:** 8 x TPU v3
- **Framework:** JAX / Flax

## 📜 Citation

If you find this model or the associated research useful, please cite:

```bibtex
@misc{qmoe-400,
  author       = {Quark Machine Learning},
  title        = {QMoE-400: A Sparse Mixture of Experts Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/QuarkML/QMoE-400}}
}
```
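## 🧮 Appendix: Load-Balancing Loss Sketch

To make the reported **Router Loss** concrete, the sketch below shows one common formulation of top-1 routing with a Switch-Transformer-style auxiliary load-balancing loss (aux = E · Σᵢ fᵢ · Pᵢ). This is an illustration only, not the QMoE-400 implementation: the function name `top1_route` and the exact loss formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def top1_route(logits: torch.Tensor, num_experts: int):
    """Top-1 token routing with a Switch-style auxiliary load-balancing loss.

    logits: (num_tokens, num_experts) raw router scores for each token.
    Returns (expert_ids, gate_probs, aux_loss).
    """
    probs = F.softmax(logits, dim=-1)            # (T, E) routing probabilities
    gate, expert = probs.max(dim=-1)             # chosen expert and its gate probability
    # f_i: fraction of tokens dispatched to expert i
    dispatch = F.one_hot(expert, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    importance = probs.mean(dim=0)
    # aux = E * sum_i f_i * P_i; equals 1.0 under perfectly uniform load
    aux_loss = num_experts * torch.sum(dispatch * importance)
    return expert, gate, aux_loss
```

During training, a loss of this shape is typically added to the cross-entropy objective with a small coefficient, which matches the "Total training loss (CE + aux)" breakdown in the metrics table above.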