---
license: apache-2.0
language:
- en
tags:
- mixture-of-experts
- mixture-of-recursions
- causal-lm
- custom-architecture
- pytorch
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
---

# HybridMoRMoE: Hybrid Mixture-of-Recursions & Mixture-of-Experts

A custom causal language model combining **Mixture-of-Recursions (MoR)** with **Mixture-of-Experts (MoE)** routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).

---

## Architecture

| Hyperparameter | Value |
|---|---|
| Model type | `hybrid_mor_moe` |
| Hidden dim (`d_model`) | 576 |
| Feed-forward dim (`d_ff`) | 1536 |
| Attention heads | 8 |
| Base layers | 6 |
| Shared recursive blocks | 6 |
| Unique last layers | 2 |
| Total transformer depth | 30 |
| Number of experts | 4 |
| Experts per token | 1 |
| Max recursions | 3 |
| Router percentile | 0.70 |
| Sequence length | 4096 |
| Vocabulary size | 151,665 |
| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |

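A few derived quantities follow directly from the table. The calculation below is plain arithmetic on the listed values. It assumes the usual convention that `d_model` is split evenly across attention heads and that the token embedding is a single `vocab_size × d_model` matrix; neither is confirmed by this card, and the variable names are illustrative. Under those assumptions the embedding alone accounts for roughly 87 million parameters.

```python
# Values copied from the hyperparameter table.
d_model = 576
d_ff = 1536
n_heads = 8
vocab_size = 151_665

# Assumption: d_model is divided evenly across heads (standard multi-head attention).
head_dim, remainder = divmod(d_model, n_heads)

# Assumption: one vocab_size x d_model token-embedding matrix.
embedding_params = vocab_size * d_model

# Feed-forward expansion ratio relative to the hidden size.
ff_ratio = d_ff / d_model

print(f"head_dim         = {head_dim} (remainder {remainder})")
print(f"embedding_params = {embedding_params:,}")
print(f"ff_ratio         = {ff_ratio:.3f}")
```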

**Key design choices:**
- Shared weight blocks are recursively applied based on a learned complexity score
- A per-token MoE router selects which expert processes each position
- Auxiliary routing loss (`router_aux_loss_coef = 1e-4`) encourages load balance
- Chat template follows the ChatML (`<|im_start|>` / `<|im_end|>`) format

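To make the routing and load-balancing bullets concrete, here is a framework-free sketch of top-1 expert selection with an auxiliary balance term. The router in `hybrid_mor_moe_training.py` is not reproduced in this card, so the loss used here (a Switch-Transformer-style term, `num_experts * sum(f_i * P_i)`) is an assumption, and `softmax` and `top1_route` are hypothetical helper names. Only the four experts, one expert per token, and the `1e-4` coefficient come from this card. The balance term equals 1.0 when tokens and probability mass are spread evenly, and grows when routing collapses onto one expert.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_route(router_logits, aux_coef=1e-4):
    """Pick one expert per token and return (assignment, scaled auxiliary loss).

    router_logits: list of per-token lists, one logit per expert.
    Assumption: Switch-style balance term, not necessarily what the repo uses.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Top-1 routing: each token goes to its highest-probability expert.
    assignment = [max(range(num_experts), key=row.__getitem__) for row in probs]

    # f_i: fraction of tokens dispatched to expert i.
    frac_tokens = [assignment.count(e) / num_tokens for e in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(row[e] for row in probs) / num_tokens for e in range(num_experts)]

    balance = num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))
    return assignment, aux_coef * balance

# Four tokens, four experts: one balanced case and one collapsed case.
balanced_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
collapsed_logits = [[2.0, 0.0, 0.0, 0.0]] * 4

balanced_assignment, balanced_aux = top1_route(balanced_logits)
collapsed_assignment, collapsed_aux = top1_route(collapsed_logits)
print(balanced_assignment, balanced_aux)
print(collapsed_assignment, collapsed_aux)
```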

---

## Training Pipeline

The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):

| Stage | Method | Notes |
|---|---|---|
| 1 | **Pre-training** | Causal LM on open-domain text |
| 2 | **SFT** (Supervised Fine-Tuning) | Instruction following with packing |
| 3 | **GRPO** (Group Relative Policy Optimisation) | Reinforcement learning from a preference signal |

Training used FP16 precision throughout (the P100 has no BF16 support).

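FP16 has a much narrower dynamic range than BF16, so very small gradient values can underflow to zero. Mixed-precision training commonly counters this with loss scaling. Whether the training script applies loss scaling is an assumption here; the sketch only illustrates the numerical effect, using the standard library's half-precision `struct` format. `to_fp16`, the gradient value, and the scale factor are all illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8        # illustrative gradient magnitude
loss_scale = 65536.0    # illustrative power-of-two scale (2**16)

# Stored directly in FP16, the value is below the smallest subnormal and is lost.
unscaled = to_fp16(tiny_grad)

# Scaling first keeps the value inside FP16's representable range.
scaled = to_fp16(tiny_grad * loss_scale)

# Unscaling in higher precision recovers the value up to FP16 rounding error.
recovered = scaled / loss_scale

print(unscaled, recovered)
```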

---

## Usage

Because this model uses a **custom architecture** that is not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.

### Quick inference

```python
import torch
from transformers import AutoTokenizer

# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
#    (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)

from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

model_path = "TorchLLM/HybridMoRMoE"  # or local path

config = HybridMoRMoEConfig.from_pretrained(model_path)
model = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

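`simple_generate` is defined in the repo's training script and its internals are not shown in this card. As a rough illustration of what `temperature` and `top_p` conventionally mean, the sketch below applies temperature scaling followed by nucleus (top-p) filtering to a toy logit vector. `sample_top_p` is a hypothetical helper, and treating it as equivalent to the model's own sampling is an assumption.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id with temperature scaling and nucleus (top-p) filtering.

    Returns (sampled_id, kept_ids) so the nucleus can be inspected.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalises the weights internally, so this samples
    # from the renormalised nucleus.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept

# Toy vocabulary of four tokens; a seeded generator keeps the draw repeatable.
token_id, nucleus = sample_top_p([4.0, 3.0, 0.0, -2.0], rng=random.Random(0))
print(token_id, nucleus)
```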

### Environment setup

```bash
pip install torch transformers trl datasets accelerate
```

> **HF_TOKEN**: If you need to access gated datasets during re-training, export your token:
> ```bash
> export HF_TOKEN="your_token_here"
> ```
> Never hard-code tokens in source files.

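
On the Python side, the token can be read from the environment at run time instead of being written into the script. This is a minimal sketch; `get_hf_token` is a hypothetical helper, not part of this repo or of any Hugging Face library.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token from the environment, or None if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None

token = get_hf_token()
print("HF token found" if token else "HF_TOKEN is not set; gated datasets will be unavailable")
```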
---

## Repository Structure

```
TorchLLM/HybridMoRMoE/
├── config.json                  # Model architecture config
├── generation_config.json       # Default generation settings
├── model.safetensors            # Trained weights (SafeTensors format)
├── tokenizer.json               # Tokenizer vocabulary & rules
├── tokenizer_config.json        # Tokenizer metadata
├── chat_template.jinja          # ChatML chat template
└── hybrid_mor_moe_training.py   # Full training pipeline source
```
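
`chat_template.jinja` is described as a ChatML template. The template file itself is not reproduced in this card (it may, for example, inject a default system prompt), so the following is only a plain-Python sketch of the general ChatML layout. `render_chatml` is a hypothetical helper, not part of the repo or of Transformers; for real prompts, use `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the general ChatML layout."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([{"role": "user", "content": "Hello!"}])
print(prompt)
```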

---

## Citation

If you use this model or training code in your research, please cite:

```bibtex
@misc{hybridmormoe2025,
  title  = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
  author = {Abhishek Gandhi},
  year   = {2026},
  url    = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}
```

---

## License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.