---
license: apache-2.0
language:
- en
tags:
- mixture-of-experts
- mixture-of-recursions
- causal-lm
- custom-architecture
- pytorch
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
---
# HybridMoRMoE: Hybrid Mixture-of-Recursions & Mixture-of-Experts
A custom causal language model combining **Mixture-of-Recursions (MoR)** with **Mixture-of-Experts (MoE)** routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).
---
## Architecture
| Hyperparameter | Value |
|---|---|
| Model type | `hybrid_mor_moe` |
| Hidden dim (`d_model`) | 576 |
| Feed-forward dim (`d_ff`) | 1536 |
| Attention heads | 8 |
| Base layers | 6 |
| Shared recursive blocks | 6 |
| Unique last layers | 2 |
| Total transformer depth | 30 |
| Number of experts | 4 |
| Experts per token | 1 |
| Max recursions | 3 |
| Router percentile | 0.70 |
| Sequence length | 4096 |
| Vocabulary size | 151,665 |
| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |
**Key design choices:**
- Shared weight blocks are recursively applied based on a learned complexity score
- A per-token MoE router selects which expert processes each position
- Auxiliary routing loss (`router_aux_loss_coef = 1e-4`) encourages load balance
- Chat template follows the ChatML (`<|im_start|>` / `<|im_end|>`) format
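
The per-token routing and load-balancing points above can be illustrated with a small, dependency-free sketch. It assumes top-1 routing (one expert per token, as in the table) and a Switch-Transformer-style balance penalty; the exact loss used in `hybrid_mor_moe_training.py` may differ, and the names `route_top1` and `softmax` are hypothetical helpers, not part of the repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits):
    """Pick one expert per token and compute a load-balancing penalty.

    router_logits: one list of per-expert logits for each token.
    Returns (expert index per token, unscaled auxiliary loss).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Top-1 routing: each token goes to its highest-probability expert.
    chosen = [max(range(num_experts), key=row.__getitem__) for row in probs]

    # f[i]: fraction of tokens dispatched to expert i.
    f = [chosen.count(i) / num_tokens for i in range(num_experts)]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]

    # Assumed penalty form (Switch-Transformer style): N * sum_i f[i] * p[i].
    aux = num_experts * sum(fi * pi for fi, pi in zip(f, p))
    return chosen, aux


ROUTER_AUX_LOSS_COEF = 1e-4  # coefficient quoted in the list above

# Four tokens, four experts, each token preferring a different expert.
balanced_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.1],
    [0.1, 0.1, 2.0, 0.1],
    [0.1, 0.1, 0.1, 2.0],
]
# Four tokens that all prefer expert 0 (a collapsed router).
collapsed_logits = [[2.0, 0.1, 0.1, 0.1] for _ in range(4)]

experts, aux_balanced = route_top1(balanced_logits)
_, aux_collapsed = route_top1(collapsed_logits)
print(experts)
print(ROUTER_AUX_LOSS_COEF * aux_balanced, ROUTER_AUX_LOSS_COEF * aux_collapsed)
```

Under this assumed form the penalty is smallest when tokens are spread evenly across experts and grows as the router collapses onto a single expert, which is the behaviour the auxiliary coefficient is meant to discourage.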
---
## Training Pipeline
The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):
| Stage | Method | Notes |
|---|---|---|
| 1 | **Pre-training** | Causal LM on open-domain text |
| 2 | **SFT** (Supervised Fine-Tuning) | Instruction following with packing |
| 3 | **GRPO** (Group Relative Policy Optimisation) | Reinforcement learning from a preference signal |
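
As a rough illustration of the "group relative" part of stage 3, the sketch below normalises the scores of several completions sampled for the same prompt against their own group mean and standard deviation. This is a generic sketch, not the training script's code: the reward values are made up, `group_relative_advantages` is a hypothetical helper, and the use of the population standard deviation is an assumption (implementations differ).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise the rewards of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a hypothetical reward function.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.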
Training used FP16 precision throughout (P100 has no BF16 support).
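
One practical consequence of FP16 is its narrow dynamic range: very small gradient values round to zero. The sketch below demonstrates this with the standard library's half-precision `struct` format and shows how multiplying by a scale factor before the cast preserves the value. Whether this training run used loss scaling is not stated here; the gradient value and scale factor are hypothetical and chosen only for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8       # hypothetical gradient value
loss_scale = 65536.0   # hypothetical scale factor (2**16)

unscaled = to_fp16(tiny_grad)              # below the FP16 range, rounds to zero
scaled = to_fp16(tiny_grad * loss_scale)   # representable after scaling
recovered = scaled / loss_scale            # unscale in higher precision

print(unscaled, scaled, recovered)
```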
---
## Usage
Because this model uses a **custom architecture** not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.
### Quick inference
```python
import torch
from transformers import AutoTokenizer

# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
#    (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)
from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

model_path = "TorchLLM/HybridMoRMoE"  # or local path
config = HybridMoRMoEConfig.from_pretrained(model_path)
model = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
# Render the conversation with the ChatML template and open an assistant turn
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
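
To make the prompt format concrete, here is a minimal sketch of the plain ChatML layout that `apply_chat_template` is expected to produce for a single user turn. The repository's `chat_template.jinja` is authoritative and may differ, for example by injecting a default system message; `render_chatml` is a hypothetical helper written for illustration only.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in plain ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Hello"}]
prompt = render_chatml(messages)
print(prompt)
```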
### Environment setup
```bash
pip install torch transformers trl datasets accelerate
```
> **HF_TOKEN**: If you need to access gated datasets during re-training, export your token:
> ```bash
> export HF_TOKEN="your_token_here"
> ```
> Never hard-code tokens in source files.
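
On the Python side, a script can read the token from the environment instead of embedding it. The helper below is a small sketch; `get_hf_token` is a hypothetical name and not part of the repository.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or None if it is not set."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; gated datasets will not be accessible.")
```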
---
## Repository Structure
```
TorchLLM/HybridMoRMoE/
├── config.json                  # Model architecture config
├── generation_config.json       # Default generation settings
├── model.safetensors            # Trained weights (SafeTensors format)
├── tokenizer.json               # Tokenizer vocabulary & rules
├── tokenizer_config.json        # Tokenizer metadata
├── chat_template.jinja          # ChatML chat template
└── hybrid_mor_moe_training.py   # Full training pipeline source
```
---
## Citation
If you use this model or training code in your research, please cite:
```bibtex
@misc{hybridmormoe2025,
title = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
author = {Abhishek Gandhi},
year = {2026},
url = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}
```
---
## License
Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.