---
license: apache-2.0
language:
- en
tags:
- mixture-of-experts
- mixture-of-recursions
- causal-lm
- custom-architecture
- pytorch
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
---

# HybridMoRMoE — Hybrid Mixture-of-Recursions & Mixture-of-Experts

A custom causal language model combining **Mixture-of-Recursions (MoR)** with **Mixture-of-Experts (MoE)** routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).

---

## Architecture

| Hyperparameter | Value |
|---|---|
| Model type | `hybrid_mor_moe` |
| Hidden dim (`d_model`) | 576 |
| Feed-forward dim (`d_ff`) | 1536 |
| Attention heads | 8 |
| Base layers | 6 |
| Shared recursive blocks | 6 |
| Unique last layers | 2 |
| Total transformer depth | 30 |
| Number of experts | 4 |
| Experts per token | 1 |
| Max recursions | 3 |
| Router percentile | 0.70 |
| Sequence length | 4096 |
| Vocabulary size | 151,665 |
| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |

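To make the arithmetic behind the table concrete, here is a minimal sketch that mirrors some of the values above in a plain dataclass. `ArchSketch` and its field names are hypothetical and are not the repository's `HybridMoRMoEConfig`; it only derives two quantities that follow directly from the table: the per-head dimension and the size of one token-embedding matrix (whether input and output embeddings are tied is not stated here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSketch:
    """Hypothetical container for the table values; not the repo's config class."""
    d_model: int = 576
    d_ff: int = 1536
    num_heads: int = 8
    num_experts: int = 4
    experts_per_token: int = 1
    max_recursions: int = 3
    router_percentile: float = 0.70
    max_seq_len: int = 4096
    vocab_size: int = 151_665

    @property
    def head_dim(self) -> int:
        # d_model must split evenly across the attention heads
        assert self.d_model % self.num_heads == 0
        return self.d_model // self.num_heads

    @property
    def embedding_params(self) -> int:
        # Parameters in one (vocab_size x d_model) embedding matrix
        return self.vocab_size * self.d_model


cfg = ArchSketch()
print(cfg.head_dim)          # 72
print(cfg.embedding_params)  # 87359040
```
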
**Key design choices:**
- Shared weight blocks are recursively applied based on a learned complexity score
- A per-token MoE router selects which expert processes each position
- An auxiliary routing loss (`router_aux_loss_coef = 1e-4`) encourages balanced load across experts
- Chat template follows the ChatML (`<|im_start|>` / `<|im_end|>`) format

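The routing ideas in the list above can be sketched in a few lines of PyTorch. This is an illustration under stated assumptions, not the code in `hybrid_mor_moe_training.py`: the load-balancing term uses the Switch Transformer form (number of experts times the sum over experts of dispatch fraction times mean router probability), and the recursion gate assumes that tokens whose complexity score exceeds the 0.70 quantile are sent through the shared blocks again. The function names `top1_route` and `recursion_mask` are hypothetical.

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 4               # "Number of experts" in the table
ROUTER_AUX_LOSS_COEF = 1e-4   # router_aux_loss_coef from the list above
ROUTER_PERCENTILE = 0.70      # "Router percentile" in the table


def top1_route(router_logits: torch.Tensor):
    """Top-1 routing with an assumed Switch-style load-balancing loss.

    router_logits has shape (num_tokens, num_experts).
    Returns the chosen expert per token and the scaled auxiliary loss.
    """
    probs = F.softmax(router_logits, dim=-1)
    expert_idx = probs.argmax(dim=-1)  # one expert per token
    num_experts = router_logits.shape[-1]
    # Fraction of tokens dispatched to each expert
    dispatch_frac = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # Mean router probability assigned to each expert
    mean_prob = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(dispatch_frac * mean_prob)
    return expert_idx, ROUTER_AUX_LOSS_COEF * aux_loss


def recursion_mask(complexity: torch.Tensor, percentile: float = ROUTER_PERCENTILE):
    """Assumed gate: tokens scoring above the given quantile recurse once more."""
    threshold = torch.quantile(complexity, percentile)
    return complexity > threshold


torch.manual_seed(0)
expert_idx, aux = top1_route(torch.randn(16, NUM_EXPERTS))
mask = recursion_mask(torch.rand(16))
```
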
---

## Training Pipeline

The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):

| Stage | Method | Notes |
|---|---|---|
| 1 | **Pre-training** | Causal LM on open-domain text |
| 2 | **SFT** (Supervised Fine-Tuning) | Instruction following with packing |
| 3 | **GRPO** (Group Relative Policy Optimisation) | Reinforcement learning from a preference signal |

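The "group relative" part of GRPO refers to scoring each sampled completion against the other completions drawn for the same prompt. A minimal sketch of that normalisation follows; the reward values are hypothetical, and the use of the population standard deviation is an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```
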
Training used FP16 precision throughout (the P100 has no native BF16 support).

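FP16 has a much narrower dynamic range than BF16, which is why FP16 training setups commonly multiply the loss by a large factor before the backward pass. Whether this pipeline does so is not stated here; the sketch below only demonstrates the underlying numeric effect with the standard library, using a hypothetical gradient magnitude and a hypothetical scale of 2^16.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8        # hypothetical gradient magnitude
loss_scale = 2.0 ** 16  # hypothetical static loss scale

unscaled = to_fp16(tiny_grad)             # underflows to zero in FP16
scaled = to_fp16(tiny_grad * loss_scale)  # stays representable

print(unscaled)  # 0.0
print(scaled)
```
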
---

## Usage

Because this model uses a **custom architecture** not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.

### Quick inference

```python
import torch
from transformers import AutoTokenizer

# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
#    (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)

from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

model_path = "TorchLLM/HybridMoRMoE"  # or local path

config = HybridMoRMoEConfig.from_pretrained(model_path)
model  = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

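The `temperature` and `top_p` arguments in the snippet above are standard sampling controls. The internals of `simple_generate` are not shown in this card, so the following is only a plain-Python sketch of how temperature scaling and nucleus (top-p) filtering are usually combined; `top_p_distribution` is a hypothetical helper and the logits are made up.

```python
import math


def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Return the renormalised distribution after temperature and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


# Hypothetical logits over a four-token vocabulary
dist = top_p_distribution([2.0, 1.0, 0.0, -3.0])
print(dist)
```

A token would then be drawn from `dist`, for example with `random.choices(range(len(dist)), weights=dist)`.
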
### Environment setup

```bash
pip install torch transformers trl datasets accelerate
```

> **HF_TOKEN**: If you need to access gated datasets during re-training, export your token:
> ```bash
> export HF_TOKEN="your_token_here"
> ```
> Never hard-code tokens in source files.

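One way to follow the advice above in Python code is to read the token from the environment at the point of use and fail early when it is missing. The helper name `get_hf_token` is hypothetical; it is a sketch, not part of the training script.

```python
import os


def get_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Read the Hugging Face token from the environment, failing loudly if absent."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; export it before accessing gated datasets."
        )
    return token
```
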
---

## Repository Structure

```
TorchLLM/HybridMoRMoE/
├── config.json                  # Model architecture config
├── generation_config.json       # Default generation settings
├── model.safetensors            # Trained weights (SafeTensors format)
├── tokenizer.json               # Tokenizer vocabulary & rules
├── tokenizer_config.json        # Tokenizer metadata
├── chat_template.jinja          # ChatML chat template
└── hybrid_mor_moe_training.py   # Full training pipeline source
```

---

## Citation

If you use this model or training code in your research, please cite:

```bibtex
@misc{hybridmormoe2026,
  title  = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
  author = {Abhishek Gandhi},
  year   = {2026},
  url    = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}
```

---

## License

Apache 2.0 — see [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.