---
license: apache-2.0
language:
- en
tags:
- mixture-of-experts
- mixture-of-recursions
- causal-lm
- custom-architecture
- pytorch
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
---

# HybridMoRMoE: Hybrid Mixture-of-Recursions & Mixture-of-Experts

A custom causal language model combining **Mixture-of-Recursions (MoR)** with **Mixture-of-Experts (MoE)** routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).

---

## Architecture

| Hyperparameter | Value |
|---|---|
| Model type | `hybrid_mor_moe` |
| Hidden dim (`d_model`) | 576 |
| Feed-forward dim (`d_ff`) | 1536 |
| Attention heads | 8 |
| Base layers | 6 |
| Shared recursive blocks | 6 |
| Unique last layers | 2 |
| Total transformer depth | 30 |
| Number of experts | 4 |
| Experts per token | 1 |
| Max recursions | 3 |
| Router percentile | 0.70 |
| Sequence length | 4096 |
| Vocabulary size | 151,665 |
| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |
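
A few quantities follow directly from the table. The sketch below derives the per-head dimension and the size of the token-embedding table. It assumes a single embedding matrix of shape `vocab_size x d_model` stored in FP16, which the table does not state, and the variable names are illustrative rather than taken from `config.json`.

```python
# Values copied from the hyperparameter table above.
d_model = 576
n_heads = 8
vocab_size = 151_665

# Per-head dimension: d_model must divide evenly across the heads.
assert d_model % n_heads == 0
head_dim = d_model // n_heads

# Assumption: one (vocab_size x d_model) embedding matrix stored in FP16.
embedding_params = vocab_size * d_model
embedding_mib_fp16 = embedding_params * 2 / 2**20  # 2 bytes per FP16 value

print(f"head_dim = {head_dim}")
print(f"embedding parameters = {embedding_params:,}")
print(f"embedding size in FP16 = {embedding_mib_fp16:.1f} MiB")
```

With a vocabulary this large relative to `d_model`, the embedding table is likely to account for a substantial share of the total parameter count.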

**Key design choices:**
- Shared weight blocks are recursively applied based on a learned complexity score
- A per-token MoE router selects which expert processes each position
- Auxiliary routing loss (`router_aux_loss_coef = 1e-4`) encourages load balance
- Chat template follows the ChatML (`<|im_start|>` / `<|im_end|>`) format
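
The routing bullets above can be illustrated with a small, dependency-free sketch of top-1 expert selection plus a load-balancing term. The balance formula is the Switch-Transformer-style one and is an assumption, not the formula confirmed to be used in `hybrid_mor_moe_training.py`. The function names and example logits are hypothetical, and a real implementation would operate on PyTorch tensors instead of Python lists.

```python
import math
from collections import Counter

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(router_logits):
    """Pick one expert per token and compute a load-balancing loss.

    router_logits: one list of per-expert logits for each token.
    Returns (expert index per token, auxiliary loss).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Top-1 routing: each token goes to its highest-probability expert.
    chosen = [max(range(num_experts), key=row.__getitem__) for row in probs]

    # f[e]: fraction of tokens sent to expert e.
    counts = Counter(chosen)
    f = [counts[e] / num_tokens for e in range(num_experts)]
    # p[e]: mean router probability assigned to expert e.
    p = [sum(row[e] for row in probs) / num_tokens for e in range(num_experts)]

    # Assumed Switch-Transformer-style balance term; it equals 1.0 when
    # both f and p are uniform and grows as routing collapses.
    aux_loss = num_experts * sum(fe * pe for fe, pe in zip(f, p))
    return chosen, aux_loss

ROUTER_AUX_LOSS_COEF = 1e-4  # value quoted in the list above

# Hypothetical logits: four tokens, four experts, one clear winner each.
logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.1],
    [0.1, 0.1, 2.0, 0.1],
    [0.1, 0.1, 0.1, 2.0],
]
chosen, aux = route_top1(logits)
print(chosen, ROUTER_AUX_LOSS_COEF * aux)
```

In this formulation the weighted term would be added to the language-modelling loss, so a small coefficient such as `1e-4` nudges the router toward balance without dominating the main objective.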

---

## Training Pipeline

The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):

| Stage | Method | Notes |
|---|---|---|
| 1 | **Pre-training** | Causal LM on open-domain text |
| 2 | **SFT** (Supervised Fine-Tuning) | Instruction following with packing |
| 3 | **GRPO** (Group Relative Policy Optimisation) | Reinforcement learning from a preference signal |
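
GRPO scores each sampled completion relative to the other completions drawn for the same prompt, which removes the need for a learned value function. The sketch below shows that group-relative normalisation only. The reward values, the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions; the reward functions used in stage 3 are not described in this card.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise the rewards of one prompt's sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no policy gradient.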

Training used FP16 precision throughout (P100 has no BF16 support).
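
FP16 has a much narrower dynamic range than BF16, which is why FP16 training is commonly paired with loss scaling. Whether this pipeline uses loss scaling is not stated here, so treat the following as a general illustration and not as a description of the training code. It uses only the standard library's half-precision `struct` format, and the gradient magnitude and scale factor are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8       # hypothetical gradient magnitude
loss_scale = 65536.0   # hypothetical loss scale (2**16)

# Unscaled, the value is below FP16's smallest subnormal and becomes zero.
print(to_fp16(tiny_grad))

# Scaled before the cast, it survives and can be unscaled again afterwards.
scaled = to_fp16(tiny_grad * loss_scale)
print(scaled / loss_scale)

# Values beyond the largest finite FP16 number (65504) cannot be packed.
try:
    to_fp16(70000.0)
except OverflowError as exc:
    print("overflow:", exc)
```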

---

## Usage

Because this model uses a **custom architecture** not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.

### Quick inference

```python
import torch
from transformers import AutoTokenizer

# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
#    (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)

from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

model_path = "TorchLLM/HybridMoRMoE"  # or local path

config = HybridMoRMoEConfig.from_pretrained(model_path)
model  = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
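
For reference, the `apply_chat_template` call above renders the messages into ChatML. The sketch below reproduces the basic ChatML layout by hand so the prompt format is easy to inspect. The helper name `render_chatml` is hypothetical, and the repository's `chat_template.jinja` remains authoritative, since it may add details (such as a default system message) that this sketch omits.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content dicts into a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([{"role": "user", "content": "Hello!"}])
print(prompt)
```

Comparing this output with the string returned by `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to check that the expected template was loaded.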

### Environment setup

```bash
pip install torch transformers trl datasets accelerate
```

> **HF_TOKEN**: If you need to access gated datasets during re-training, export your token:
> ```bash
> export HF_TOKEN="your_token_here"
> ```
> Never hard-code tokens in source files.
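
On the Python side, the token can then be read from the environment and never written into a script. This is a minimal sketch; the helper name `get_hf_token` is hypothetical and not part of the repository.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or None if unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None

token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; gated datasets will not be accessible.")
else:
    # Never print the token itself; only confirm that it was found.
    print("HF_TOKEN found.")
```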

---

## Repository Structure

```
TorchLLM/HybridMoRMoE/
├── config.json                  # Model architecture config
├── generation_config.json       # Default generation settings
├── model.safetensors            # Trained weights (SafeTensors format)
├── tokenizer.json               # Tokenizer vocabulary & rules
├── tokenizer_config.json        # Tokenizer metadata
├── chat_template.jinja          # ChatML chat template
└── hybrid_mor_moe_training.py   # Full training pipeline source
```
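
After cloning or downloading the repository, a quick check that every file in the tree is present can save a confusing load error later. The sketch below uses only the file names listed above. The helper name `missing_files` is hypothetical, and `"."` is a placeholder for the path of your local copy.

```python
from pathlib import Path

# File names taken from the tree above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "hybrid_mor_moe_training.py",
]

def missing_files(repo_dir):
    """Return the expected files that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

# "." is a placeholder; point this at your local clone.
print(missing_files("."))
```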

---

## Citation

If you use this model or training code in your research, please cite:

```bibtex
@misc{hybridmormoe2025,
  title  = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
  author = {Abhishek Gandhi},
  year   = {2026},
  url    = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}
```

---

## License

Apache 2.0. See [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.