HybridMoRMoE — Hybrid Mixture-of-Recursions & Mixture-of-Experts

A custom causal language model combining Mixture-of-Recursions (MoR) with Mixture-of-Experts (MoE) routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).

Architecture

Hyperparameter	Value
Model type	`hybrid_mor_moe`
Hidden dim (`d_model`)	576
Feed-forward dim (`d_ff`)	1536
Attention heads	8
Base layers	6
Shared recursive blocks	6
Unique last layers	2
Total transformer depth	30
Number of experts	4
Experts per token	1
Max recursions	3
Router percentile	0.70
Sequence length	4096
Vocabulary size	151,665
Tokenizer	Qwen2Tokenizer (Qwen2.5 compatible)
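
Two quantities follow directly from the table: the per-head dimension and the size of the token embedding matrix. The short sketch below works them out. The variable names are illustrative rather than taken from `config.json`, and it assumes the attention heads split `d_model` evenly, as in standard multi-head attention.

```python
# Values copied from the hyperparameter table above.
d_model = 576
n_heads = 8
vocab_size = 151_665

# Assumption: heads split d_model evenly (standard multi-head attention).
head_dim, remainder = divmod(d_model, n_heads)
assert remainder == 0, "d_model must be divisible by the number of heads"

# One d_model-sized vector per vocabulary entry.
embedding_params = vocab_size * d_model

print(head_dim)          # 72
print(embedding_params)  # 87359040
```

The embedding matrix alone therefore holds roughly 87M parameters. Whether the output projection shares these weights is not stated here.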

Key design choices:

Shared-weight blocks are applied recursively, gated by a learned complexity score
A per-token MoE router selects which expert processes each position
An auxiliary routing loss (`router_aux_loss_coef = 1e-4`) encourages a balanced load across experts
The chat template follows the ChatML format (`<|im_start|>` / `<|im_end|>`)
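
The routing ideas in this list can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `hybrid_mor_moe_training.py`: the function names are hypothetical, the balance term uses the Switch Transformer form (number of experts times the sum over experts of token fraction times mean router probability), and the router percentile is assumed to mean that only tokens scoring at or above that percentile receive another recursion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits):
    """Pick one expert per token and compute a Switch-style balance loss.

    router_logits: one list per token, holding one logit per expert.
    """
    num_experts = len(router_logits[0])
    num_tokens = len(router_logits)
    probs = [softmax(row) for row in router_logits]
    # Top-1 routing: each token goes to its highest-probability expert.
    choices = [max(range(num_experts), key=row.__getitem__) for row in probs]

    # f[e]: fraction of tokens sent to expert e.
    f = [choices.count(e) / num_tokens for e in range(num_experts)]
    # p[e]: mean router probability assigned to expert e.
    p = [sum(row[e] for row in probs) / num_tokens for e in range(num_experts)]
    aux_loss = num_experts * sum(fe * pe for fe, pe in zip(f, p))
    return choices, aux_loss


def select_for_recursion(scores, percentile=0.70):
    """Mask of tokens whose complexity score reaches the given percentile.

    Assumption: tokens at or above the cutoff get another recursion step.
    """
    ranked = sorted(scores)
    k = min(round(percentile * len(ranked)), len(ranked) - 1)
    cutoff = ranked[k]
    return [s >= cutoff for s in scores]


# Perfectly balanced routing gives the minimum loss of 1.0;
# sending every token to one expert gives a larger value.
balanced = [[2.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
collapsed = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]
_, aux_balanced = top1_route(balanced)
_, aux_collapsed = top1_route(collapsed)
```

With the coefficient from the list above, the balance term would enter training as `loss = lm_loss + 1e-4 * aux_loss`, so it nudges the router without dominating the language-modelling objective.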

Training Pipeline

The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):

Stage	Method	Notes
1	Pre-training	Causal LM on open-domain text
2	SFT (Supervised Fine-Tuning)	Instruction following with packing
3	GRPO (Group Relative Policy Optimisation)	Reinforcement learning from a preference signal
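
For readers new to GRPO, the defining step is that each sampled completion is scored relative to the other completions for the same prompt, rather than by a learned value model. The sketch below shows that normalisation only. The helper name, the use of the population standard deviation, and the example rewards are assumptions for illustration, not details of this model's training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise the rewards of one prompt's sampled completions.

    Each advantage is the reward's distance from the group mean,
    divided by the group's standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for the same prompt, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Completions that beat the group average get a positive advantage and are reinforced; the rest are pushed down.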

Training used FP16 precision throughout, because the P100 has no BF16 support.
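
FP16 has a much narrower exponent range than BF16, so small gradients can underflow to zero. The standard-library sketch below illustrates the effect and the usual remedy, loss scaling. Whether this training run used loss scaling is not stated, and the gradient value and scale factor are chosen purely for illustration. In PyTorch, `torch.cuda.amp.GradScaler` automates this kind of scaling.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8           # a small gradient value, chosen for illustration
loss_scale = 65536.0  # 2**16, a hypothetical scale factor

# Stored directly in FP16, the value is below the smallest subnormal and is lost.
unscaled_fp16 = to_fp16(grad)

# Scaling first keeps it in the representable range; unscale in higher precision.
scaled_fp16 = to_fp16(grad * loss_scale)
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16)  # 0.0
print(recovered)      # close to 1e-8
```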

Usage

Because this model uses a custom architecture not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.

Quick inference

import torch
from transformers import AutoTokenizer

# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
#    (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)

from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

model_path = "TorchLLM/HybridMoRMoE"  # or local path

config = HybridMoRMoEConfig.from_pretrained(model_path)
model  = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
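
To see what `apply_chat_template` is expected to produce, the generic ChatML layout can be reproduced by hand. The helper below is hypothetical and not part of this repository. The shipped `chat_template.jinja` is authoritative and may differ, for example by adding a default system message.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the generic ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([{"role": "user", "content": "Hi"}])
print(prompt)
```

Printing the `text` variable from the inference snippet and comparing it with this layout is a quick way to confirm that the tokenizer picked up the intended template.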

Environment setup

pip install torch transformers trl datasets accelerate

HF_TOKEN: If you need to access gated datasets during re-training, export your token:
export HF_TOKEN="your_token_here"
Never hard-code tokens in source files.
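
Inside a script, the token can be read from the environment rather than written into the source. The helper name below is hypothetical. Recent versions of `huggingface_hub` also read `HF_TOKEN` on their own, so passing it explicitly is often unnecessary.

```python
import os


def get_hf_token():
    """Return the Hugging Face token from the environment, or None if unset."""
    token = os.environ.get("HF_TOKEN", "").strip()
    return token or None


if get_hf_token() is None:
    print("HF_TOKEN is not set; gated datasets will not be accessible.")
```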

Repository Structure

TorchLLM/HybridMoRMoE/
├── config.json                  # Model architecture config
├── generation_config.json       # Default generation settings
├── model.safetensors            # Trained weights (SafeTensors format)
├── tokenizer.json               # Tokenizer vocabulary & rules
├── tokenizer_config.json        # Tokenizer metadata
├── chat_template.jinja          # ChatML chat template
└── hybrid_mor_moe_training.py   # Full training pipeline source

Citation

If you use this model or training code in your research, please cite:

@misc{hybridmormoe2025,
  title  = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
  author = {Abhishek Gandhi},
  year   = {2026},
  url    = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}

License

Apache 2.0 — see LICENSE for details.

Model size: 0.3B params (Safetensors, tensor type F32)
