HybridMoRMoE: Hybrid Mixture-of-Recursions & Mixture-of-Experts

A custom causal language model combining Mixture-of-Recursions (MoR) with Mixture-of-Experts (MoE) routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).


Architecture

| Hyperparameter | Value |
| --- | --- |
| Model type | hybrid_mor_moe |
| Hidden dim (d_model) | 576 |
| Feed-forward dim (d_ff) | 1536 |
| Attention heads | 8 |
| Base layers | 6 |
| Shared recursive blocks | 6 |
| Unique last layers | 2 |
| Total transformer depth | 30 |
| Number of experts | 4 |
| Experts per token | 1 |
| Max recursions | 3 |
| Router percentile | 0.70 |
| Sequence length | 4096 |
| Vocabulary size | 151,665 |
| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |
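
One possible reading of the Max recursions and Router percentile rows is sketched below in plain Python (no PyTorch, so it runs without dependencies). It assumes that after each pass through the shared blocks, only tokens whose complexity score is at or above the 70th percentile of the still-active tokens continue to the next pass. The function names, the per-pass scores, and that selection rule are illustrative assumptions, not the logic in hybrid_mor_moe_training.py.

```python
import math

def percentile_threshold(values, q):
    """Return the q-quantile (0 < q <= 1) of values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def recursion_depths(scores_per_pass, max_recursions=3, router_percentile=0.70):
    """Assign each token a recursion depth between 1 and max_recursions.

    scores_per_pass[r][t] is the (hypothetical) complexity score of token t
    after pass r + 1. Every token gets at least one pass; a token continues
    only while its score is at or above the percentile threshold computed
    over the tokens that are still active.
    """
    num_tokens = len(scores_per_pass[0])
    depths = [1] * num_tokens
    active = list(range(num_tokens))
    for r in range(max_recursions - 1):
        if not active:
            break
        scores = scores_per_pass[r]
        threshold = percentile_threshold([scores[t] for t in active], router_percentile)
        active = [t for t in active if scores[t] >= threshold]
        for t in active:
            depths[t] += 1
    return depths

# Made-up scores for 8 tokens after pass 1 and pass 2.
scores = [
    [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6],
    [0.0, 0.5, 0.0, 0.9, 0.0, 0.4, 0.0, 0.0],
]
depths = recursion_depths(scores)
print(depths)
```

Under these assumptions most tokens exit after a single pass, and only the highest-scoring ones receive the full three passes through the shared blocks.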

Key design choices:

  • Shared weight blocks are recursively applied based on a learned complexity score
  • A per-token MoE router selects which expert processes each position
  • Auxiliary routing loss (router_aux_loss_coef = 1e-4) encourages load balance
  • Chat template follows the ChatML (<|im_start|> / <|im_end|>) format
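
The per-token routing and load-balancing choices above can be illustrated with a plain-Python sketch. It assumes a Switch-Transformer-style auxiliary loss: the number of experts times the sum, over experts, of the fraction of tokens routed to an expert multiplied by the mean router probability for that expert. The source only states the coefficient (1e-4), so the exact loss form, the function names, and the example logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(router_logits):
    """Pick one expert per token and compute a load-balancing loss.

    router_logits: list of per-token lists, one logit per expert.
    Returns (expert_ids, aux_loss).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    expert_ids = [max(range(num_experts), key=lambda e: p[e]) for p in probs]

    # f[e]: fraction of tokens dispatched to expert e
    f = [expert_ids.count(e) / num_tokens for e in range(num_experts)]
    # P[e]: mean router probability assigned to expert e
    P = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    aux_loss = num_experts * sum(fe * pe for fe, pe in zip(f, P))
    return expert_ids, aux_loss

ROUTER_AUX_LOSS_COEF = 1e-4  # value stated in the design choices

# Four tokens, four experts: one evenly spread case, one collapsed case.
balanced = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0],
            [0.0, 0.0, 4.0, 0.0], [0.0, 0.0, 0.0, 4.0]]
collapsed = [[4.0, 0.0, 0.0, 0.0]] * 4

ids_bal, loss_bal = route_top1(balanced)
ids_col, loss_col = route_top1(collapsed)
print(ids_bal, loss_bal)
print(ids_col, loss_col)
```

In this formulation the loss is lowest when tokens are spread evenly across experts and grows when the router collapses onto one expert. The weighted term `ROUTER_AUX_LOSS_COEF * aux_loss` would be added to the language-modelling loss.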

Training Pipeline

The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):

| Stage | Method | Notes |
| --- | --- | --- |
| 1 | Pre-training | Causal LM on open-domain text |
| 2 | SFT (Supervised Fine-Tuning) | Instruction following with packing |
| 3 | GRPO (Group Relative Policy Optimisation) | Reinforcement learning from preference signal |
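
For stage 3, the core idea of GRPO is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value network is needed. A minimal plain-Python sketch of that normalisation follows. The reward values are made up, the helper name `group_relative_advantages` and the `eps` constant are illustrative, and the population standard deviation is used here although implementations differ on that detail.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalise rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for the same prompt, with hypothetical rewards.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group mean receive a positive advantage and are reinforced; those below it are pushed down. If every completion in a group earns the same reward, all advantages are zero and that group contributes no policy-gradient signal.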

Training used FP16 precision throughout (P100 has no BF16 support).
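
FP16 has a much narrower dynamic range than BF16 or FP32, which is why FP16 training is commonly paired with loss scaling. The source does not say whether this pipeline uses it, so the following is only a sketch of the underlying effect, using the half-precision format of Python's `struct` module instead of PyTorch. The gradient value and the scale factor are arbitrary.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8        # below FP16's smallest subnormal (about 6e-8)
loss_scale = 65536.0    # 2**16, an arbitrary illustrative scale

unscaled = to_fp16(tiny_grad)               # underflows to zero in FP16
scaled = to_fp16(tiny_grad * loss_scale)    # representable in FP16
recovered = scaled / loss_scale             # unscale in higher precision

print(unscaled, recovered)
```

Scaling the loss before the backward pass shifts small gradients into FP16's representable range; dividing by the same factor before the optimizer step restores their magnitude.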


Usage

Because this model uses a custom architecture not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.

Quick inference

import torch
from transformers import AutoTokenizer

# 1. Clone / download this repo
# 2. Make sure hybrid_mor_moe_training.py is on your Python path
#    (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)

from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

model_path = "TorchLLM/HybridMoRMoE"  # or local path

config = HybridMoRMoEConfig.from_pretrained(model_path)
model  = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

messages = [
    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.simple_generate(
        inputs["input_ids"],
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
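
The `temperature` and `top_p` arguments above control how the next token is sampled. `simple_generate` is a custom method whose internals are not shown here, so the following plain-Python sketch only illustrates the general technique (temperature scaling followed by nucleus filtering). The function names and the toy logits are hypothetical.

```python
import math
import random

def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_id: prob} for the smallest set of tokens whose cumulative
    probability reaches top_p after temperature scaling, renormalised to sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the nucleus is covered.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

def sample_next_token(logits, rng, temperature=0.7, top_p=0.9):
    """Draw one token id from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.0, -3.0]
dist = top_p_filter(toy_logits)
next_id = sample_next_token(toy_logits, random.Random(0))
print(dist, next_id)
```

Lower temperatures sharpen the distribution before filtering, and a smaller `top_p` keeps fewer candidate tokens, so both settings trade diversity for determinism.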

Environment setup

pip install torch transformers trl datasets accelerate

HF_TOKEN: If you need to access gated datasets during re-training, export your token:

export HF_TOKEN="your_token_here"

Never hard-code tokens in source files.
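
On the Python side, the token can then be read from the environment at run time. A small sketch follows; the helper name `get_hf_token` is hypothetical, and how the training script actually consumes the token is not shown in this README.

```python
import os

def get_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment, or None if unset or empty."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None

token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; gated datasets will not be accessible.")
```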


Repository Structure

TorchLLM/HybridMoRMoE/
├── config.json                  # Model architecture config
├── generation_config.json       # Default generation settings
├── model.safetensors            # Trained weights (SafeTensors format)
├── tokenizer.json               # Tokenizer vocabulary & rules
├── tokenizer_config.json        # Tokenizer metadata
├── chat_template.jinja          # ChatML chat template
└── hybrid_mor_moe_training.py   # Full training pipeline source

Citation

If you use this model or training code in your research, please cite:

@misc{hybridmormoe2025,
  title  = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
  author = {Abhishek Gandhi},
  year   = {2026},
  url    = {https://huggingface.co/TorchLLM/HybridMoRMoE}
}

License

Apache 2.0; see LICENSE for details.
