	---
	license: apache-2.0
	language:
	- en
	tags:
	- mixture-of-experts
	- mixture-of-recursions
	- causal-lm
	- custom-architecture
	- pytorch
	base_model: Qwen/Qwen2.5-0.5B-Instruct
	pipeline_tag: text-generation
	---

	# HybridMoRMoE — Hybrid Mixture-of-Recursions & Mixture-of-Experts

	A custom causal language model combining Mixture-of-Recursions (MoR) with Mixture-of-Experts (MoE) routing, built from scratch in PyTorch and trained via a three-stage pipeline (pre-training → SFT → GRPO).

	---

	## Architecture

	| Hyperparameter | Value |
	|---|---|
	| Model type | `hybrid_mor_moe` |
	| Hidden dim (`d_model`) | 576 |
	| Feed-forward dim (`d_ff`) | 1536 |
	| Attention heads | 8 |
	| Base layers | 6 |
	| Shared recursive blocks | 6 |
	| Unique last layers | 2 |
	| Total transformer depth | 30 |
	| Number of experts | 4 |
	| Experts per token | 1 |
	| Max recursions | 3 |
	| Router percentile | 0.70 |
	| Sequence length | 4096 |
	| Vocabulary size | 151,665 |
	| Tokenizer | Qwen2Tokenizer (Qwen2.5 compatible) |

	Key design choices:
	- Shared-weight blocks are applied recursively, driven by a learned complexity score
	- A per-token MoE router selects which expert processes each position
	- Auxiliary routing loss (`router_aux_loss_coef = 1e-4`) encourages load balance
	- Chat template follows the ChatML (`<|im_start|>` / `<|im_end|>`) format
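
The routing and load-balancing bullets above can be illustrated with a small pure-Python sketch. It assumes a Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of token fraction times mean router probability); the source does not state which formulation `hybrid_mor_moe_training.py` uses, and the function names here are hypothetical.

```python
import math

NUM_EXPERTS = 4              # "Number of experts" from the table
ROUTER_AUX_LOSS_COEF = 1e-4  # value quoted in the design notes


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(router_logits):
    """Pick one expert per token (experts per token = 1)."""
    probs = [softmax(row) for row in router_logits]
    choices = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, choices


def load_balance_loss(probs, choices, num_experts):
    """Assumed Switch-style loss: num_experts * sum_i(f_i * P_i).

    f_i is the fraction of tokens routed to expert i and P_i is the
    mean router probability assigned to expert i.
    """
    n = len(choices)
    frac = [choices.count(i) / n for i in range(num_experts)]
    mean_prob = [sum(p[i] for p in probs) / n for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Four tokens, each preferring a different expert (made-up logits).
logits = [[2.0 if j == i else 0.0 for j in range(NUM_EXPERTS)]
          for i in range(NUM_EXPERTS)]
probs, choices = top1_route(logits)
aux_term = ROUTER_AUX_LOSS_COEF * load_balance_loss(probs, choices, NUM_EXPERTS)
```

Under this formulation the loss is smallest when tokens are spread evenly across experts and grows as routing collapses onto one expert, which is why a small coefficient is enough to discourage collapse without dominating the language-modelling loss.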

	---

	## Training Pipeline

	The model was trained in three sequential stages on a single NVIDIA P100 (16 GB HBM2):

	| Stage | Method | Notes |
	|---|---|---|
	| 1 | Pre-training | Causal LM on open-domain text |
	| 2 | SFT (Supervised Fine-Tuning) | Instruction following with packing |
	| 3 | GRPO (Group Relative Policy Optimisation) | Reinforcement learning from a preference signal |
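
GRPO samples several completions per prompt and scores each one relative to the rest of its group, which avoids training a separate value model. A minimal sketch of that normalisation step follows; the reward values are made up, the use of the population standard deviation is an assumption, and the function name is hypothetical rather than taken from the training script.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, illustrative rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group average receive a positive advantage and the rest a negative one; a group with identical rewards contributes no learning signal.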

	Training used FP16 precision throughout (P100 has no BF16 support).
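
FP16 has a much narrower dynamic range than BF16, so very small gradients can round to zero. The sketch below uses only the standard library to show the effect and the usual remedy, loss scaling. Whether this project used loss scaling is not stated in the source, and the gradient magnitude and scale factor are illustrative.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8        # illustrative gradient magnitude, not a measured value
loss_scale = 2.0 ** 16  # illustrative scale factor

underflowed = to_fp16(tiny_grad)          # below the smallest FP16 subnormal
scaled = to_fp16(tiny_grad * loss_scale)  # representable once scaled up
recovered = scaled / loss_scale           # unscale in higher precision
```

Scaling the loss before the backward pass and dividing the gradients afterwards keeps small values inside the representable range at the cost of a little rounding error.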

	---

	## Usage

	Because this model uses a custom architecture not registered in the Hugging Face Transformers library by default, you must load the modelling code alongside the weights.

	### Quick inference

	```python
	import torch
	from transformers import AutoTokenizer

	# 1. Clone / download this repo
	# 2. Make sure hybrid_mor_moe_training.py is on your Python path
	# (it registers HybridMoRMoEForCausalLM & HybridMoRMoEConfig with AutoModel)

	from hybrid_mor_moe_training import HybridMoRMoEConfig, HybridMoRMoEForCausalLM

	model_path = "TorchLLM/HybridMoRMoE" # or local path

	config = HybridMoRMoEConfig.from_pretrained(model_path)
	model = HybridMoRMoEForCausalLM.from_pretrained(model_path, config=config)
	tokenizer = AutoTokenizer.from_pretrained(model_path)

	model.eval()
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model.to(device)

	messages = [
	    {"role": "user", "content": "Explain the difference between MoE and dense transformers."}
	]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(text, return_tensors="pt").to(device)

	with torch.no_grad():
	    out = model.simple_generate(
	        inputs["input_ids"],
	        max_new_tokens=256,
	        temperature=0.7,
	        top_p=0.9,
	        eos_token_id=tokenizer.eos_token_id,
	    )

	print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```
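
For reference, the string that `apply_chat_template` builds in the example above follows the ChatML layout. The function below is a rough stand-in written for illustration only; the repository's `chat_template.jinja` is authoritative and may differ, for example by adding a default system message.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "Hello"}])
```

Printing `prompt` next to the tokenizer's own output is a quick way to confirm that the template in use matches what the model saw during SFT.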

	### Environment setup

	```bash
	pip install torch transformers trl datasets accelerate
	```

	> HF_TOKEN: If you need to access gated datasets during re-training, export your token:
	> ```bash
	> export HF_TOKEN="your_token_here"
	> ```
	> Never hard-code tokens in source files.
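
On the Python side, the token can be read from the environment instead of being written into a script. The helper below is a hypothetical convenience for failing early when the variable is missing; the Hugging Face libraries also read `HF_TOKEN` from the environment on their own.

```python
import os


def get_hf_token(env=os.environ):
    """Return the token from HF_TOKEN, or None when it is unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; gated datasets will not be accessible.")
```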

	---

	## Repository Structure

	```
	TorchLLM/HybridMoRMoE/
	├── config.json # Model architecture config
	├── generation_config.json # Default generation settings
	├── model.safetensors # Trained weights (SafeTensors format)
	├── tokenizer.json # Tokenizer vocabulary & rules
	├── tokenizer_config.json # Tokenizer metadata
	├── chat_template.jinja # ChatML chat template
	└── hybrid_mor_moe_training.py # Full training pipeline source
	```
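
After cloning or downloading, a quick check that every file in the listing above is present can save a confusing load error later. This is a small sketch; `missing_files` is a hypothetical helper and the directory passed to it is whatever local path you downloaded to.

```python
from pathlib import Path

# File names copied from the repository listing above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "hybrid_mor_moe_training.py",
]


def missing_files(repo_dir):
    """Return the expected files that are absent from a local copy."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("path/to/local/copy")` means the local copy matches the listing.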

	---

	## Citation

	If you use this model or training code in your research, please cite:

	```bibtex
	@misc{hybridmormoe2025,
	  title = {HybridMoRMoE: Combining Mixture-of-Recursions and Mixture-of-Experts for Efficient Causal LM},
	  author = {Abhishek Gandhi},
	  year = {2026},
	  url = {https://huggingface.co/TorchLLM/HybridMoRMoE}
	}
	```

	---

	## License

	Apache 2.0 — see [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.