TRM-textv3.5-MoE
TRM-textv3.5-MoE is a sparse Mixture-of-Experts (MoE) extension of TRM-textv3.5, designed to improve parameter efficiency and domain specialization while preserving the original recursive Transformer architecture.
This model replaces the dense SwiGLU feed-forward network with a Top-K Sparse MoE layer, allowing multiple specialized experts to emerge during continued pretraining and instruction tuning.
Overview
- Base Model: summerMC/TRM-textv3.5
- Architecture: Recursive Transformer + Sparse MoE
- License: Same as the original TRM-textv3.5
- Framework: Hugging Face Transformers
- Remote Code: Required (trust_remote_code=True)
Key Features
- Recursive Transformer architecture
- Shared recurrent block reused across passes
- Sparse Top-K Mixture-of-Experts routing
- Auxiliary load-balancing loss
- Router z-loss stabilization
- Hugging Face compatible
- SafeTensors support
- Colab-friendly training scripts
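The two router regularizers listed above can be sketched in plain Python. This is a minimal illustration that assumes a Switch Transformer style load-balancing loss and an ST-MoE style z-loss; the model card does not say which exact formulations the remote code uses, and the function names here are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(values, k):
    # Indices of the k largest values (ties resolved by lowest index).
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def load_balancing_loss(router_logits, top_k):
    """Assumed form: num_experts * sum_i(dispatch_fraction_i * mean_prob_i).

    router_logits is a list with one list of expert logits per token.
    The value is 1.0 when routing is perfectly balanced and grows as
    traffic concentrates on a few experts.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    dispatch = [0.0] * n_experts   # fraction of routing slots per expert
    mean_prob = [0.0] * n_experts  # mean router probability per expert
    for logits in router_logits:
        probs = softmax(logits)
        for i in top_k_indices(probs, top_k):
            dispatch[i] += 1.0 / (n_tokens * top_k)
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(dispatch, mean_prob))

def router_z_loss(router_logits):
    """Assumed form: mean over tokens of logsumexp(logits) squared."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(router_logits)
```

In training, both terms would typically be scaled by small coefficients and added to the language-modeling loss, so that they nudge the router without dominating the main objective.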
Architecture
Original TRM-textv3.5
Token Embedding
↓
Recursive Block × recurrence_steps
  ├─ RMSNorm
  ├─ Attention + RoPE
  ├─ SwiGLU MLP
  └─ Residual Gates
↓
RMSNorm
↓
LM Head
TRM-textv3.5-MoE
Token Embedding
↓
Recursive Block × recurrence_steps
  ├─ RMSNorm
  ├─ Attention + RoPE
  ├─ Sparse MoE
  │   ├─ Router
  │   ├─ Expert 0
  │   ├─ Expert 1
  │   ├─ ...
  │   └─ Expert 31
  └─ Residual Gates
↓
RMSNorm
↓
LM Head
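The Sparse MoE step in the diagram can be illustrated for a single token as follows. This is a sketch, not the model's actual remote code: it assumes a bias-free linear router and Mixtral-style gating, where the softmax is taken over the selected top-k logits only. `moe_forward` and its arguments are hypothetical names.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """Route one hidden vector through the top_k highest-scoring experts.

    token:          list of floats (the hidden state)
    router_weights: one weight row per expert (linear router, no bias)
    experts:        list of callables mapping a vector to a vector
    """
    # Linear router: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep the top_k experts with the largest logits.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: gate weights are renormalized over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for g, i in zip(gates, chosen):
        y = experts[i](token)
        out = [o + g * v for o, v in zip(out, y)]
    return out, chosen

# Tiny example: 4 experts that each scale the input by a different factor.
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
toy_experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
output, picked = moe_forward([1.0, 2.0], router, toy_experts, top_k=2)
print(picked)
```

Because the recursive block is shared, the same router and experts would be applied on every recurrence step, and a token may be routed to different experts on different steps.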
Configuration
| Parameter | Value |
|---|---|
| Hidden Size | 768 |
| Attention Heads | 12 |
| Head Dimension | 64 |
| Vocabulary Size | 50,259 |
| Recurrence Steps | 4 |
| Max Sequence Length | 512 |
| Experts | 32 |
| Top-K Routing | 2 |
| Router Type | Linear |
| Auxiliary Loss Coefficient | 0.01 |
| Router Z-Loss Coefficient | 0.001 |
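For quick sanity checks, the table can be mirrored in a small config object. The class and field names below are hypothetical and are not the attribute names used by the model's actual config; only the values come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoEConfigSketch:
    # Values copied from the configuration table; names are illustrative.
    hidden_size: int = 768
    num_heads: int = 12
    head_dim: int = 64
    vocab_size: int = 50259
    recurrence_steps: int = 4
    max_seq_len: int = 512
    num_experts: int = 32
    top_k: int = 2
    aux_loss_coef: float = 0.01
    router_z_loss_coef: float = 0.001

    def validate(self):
        # Heads must tile the hidden dimension exactly.
        if self.num_heads * self.head_dim != self.hidden_size:
            raise ValueError("num_heads * head_dim must equal hidden_size")
        if not 1 <= self.top_k <= self.num_experts:
            raise ValueError("top_k must be between 1 and num_experts")
        return True

    def router_weight_count(self):
        # A bias-free linear router maps hidden_size -> num_experts.
        return self.hidden_size * self.num_experts

    def embedding_weight_count(self):
        return self.vocab_size * self.hidden_size

cfg = MoEConfigSketch()
cfg.validate()
print(cfg.router_weight_count())     # 24576
print(cfg.embedding_weight_count())  # 38598912
```

The router is tiny compared with the embedding table, which is why the separate router learning rate recommended for continued pretraining affects only a small number of weights.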
Parameter Statistics
Approximate parameter counts:
- Total Parameters: ~230.6M
- Active Experts per Token: 2
- Active Parameters per Token: ~70–80M
- Shared Recursive Core: Preserved from TRM-textv3.5
Because each token is routed to only 2 of the 32 experts, total capacity grows without activating every parameter for every token.
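As a back-of-the-envelope check, the figures above imply a rough per-expert size. The sketch assumes that the experts live in a single shared MoE layer and that the active count is simply the total minus the experts a token does not use; neither assumption is confirmed by the model card, so treat the result as an estimate only.

```python
def implied_expert_size(total_params, active_params, num_experts, top_k):
    """Per-expert size implied by: active = total - (num_experts - top_k) * expert."""
    return (total_params - active_params) / (num_experts - top_k)

total = 230.6e6
low = implied_expert_size(total, 80e6, num_experts=32, top_k=2)
high = implied_expert_size(total, 70e6, num_experts=32, top_k=2)

print(f"{low / 1e6:.2f}M to {high / 1e6:.2f}M per expert")  # 5.02M to 5.35M per expert
print(f"experts active per token: {2 / 32:.2%}")            # experts active per token: 6.25%
```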
Weight Initialization
Dense TRM weights are converted into MoE experts as follows:
Expert 0:
- Exact copy of the original SwiGLU MLP.
Experts 1–31:
- Expert 0 weights plus small Gaussian perturbations.
Router:
- Randomly initialized using Kaiming Uniform initialization.
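This initialization is meant to keep the converted model's behavior close to the original, since every expert starts as a copy or near-copy of the original MLP, while leaving room for gradual expert specialization during training.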
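The conversion scheme can be sketched on flat weight lists. This is an illustration under stated assumptions, not the actual conversion script: the noise scale `noise_std=1e-3` is a placeholder for "small", and the default `a = sqrt(5)` mirrors what PyTorch's `nn.Linear` uses for its Kaiming uniform initialization, which may differ from what the author used.

```python
import math
import random

def init_experts(dense_weights, num_experts=32, noise_std=1e-3, seed=0):
    """Expert 0 is an exact copy; the others add small Gaussian noise."""
    rng = random.Random(seed)
    experts = [list(dense_weights)]
    for _ in range(1, num_experts):
        experts.append([w + rng.gauss(0.0, noise_std) for w in dense_weights])
    return experts

def kaiming_uniform_router(hidden_size, num_experts, a=math.sqrt(5), seed=0):
    """Sample router weights from U(-bound, bound) with the Kaiming bound.

    bound = gain * sqrt(3 / fan_in), gain = sqrt(2 / (1 + a^2)).
    With a = sqrt(5) this reduces to bound = 1 / sqrt(fan_in).
    """
    rng = random.Random(seed)
    gain = math.sqrt(2.0 / (1.0 + a * a))
    bound = gain * math.sqrt(3.0 / hidden_size)
    return [
        [rng.uniform(-bound, bound) for _ in range(hidden_size)]
        for _ in range(num_experts)
    ]

experts = init_experts([0.1, -0.2, 0.3], num_experts=4)
router = kaiming_uniform_router(hidden_size=768, num_experts=32)
```

The perturbation breaks the symmetry between experts: if all 32 were exact copies, they would produce identical outputs for any input, which tends to slow the emergence of distinct experts.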
Loading the Model
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/TRM-text-MoEV1"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

prompt = "Explain Mixture of Experts."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Continued Pretraining
Recommended settings:
learning_rate: 5e-5
router_learning_rate: 1e-4
weight_decay: 0.1
micro_batch_size: 1
gradient_accumulation: 32
sequence_length: 512
warmup_steps: 100
optimizer: AdamW
betas: [0.9, 0.95]
gradient_clipping: 1.0
mixed_precision: bf16
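The settings above translate into an effective batch size and a warmup ramp that can be computed directly. The sketch assumes a linear warmup followed by a constant rate, and that the router learning rate follows the same ramp; the recommended settings do not specify a decay schedule, so none is modeled here.

```python
def tokens_per_optimizer_step(micro_batch_size, gradient_accumulation, sequence_length):
    """Tokens consumed before each weight update (single device assumed)."""
    return micro_batch_size * gradient_accumulation * sequence_length

def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup to base_lr, then constant (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

print(tokens_per_optimizer_step(1, 32, 512))  # 16384

for step in (0, 49, 99, 500):
    main_lr = warmup_lr(step, base_lr=5e-5, warmup_steps=100)
    router_lr = warmup_lr(step, base_lr=1e-4, warmup_steps=100)
    print(step, main_lr, router_lr)
```

With a micro-batch of 1, gradient accumulation is what gives each update 32 full sequences to average over, which matters for the load-balancing statistics as well as for the language-modeling gradient.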
Datasets suitable for continued pretraining:
- FineWeb
- FineWeb-Edu
- Japanese web corpora
- Domain-specific datasets
- Synthetic reasoning datasets
Fine-Tuning
The model supports:
- Supervised Fine-Tuning (SFT)
- DPO
- LoRA
- QLoRA
- Knowledge Distillation
- Multi-stage instruction tuning
Expert specialization often emerges during these stages.
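One simple way to check whether specialization has emerged is to compare expert usage across domains. The helpers below are a sketch with hypothetical names; how the router's per-token expert choices are extracted depends on the model's remote code and is not shown here.

```python
import math
from collections import Counter

def expert_usage(assignments, num_experts):
    """Fraction of routing slots that went to each expert.

    assignments: iterable of expert indices chosen by the router.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(num_experts)]

def normalized_entropy(usage):
    """1.0 means perfectly uniform usage; 0.0 means one expert takes everything."""
    h = -sum(p * math.log(p) for p in usage if p > 0)
    return h / math.log(len(usage))

# Toy example with 4 experts and made-up routing decisions for two domains.
code_usage = expert_usage([0, 0, 0, 1, 0, 0, 1, 0], num_experts=4)
prose_usage = expert_usage([2, 3, 2, 2, 3, 3, 2, 3], num_experts=4)
print(code_usage, normalized_entropy(code_usage))
print(prose_usage, normalized_entropy(prose_usage))
```

If usage distributions differ clearly between domains while overall usage stays reasonably balanced, that suggests experts are specializing rather than collapsing onto a few favorites.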
Intended Use
TRM-textv3.5-MoE is intended for:
- Research on sparse recursive language models
- Efficient scaling of small LLMs
- Expert specialization experiments
- Instruction tuning research
- Synthetic data generation pipelines
- Educational and experimental applications
Limitations
- Sequence length remains limited to 512 tokens.
- Expert specialization is not guaranteed immediately after conversion.
- Additional pretraining is required to fully utilize MoE capacity.
- Recursive architectures may behave differently from conventional decoder-only Transformers.
Citation
@misc{trm_textv35_moe,
  title={TRM-textv3.5-MoE: Sparse Recursive Transformer with Mixture of Experts},
  author={summerMC},
  year={2026},
  howpublished={Hugging Face}
}
Acknowledgements
TRM-textv3.5-MoE builds upon the original TRM-textv3.5 architecture and incorporates ideas inspired by sparse expert systems such as Mixtral, Switch Transformer, and ST-MoE, adapted to a recursive Transformer framework.
TRM-textv3.5-MoE explores whether small recursive language models can achieve substantially higher effective capacity through sparse activation while maintaining efficient inference characteristics.