TRM-textv3.5-MoE

TRM-textv3.5-MoE is a sparse Mixture-of-Experts (MoE) extension of TRM-textv3.5, designed to improve parameter efficiency and domain specialization while preserving the original recursive Transformer architecture.

This model replaces the dense SwiGLU feed-forward network with a Top-K Sparse MoE layer, allowing multiple specialized experts to emerge during continued pretraining and instruction tuning.

Overview

  • Base Model: summerMC/TRM-textv3.5
  • Architecture: Recursive Transformer + Sparse MoE
  • License: Same as the original TRM-textv3.5
  • Framework: Hugging Face Transformers
  • Remote Code: Required (trust_remote_code=True)

Key Features

  • Recursive Transformer architecture
  • Shared recurrent block reused across passes
  • Sparse Top-K Mixture-of-Experts routing
  • Auxiliary load-balancing loss
  • Router z-loss stabilization
  • Hugging Face compatible
  • SafeTensors support
  • Colab-friendly training scripts

Architecture

Original TRM-textv3.5

Token Embedding
    ↓
Recursive Block × recurrence_steps
    ├─ RMSNorm
    ├─ Attention + RoPE
    ├─ SwiGLU MLP
    └─ Residual Gates
    ↓
RMSNorm
    ↓
LM Head
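
The recursion in the diagram can be read as a single block whose weights are reused on every pass. A toy sketch in plain Python, assuming a scalar residual gate and a stand-in `block_fn` (both are hypothetical; the real block contains RMSNorm, attention, and the MLP, and its gating may differ):

```python
from typing import Callable, List

Vector = List[float]


def run_recursive_core(
    hidden: Vector,
    block_fn: Callable[[Vector], Vector],
    recurrence_steps: int = 4,
    gate: float = 0.5,  # hypothetical scalar gate on the residual update
) -> Vector:
    """Apply the same block `recurrence_steps` times with a gated residual."""
    for _ in range(recurrence_steps):
        update = block_fn(hidden)  # same function, hence same weights, each pass
        hidden = [h + gate * u for h, u in zip(hidden, update)]
    return hidden


# Stand-in block that halves every value (purely illustrative).
out = run_recursive_core([1.0, 2.0], lambda v: [0.5 * x for x in v])
```

Depth here comes from repeating the block rather than from stacking distinct layers, so the parameter count does not grow with `recurrence_steps`.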

TRM-textv3.5-MoE

Token Embedding
    ↓
Recursive Block × recurrence_steps
    ├─ RMSNorm
    ├─ Attention + RoPE
    ├─ Sparse MoE
    │    ├─ Router
    │    ├─ Expert 0
    │    ├─ Expert 1
    │    ├─ ...
    │    └─ Expert 31
    └─ Residual Gates
    ↓
RMSNorm
    ↓
LM Head
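
The Sparse MoE step in the diagram can be sketched for a single token as follows. This is a minimal plain-Python illustration, not the model's remote code: it assumes a bias-free linear router, a softmax over all experts, and renormalization of the selected top-k weights so they sum to 1 (a Mixtral-style choice). The function and variable names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    token: Vector,
    router_weights: Sequence[Vector],  # one weight row per expert
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    # Linear router: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Keep the top_k experts and renormalize their weights to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)  # only the selected experts are evaluated
        weight = probs[i] / norm
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Four toy experts; expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(4)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
out = moe_forward([1.0, 0.0], router_weights, experts)
```

In the toy call the router logits are 0, 1, 2, and 3, so only experts 3 and 2 run and the other two are skipped entirely.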

Configuration

| Parameter | Value |
| --- | --- |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Head Dimension | 64 |
| Vocabulary Size | 50,259 |
| Recurrence Steps | 4 |
| Max Sequence Length | 512 |
| Experts | 32 |
| Top-K Routing | 2 |
| Router Type | Linear |
| Auxiliary Loss | 0.01 |
| Router Z-Loss | 0.001 |
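
The last two rows are loss coefficients. A minimal sketch of how such terms are commonly computed, assuming the Switch Transformer form of the load-balancing loss (N · Σ f_i · P_i) and the ST-MoE form of the router z-loss (mean squared log-sum-exp of the router logits). Whether this model's remote code uses exactly these definitions is an assumption, and `router_losses` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_losses(
    router_logits: Sequence[Sequence[float]],  # shape: [tokens][experts]
    top_k: int = 2,
    aux_coef: float = 0.01,
    z_coef: float = 0.001,
) -> Tuple[float, float]:
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    dispatch = [0.0] * num_experts   # f_i: share of routing slots sent to expert i
    mean_prob = [0.0] * num_experts  # P_i: mean router probability for expert i
    z_total = 0.0
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in top:
            dispatch[i] += 1.0 / (num_tokens * top_k)
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        z_total += log_sum_exp ** 2
    balance = num_experts * sum(f * p for f, p in zip(dispatch, mean_prob))
    z_loss = z_total / num_tokens
    return aux_coef * balance, z_coef * z_loss
```

Both weighted terms would be added to the language-modeling loss. The balance term penalizes routing that concentrates on a few experts, and the z-loss discourages large router logits.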

Parameter Statistics

Approximate parameter counts:

  • Total Parameters: ~230.6M
  • Active Experts per Token: 2
  • Active Parameters per Token: ~70–80M
  • Shared Recursive Core: Preserved from TRM-textv3.5

Because only 2 of the 32 experts run for each token, the model can scale its total capacity without activating all parameters for every token.
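
The split between total and active parameters can be estimated with a short calculation. The sketch below uses the hidden size, vocabulary size, expert count, and top-k from the configuration. The expert intermediate size of 2048 and the untied LM head are guesses that are not stated in this card, and norms, gates, and biases are ignored.

```python
from typing import Dict


def count_parameters(
    hidden: int = 768,
    vocab: int = 50_259,
    num_experts: int = 32,
    top_k: int = 2,
    intermediate: int = 2048,       # assumption: not stated in the card
    tied_embeddings: bool = False,  # assumption: separate LM head matrix
) -> Dict[str, int]:
    """Rough parameter count for one shared recursive block plus embeddings."""
    embedding = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    attention = 4 * hidden * hidden     # Q, K, V, and output projections
    expert = 3 * hidden * intermediate  # SwiGLU: gate, up, and down projections
    router = hidden * num_experts
    shared = embedding + lm_head + attention + router
    return {
        "total": shared + num_experts * expert,
        "active": shared + top_k * expert,
    }


counts = count_parameters()
```

With these guessed values the total comes to 230,576,640, in line with the ~230.6M figure. The active count depends on which shared parameters are included (for example, whether the LM head is counted), so it need not match the estimate above.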


Weight Initialization

Dense TRM weights are converted into MoE experts as follows:

  • Expert 0: Exact copy of the original SwiGLU MLP.
  • Experts 1–31: Expert 0 weights plus small Gaussian perturbations.
  • Router: Randomly initialized using Kaiming uniform initialization.

Because every expert starts at or near the original MLP weights, this initialization keeps the converted model's behavior close to that of the dense model while enabling gradual expert specialization during training.
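
The expert part of this conversion can be sketched in plain Python on a single weight matrix. The noise scale `noise_std=1e-3` and the function name are hypothetical; the card only says the perturbations are small. Router initialization is left out of the sketch.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_experts(
    dense_weight: Matrix,
    num_experts: int = 32,
    noise_std: float = 1e-3,  # assumption: the actual scale is not documented
    seed: int = 0,
) -> List[Matrix]:
    """Clone one dense weight matrix into `num_experts` expert matrices."""
    rng = random.Random(seed)
    # Expert 0 is an exact copy of the dense weights.
    experts = [[row[:] for row in dense_weight]]
    # Every other expert is the dense weights plus independent Gaussian noise.
    for _ in range(num_experts - 1):
        experts.append(
            [[w + rng.gauss(0.0, noise_std) for w in row] for row in dense_weight]
        )
    return experts


dense = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
experts = make_experts(dense)
```

The noise breaks the symmetry between experts, so each one receives slightly different gradients once the router starts sending it different tokens.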


Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/TRM-text-MoEV1"

# The architecture is defined by custom code in the repository,
# so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

prompt = "Explain Mixture of Experts."

# Tokenize the prompt into PyTorch tensors.
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 100 new tokens after the prompt.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Continued Pretraining

Recommended settings:

learning_rate: 5e-5
router_learning_rate: 1e-4
weight_decay: 0.1
micro_batch_size: 1
gradient_accumulation: 32
sequence_length: 512
warmup_steps: 100
optimizer: AdamW
betas: [0.9, 0.95]
gradient_clipping: 1.0
mixed_precision: bf16
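
The separate router learning rate implies two optimizer parameter groups. A minimal sketch of the split, assuming router parameters can be recognized by the substring "router" in their names (a guess about the remote code's naming, so check the actual parameter names first). `build_param_groups` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, object]],
    base_lr: float = 5e-5,
    router_lr: float = 1e-4,
    weight_decay: float = 0.1,
    router_key: str = "router",  # assumption: substring marking router parameters
) -> List[Dict]:
    """Split parameters into a base group and a router group with its own LR."""
    router, other = [], []
    for name, param in named_params:
        (router if router_key in name else other).append(param)
    return [
        {"params": other, "lr": base_lr, "weight_decay": weight_decay},
        {"params": router, "lr": router_lr, "weight_decay": weight_decay},
    ]
```

With PyTorch, the result could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()), betas=(0.9, 0.95))`. At the settings above, each optimizer step covers 1 × 32 = 32 sequences, or at most 16,384 tokens.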

Datasets suitable for continued pretraining:

  • FineWeb
  • FineWeb-Edu
  • Japanese web corpora
  • Domain-specific datasets
  • Synthetic reasoning datasets

Fine-Tuning

The model supports:

  • Supervised Fine-Tuning (SFT)
  • DPO
  • LoRA
  • QLoRA
  • Knowledge Distillation
  • Multi-stage instruction tuning

Expert specialization often emerges during these stages.


Intended Use

TRM-textv3.5-MoE is intended for:

  • Research on sparse recursive language models
  • Efficient scaling of small LLMs
  • Expert specialization experiments
  • Instruction tuning research
  • Synthetic data generation pipelines
  • Educational and experimental applications

Limitations

  • Sequence length remains limited to 512 tokens.
  • Expert specialization is not guaranteed immediately after conversion.
  • Additional pretraining is required to fully utilize MoE capacity.
  • Recursive architectures may behave differently from conventional decoder-only Transformers.
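
One practical consequence of the 512-token limit is that long prompts need trimming before generation. A small sketch, assuming the limit covers the prompt and the generated tokens together and that keeping the most recent tokens is acceptable. `fit_prompt` is a hypothetical helper, not part of the model's code.

```python
from typing import List


def fit_prompt(
    token_ids: List[int],
    max_new_tokens: int,
    max_length: int = 512,
) -> List[int]:
    """Trim a prompt so that prompt + generated tokens fit in `max_length`."""
    budget = max_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Keep the most recent tokens; shorter prompts are returned unchanged.
    return token_ids[-budget:]


trimmed = fit_prompt(list(range(600)), max_new_tokens=100)
```

Here a 600-token prompt is cut to its last 412 tokens, leaving room for 100 generated tokens.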

Citation

@misc{trm_textv35_moe,
  title={TRM-textv3.5-MoE: Sparse Recursive Transformer with Mixture of Experts},
  author={summerMC},
  year={2026},
  howpublished={Hugging Face}
}

Acknowledgements

TRM-textv3.5-MoE builds upon the original TRM-textv3.5 architecture and incorporates ideas inspired by sparse expert systems such as Mixtral, Switch Transformer, and ST-MoE, adapted to a recursive Transformer framework.

TRM-textv3.5-MoE explores whether small recursive language models can achieve substantially higher effective capacity through sparse activation while maintaining efficient inference characteristics.
