Nemotron-Cascade-2-30B-A3B with Block Repeat (Layer 5)

Applying the RYS (Repeat Your Steps) layer-duplication method to NVIDIA's hybrid Mamba-2 + MoE + GQA architecture. Achieves a +6.7 percentage-point (pp) improvement on BBH-style reasoning benchmarks with no additional training and no weight changes.

Key Finding: Hybrid Architectures Allow Multiple Repeat Strategies

Unlike uniform Attention+MoE models (e.g. GPT-OSS-20B), where only attention-layer repetition helps, Nemotron-Cascade-2's hybrid architecture (Mamba-2 + MoE + GQA attention) shows that all three mixer types benefit from repetition at specific positions:

| Configuration | Score | Delta | Layer Type |
|---|---|---|---|
| attn-L5 | 11/15 | +6.7pp | GQA Attention (1st attention layer) |
| attn-L19,L26 | 11/15 | +6.7pp | GQA Attention pair |
| moe-L43,L45 | 11/15 | +6.7pp | MoE (late layers) |
| mamba-L44,L46 | 11/15 | +6.7pp | Mamba-2 (late layers) |
| baseline | 10/15 | 0.0pp | - |
| attn-L12 | 10/15 | 0.0pp | GQA Attention |
| attn-L33 | 6/15 | -26.7pp | GQA Attention (middle) |
| attn-ALL | 3/15 | -46.7pp | All 6 GQA layers |
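The deltas are simply each configuration's score minus the 10/15 baseline, expressed in percentage points. A quick check of the arithmetic:

```python
# Recompute the delta column from raw scores (out of 15 tasks).
baseline = 10 / 15

for score in (11, 10, 6, 3):
    delta_pp = (score / 15 - baseline) * 100
    print(f"{score}/15: {delta_pp:+.1f}pp")
# -> +6.7pp, +0.0pp, -26.7pp, -46.7pp
```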

Architecture

Nemotron-Cascade-2 has 52 layers with 3 types of mixer blocks:

Layer  0: Mamba2      Layer 26: Attention*
Layer  1: MoE         Layer 27: MoE
Layer  2: Mamba2      Layer 28: Mamba2
Layer  3: MoE         ...
Layer  4: Mamba2      Layer 33: Attention
Layer  5: Attention*  <-- THIS LAYER IS REPEATED
Layer  6: MoE         Layer 34: MoE
Layer  7: Mamba2      ...
...                   Layer 42: Attention
Layer 12: Attention   Layer 43: MoE
Layer 13: MoE         Layer 44: Mamba2
...                   ...
Layer 19: Attention   Layer 50: Mamba2
Layer 20: MoE         Layer 51: MoE
  • Mamba-2: 23 layers (SSM for efficient sequence modeling)
  • MoE: 23 layers (128 routed experts + 1 shared, top-6 routing)
  • GQA Attention: 6 layers at positions 5, 12, 19, 26, 33, 42
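The full 52-layer layout can be reconstructed from the listing above. One assumption, consistent with every layer shown: outside the six attention positions, Mamba-2 and MoE blocks simply alternate, starting with Mamba-2 at layer 0.

```python
# Reconstruct the mixer layout sketched above.
# Assumption: non-attention layers alternate Mamba-2 / MoE from layer 0.
ATTN_LAYERS = {5, 12, 19, 26, 33, 42}

def mixer_layout(n_layers=52):
    layout, non_attn = [], 0
    for i in range(n_layers):
        if i in ATTN_LAYERS:
            layout.append("attention")
        else:
            layout.append("mamba2" if non_attn % 2 == 0 else "moe")
            non_attn += 1
    return layout

layout = mixer_layout()
print(layout.count("mamba2"), layout.count("moe"), layout.count("attention"))
# -> 23 23 6, matching the counts above
```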

Cross-Model Comparison

| Model | Architecture | Best Config | Improvement |
|---|---|---|---|
| GPT-OSS-20B | Uniform (Attn+MoE) | Attn L19-20 only | +13.3pp |
| Nemotron-Cascade-2 | Hybrid (Mamba+MoE+GQA) | Any mixer at right position | +6.7pp |

Key insight: In uniform architectures, only the dense (Attention) component benefits from repetition. In hybrid architectures, all component types can benefit because each mixer type serves a distinct computational role.

Usage

With the loader script (recommended)

```python
from load_model import load_repeat_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_repeat_model()
sampler = make_sampler(temp=0.0)
response = generate(model, tokenizer, prompt="Your question here",
                    max_tokens=512, sampler=sampler)
print(response)
```

Manual patching

```python
import mlx.nn as nn
from mlx_lm import load

class BlockRepeatWrapper(nn.Module):
    """Runs the wrapped decoder block twice per forward pass."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        # Mirror attributes that the model code inspects on each layer.
        for attr in ["block_type"]:
            if hasattr(block, attr):
                setattr(self, attr, getattr(block, attr))

    def __call__(self, x, *args, **kwargs):
        h = self.block(x, *args, **kwargs)
        if isinstance(h, tuple):
            h = h[0]
        h2 = self.block(h, *args, **kwargs)
        if isinstance(h2, tuple):
            h2 = h2[0]
        return h2

model, tokenizer = load("mlx-community/Nemotron-Cascade-2-30B-A3B-4bit")
model.backbone.layers[5] = BlockRepeatWrapper(model.backbone.layers[5])
```
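The wrapper's only job is to feed the block's output back through the same block once more. This can be sanity-checked without loading the model, using a dependency-free analogue of the wrapper and a toy stand-in block (`AddOne` is hypothetical, not part of the model):

```python
# Framework-free check of the repeat semantics: wrapping a block
# should apply it exactly twice per call.
class AddOne:
    def __call__(self, x):
        return x + 1

class RepeatTwice:
    """Minimal analogue of BlockRepeatWrapper, without mlx."""
    def __init__(self, block):
        self.block = block

    def __call__(self, x):
        h = self.block(x)
        if isinstance(h, tuple):
            h = h[0]
        h2 = self.block(h)
        return h2[0] if isinstance(h2, tuple) else h2

layer = RepeatTwice(AddOne())
print(layer(0))  # -> 2: the block ran twice
```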

Alternative configurations (all +6.7pp)

```python
# MoE repeat (late layers)
for idx in [43, 45]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])

# Mamba-2 repeat (late layers)
for idx in [44, 46]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])

# Attention pair repeat
for idx in [19, 26]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])
```

Model Details

  • Base model: nvidia/Nemotron-Cascade-2-30B-A3B (via mlx-community 4bit)
  • Total params: ~30B (3B active per token)
  • Modification: Layer 5 (first GQA Attention) repeated once
  • Quantization: 4-bit
  • License: NVIDIA Open Model License

Method

Based on the llm-circuit-finder RYS method, extended here to hybrid Mamba+MoE+Attention architectures. The extension's key result is that, in hybrid models, all mixer types can benefit from repetition, not just attention.

Citation

```bibtex
@misc{nemotron-cascade-2-repeat,
  title={Layer Repeat for Hybrid MoE Models: RYS Method on Nemotron-Cascade-2},
  author={shi3z},
  year={2026},
  note={Based on RYS method. Extends attention-repeat findings from GPT-OSS-20B to hybrid architectures.}
}
```