Nemotron-Cascade-2-30B-A3B with Block Repeat (Layer 5)

Applying the RYS (Repeat Your Steps) layer-duplication method to NVIDIA's hybrid Mamba-2 + MoE + GQA architecture. Achieves a +6.7 percentage-point (pp) improvement on BBH-style reasoning benchmarks with no additional training and no weight changes.

Key Finding: Hybrid Architectures Allow Multiple Repeat Strategies

Unlike uniform Attention+MoE models (e.g. GPT-OSS-20B), where only attention-layer repetition helps, Nemotron-Cascade-2's hybrid architecture (Mamba-2 + MoE + GQA attention) shows that all three mixer types benefit from repetition at specific positions:

| Configuration | Score | Delta | Layer Type |
|---|---|---|---|
| attn-L5 | 11/15 | +6.7pp | GQA Attention (1st attention layer) |
| attn-L19,L26 | 11/15 | +6.7pp | GQA Attention pair |
| moe-L43,L45 | 11/15 | +6.7pp | MoE (late layers) |
| mamba-L44,L46 | 11/15 | +6.7pp | Mamba-2 (late layers) |
| baseline | 10/15 | 0.0pp | - |
| attn-L12 | 10/15 | 0.0pp | GQA Attention |
| attn-L33 | 6/15 | -26.7pp | GQA Attention (middle) |
| attn-ALL | 3/15 | -46.7pp | All 6 GQA layers |
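The deltas are simply each configuration's score minus the 10/15 baseline, expressed in percentage points. A quick check of the arithmetic:

```python
# Recompute the delta column from raw scores (out of 15 tasks).
baseline = 10 / 15

for score in (11, 10, 6, 3):
    delta_pp = (score / 15 - baseline) * 100
    print(f"{score}/15: {delta_pp:+.1f}pp")
# -> +6.7pp, +0.0pp, -26.7pp, -46.7pp
```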

Architecture

Nemotron-Cascade-2 has 52 layers with 3 types of mixer blocks:

Layer  0: Mamba2      Layer 26: Attention*
Layer  1: MoE         Layer 27: MoE
Layer  2: Mamba2      Layer 28: Mamba2
Layer  3: MoE         ...
Layer  4: Mamba2      Layer 33: Attention
Layer  5: Attention*  <-- THIS LAYER IS REPEATED
Layer  6: MoE         Layer 34: MoE
Layer  7: Mamba2      ...
...                   Layer 42: Attention
Layer 12: Attention   Layer 43: MoE
Layer 13: MoE         Layer 44: Mamba2
...                   ...
Layer 19: Attention   Layer 50: Mamba2
Layer 20: MoE         Layer 51: MoE
  • Mamba-2: 23 layers (SSM for efficient sequence modeling)
  • MoE: 23 layers (128 routed experts + 1 shared, top-6 routing)
  • GQA Attention: 6 layers at positions 5, 12, 19, 26, 33, 42
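The full 52-layer layout can be reconstructed from the listing above. One assumption, consistent with every layer shown: outside the six attention positions, Mamba-2 and MoE blocks simply alternate, starting with Mamba-2 at layer 0.

```python
# Reconstruct the mixer layout sketched above.
# Assumption: non-attention layers alternate Mamba-2 / MoE from layer 0.
ATTN_LAYERS = {5, 12, 19, 26, 33, 42}

def mixer_layout(n_layers=52):
    layout, non_attn = [], 0
    for i in range(n_layers):
        if i in ATTN_LAYERS:
            layout.append("attention")
        else:
            layout.append("mamba2" if non_attn % 2 == 0 else "moe")
            non_attn += 1
    return layout

layout = mixer_layout()
print(layout.count("mamba2"), layout.count("moe"), layout.count("attention"))
# -> 23 23 6, matching the counts above
```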

Cross-Model Comparison

| Model | Architecture | Best Config | Improvement |
|---|---|---|---|
| GPT-OSS-20B | Uniform (Attn+MoE) | Attn L19-20 only | +13.3pp |
| Nemotron-Cascade-2 | Hybrid (Mamba+MoE+GQA) | Any mixer at right position | +6.7pp |

Key insight: In uniform architectures, only the dense (Attention) component benefits from repetition. In hybrid architectures, all component types can benefit because each mixer type serves a distinct computational role.

Usage

With the loader script (recommended)

```python
from load_model import load_repeat_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_repeat_model()
sampler = make_sampler(temp=0.0)
response = generate(model, tokenizer, prompt="Your question here",
                    max_tokens=512, sampler=sampler)
print(response)
```

Manual patching

```python
import mlx.nn as nn
from mlx_lm import load

class BlockRepeatWrapper(nn.Module):
    """Runs the wrapped decoder block twice per forward pass."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        # Mirror attributes that the model code inspects on each layer.
        for attr in ["block_type"]:
            if hasattr(block, attr):
                setattr(self, attr, getattr(block, attr))

    def __call__(self, x, *args, **kwargs):
        h = self.block(x, *args, **kwargs)
        if isinstance(h, tuple):
            h = h[0]
        h2 = self.block(h, *args, **kwargs)
        if isinstance(h2, tuple):
            h2 = h2[0]
        return h2

model, tokenizer = load("mlx-community/Nemotron-Cascade-2-30B-A3B-4bit")
model.backbone.layers[5] = BlockRepeatWrapper(model.backbone.layers[5])
```
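The wrapper's only job is to feed the block's output back through the same block once more. This can be sanity-checked without loading the model, using a dependency-free analogue of the wrapper and a toy stand-in block (`AddOne` is hypothetical, not part of the model):

```python
# Framework-free check of the repeat semantics: wrapping a block
# should apply it exactly twice per call.
class AddOne:
    def __call__(self, x):
        return x + 1

class RepeatTwice:
    """Minimal analogue of BlockRepeatWrapper, without mlx."""
    def __init__(self, block):
        self.block = block

    def __call__(self, x):
        h = self.block(x)
        if isinstance(h, tuple):
            h = h[0]
        h2 = self.block(h)
        return h2[0] if isinstance(h2, tuple) else h2

layer = RepeatTwice(AddOne())
print(layer(0))  # -> 2: the block ran twice
```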

Alternative configurations (all +6.7pp)

```python
# MoE repeat (late layers)
for idx in [43, 45]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])

# Mamba-2 repeat (late layers)
for idx in [44, 46]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])

# Attention pair repeat
for idx in [19, 26]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])
```

Model Details

  • Base model: nvidia/Nemotron-Cascade-2-30B-A3B (via mlx-community 4bit)
  • Total params: ~30B (3B active per token)
  • Modification: Layer 5 (first GQA Attention) repeated once
  • Quantization: 4-bit
  • License: NVIDIA Open Model License

Method

Based on the llm-circuit-finder RYS method, extended here to hybrid Mamba+MoE+Attention architectures. The extension's key result is that, in hybrid models, all mixer types can benefit from repetition, not just attention.

Citation

```bibtex
@misc{nemotron-cascade-2-repeat,
  title={Layer Repeat for Hybrid MoE Models: RYS Method on Nemotron-Cascade-2},
  author={shi3z},
  year={2026},
  note={Based on RYS method. Extends attention-repeat findings from GPT-OSS-20B to hybrid architectures.}
}
```