Qwen3.5-REAP-20B-A3B

Qwen3.5-REAP-20B-A3B is a Mixture of Experts (MoE) model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen/Qwen3.5-35B-A3B.

35B → 20B (REAP 50% prune) | ~3B active per token | BF16

| Aspect | Original | Pruned (this model) |
|---|---|---|
| Model | Qwen/Qwen3.5-35B-A3B | Qwen3.5-REAP-20B-A3B |
| Total Parameters | ~35B | ~20B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 256 | 128 |
| Experts Routed per Token | 8 | 8 |
| Shared Expert | 1 per layer | 1 per layer (preserved) |
| Hidden Layers | 40 | 40 |
| Layer Types | 30 linear_attention + 10 full_attention | 30 linear_attention + 10 full_attention |
| Hidden Size | 2048 | 2048 |
| MoE Intermediate Size | 512 | 512 |
| Context Length | 262,144 | 262,144 |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~35 GB |

Key Achievement: 35B → 20B total parameters | ~3B active per token | 50% expert pruning | Shared expert preserved


Important: dtype Must Be bfloat16

This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use torch_dtype=torch.bfloat16 (or --dtype bfloat16 in vLLM). The model will produce garbage output with float16.
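The overflow itself is easy to demonstrate outside the model. A minimal numpy sketch (an illustration of the dtype ranges, not the model's actual GDN code — numpy has no native bfloat16, so float32, which shares bfloat16's 8-bit exponent, stands in):

```python
import numpy as np

# float16's largest finite value is 65504; GDN linear-attention
# intermediates can exceed it, overflowing to inf (and then NaN
# after further arithmetic).
x = np.float32(1e5)          # a magnitude past float16's range

fp16 = np.float16(x)
print(fp16, np.isinf(fp16))  # inf True

# bfloat16 keeps float32's 8-bit exponent (max ~3.4e38), so the same
# value is representable; the exponent range, not precision, is the point.
print(np.finfo(np.float16).max)   # 65504.0
print(np.finfo(np.float32).max)   # ~3.4e38
```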


How to Use

Requirements

  • transformers >= 5.x (from main branch) — the qwen3_5_moe model type is not in any released version yet
  • torch >= 2.4 with CUDA support
  • ~35 GB disk + ~40 GB VRAM (or use device_map="auto" for CPU offloading)

Install transformers from main:

```bash
pip install git+https://github.com/huggingface/transformers.git@main
```

With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-20B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Required — float16 causes NaN
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```

With vLLM

Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly). Requires --enforce-eager due to hybrid attention.

```bash
vllm serve atbender/Qwen3.5-REAP-20B-A3B \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 4096
```

| Parameter | Requirement | Why |
|---|---|---|
| `--dtype bfloat16` | Required | GDN linear attention overflows float16 |
| `--enforce-eager` | Recommended | Avoids torch.compile issues with hybrid linear/full attention |
| `--trust-remote-code` | Required | Qwen3.5 MoE uses custom model code |

For single GPU with limited VRAM:

```bash
vllm serve atbender/Qwen3.5-REAP-20B-A3B \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.85 \
  --cpu-offload-gb 10 \
  --max-model-len 2048
```

Docker

The pruning was done using this Docker image which has all dependencies pre-installed:

```dockerfile
FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3

RUN pip install --no-cache-dir \
    git+https://github.com/huggingface/transformers.git@main \
    git+https://github.com/huggingface/peft.git@main
```

Note: Both transformers and peft must be installed from main — the released versions don't support qwen3_5_moe and have a HybridCache import error respectively.


What Is REAP?

REAP (Router-weighted Expert Activation Pruning) is a pruning method from Cerebras Research that combines router gate statistics with expert activation norms to determine which experts to prune.

Saliency score per expert:

$$S_j = \frac{1}{|T_j|} \sum_{t \in T_j} g_{t,j} \cdot \|e_{t,j}\|_2$$

Where:

  • $T_j$ = tokens routed to expert $j$ across all calibration data
  • $g_{t,j}$ = softmax gate probability for token $t$, expert $j$
  • $e_{t,j}$ = expert $j$'s output for token $t$

Unlike frequency-only pruning, REAP weighs each expert's contribution by both how often it's selected and how much it actually activates.
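The score above can be accumulated in a few lines. A toy numpy sketch (random stand-ins for gate probabilities and expert outputs; shapes and names are illustrative, not the REAP repo's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, hidden, top_k = 64, 8, 16, 2

# Toy calibration pass: gate probabilities and per-expert outputs.
gate_probs = rng.random((num_tokens, num_experts))
gate_probs /= gate_probs.sum(axis=1, keepdims=True)      # softmax-like rows
topk_idx = np.argsort(-gate_probs, axis=1)[:, :top_k]    # experts each token routes to
expert_out = rng.standard_normal((num_tokens, num_experts, hidden))

# S_j = mean over tokens routed to j of g_{t,j} * ||e_{t,j}||_2
saliency = np.zeros(num_experts)
counts = np.zeros(num_experts)
for t in range(num_tokens):
    for j in topk_idx[t]:
        saliency[j] += gate_probs[t, j] * np.linalg.norm(expert_out[t, j])
        counts[j] += 1
saliency /= np.maximum(counts, 1)

# 50% prune: keep the top half of experts by saliency.
keep = np.sort(np.argsort(-saliency)[: num_experts // 2])
print(keep)
```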

Changes Made:

  • 50% of MoE experts pruned globally (256 → 128 per layer, across all 40 layers)
  • Fused expert tensors (gate_up_proj, down_proj) sliced along expert dimension
  • Router gate weights pruned accordingly (only retained expert rows kept)
  • Routing mechanism (num_experts_per_tok = 8) unchanged
  • Shared expert and shared expert gate untouched
  • Attention layers (both linear and full), embeddings, and all non-MoE components untouched
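The tensor slicing described above amounts to indexing each stacked weight along the expert dimension and dropping the matching router rows. A toy numpy sketch (real sizes are num_experts=256, hidden=2048, intermediate=512; `keep_indices` here is a placeholder for the saliency-chosen set):

```python
import numpy as np

# Toy sizes (real model: num_experts=256, hidden=2048, intermediate=512).
num_experts, hidden, intermediate = 8, 32, 16

# Stacked expert weights, as in the fused Qwen3_5MoeExperts layout.
gate_up_proj = np.zeros((num_experts, 2 * intermediate, hidden))
down_proj = np.zeros((num_experts, hidden, intermediate))
router_weight = np.zeros((num_experts, hidden))   # one gate row per expert

keep_indices = np.array([0, 2, 3, 5])   # placeholder: indices from the saliency pass

gate_up_proj = gate_up_proj[keep_indices]   # slice along the expert dimension
down_proj = down_proj[keep_indices]
router_weight = router_weight[keep_indices] # drop pruned experts' gate rows
num_experts = len(keep_indices)

print(gate_up_proj.shape)  # (4, 32, 32)
```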

Architecture Notes & Workarounds

Qwen3.5-35B-A3B uses a hybrid attention architecture distinct from previous Qwen models:

  • 30 linear attention layers using GDN (Gated Delta Network) with conv1d kernels
  • 10 full attention layers (standard multi-head attention with RoPE) at every 4th layer
  • Fused expert format: Experts are stored as Qwen3_5MoeExperts with stacked tensors (gate_up_proj: [num_experts, 2*intermediate, hidden], down_proj: [num_experts, hidden, intermediate]), not individual nn.ModuleList entries
  • Shared expert: Each MoE layer has a shared expert (always active) + shared expert gate
  • Custom router: Qwen3_5MoeTopKRouter returns (softmax_probs, normalized_topk_scores, topk_indices) — different from standard Qwen3 MoE
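The 30/10 split follows directly from the "full attention at every 4th layer" pattern. A small sanity check (assuming 0-indexed layers with full attention at indices 3, 7, 11, …; the exact offset is an assumption, the 30/10 count is not):

```python
# Full attention at every 4th layer, linear (GDN) attention everywhere else.
num_layers = 40
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(num_layers)
]
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))
# 30 10
```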

Workarounds applied during pruning

  1. Fused expert detection: Standard REAP code assumes experts are nn.ModuleList. Qwen3.5 uses Qwen3_5MoeExperts with stacked tensors. Detection changed to check hasattr(module.experts, "gate_up_proj") instead of isinstance(module.experts, nn.ModuleList).

  2. Manual expert computation for saliency: Since experts are fused, saliency collection manually computes each expert's output — F.linear(tokens, experts.gate_up_proj[j]), then the gated activation, then F.linear(intermediate, experts.down_proj[j]) — instead of calling individual expert modules.

  3. Pruning via tensor slicing: Instead of removing modules from a list, pruning slices experts.gate_up_proj.data[keep_indices] and experts.down_proj.data[keep_indices] along the expert dimension, then updates experts.num_experts.

  4. Config path: This is Qwen3_5MoeForCausalLM (text-only), so num_experts is at model.config.num_experts, not model.config.text_config.num_experts (which applies to the multimodal ConditionalGeneration variant).
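Workaround 2's manual forward pass can be sketched in numpy, assuming the SwiGLU-style gate_up/down structure typical of Qwen MoE experts (the SiLU activation and the gate/up split order are assumptions here, not read from the repo; shapes are toy sizes):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

# Toy shapes (real model: hidden=2048, intermediate=512).
hidden, intermediate, n_tok = 16, 8, 4
rng = np.random.default_rng(0)

gate_up_j = rng.standard_normal((2 * intermediate, hidden))  # experts.gate_up_proj[j]
down_j = rng.standard_normal((hidden, intermediate))         # experts.down_proj[j]
tokens = rng.standard_normal((n_tok, hidden))

# Equivalent of F.linear(tokens, gate_up_j), then split into gate/up halves.
gate, up = np.split(tokens @ gate_up_j.T, 2, axis=-1)
intermediate_act = silu(gate) * up       # SwiGLU-style activation (assumed)
e_tj = intermediate_act @ down_j.T       # equivalent of F.linear(..., down_j)

print(e_tj.shape)  # (4, 16) — expert j's output for each token
```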


Calibration Data

256 samples of 2,048 tokens from NeelNanda/pile-10k:

| Property | Value |
|---|---|
| Dataset | NeelNanda/pile-10k |
| Samples | 256 |
| Sequence Length | 2,048 |
| Seed | 42 |

Hardware & Runtime

| Stage | Time | Hardware |
|---|---|---|
| Saliency collection (256 samples) | ~8 min | 1× NVIDIA RTX PRO 6000 Blackwell 98GB |
| Expert pruning | <1 sec | CPU |
| Model save | ~2 min | SSD |
| Total pipeline | ~10 min | |

Intended Use

  • Research on MoE pruning and compression techniques for Qwen3.5 hybrid attention models
  • Base model for further compression (W4A16 quantization → atbender/Qwen3.5-REAP-20B-A3B-W4A16)
  • Exploring sparsity-performance trade-offs in dense-expert MoE architectures (256 experts is unusually high)
  • Local deployment of a smaller Qwen3.5 MoE variant

Limitations

  • No post-pruning fine-tuning — raw prune, quality degradation expected on tail tasks
  • Aggressive compression — 50% expert removal is significant; some capabilities will be lost
  • Calibration bias — General text (pile-10k) calibration; code-heavy or domain-specific tasks may be disproportionately affected
  • Not benchmarked — no formal evals run yet; contributions welcome
  • Text-only — This is the CausalLM variant, not the multimodal ConditionalGeneration model
  • Requires transformers from main — the qwen3_5_moe model type is not in any released version yet

Citation

@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

Acknowledgments

  • Qwen team — Qwen3.5-35B-A3B base model
  • Cerebras Research — REAP pruning method

License

Apache License 2.0 — same as the base Qwen3.5-35B-A3B model.
