Qwen3.5-REAP-20B-A3B

Qwen3.5-REAP-20B-A3B is a Mixture of Experts (MoE) model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen/Qwen3.5-35B-A3B.

35B → 20B (REAP 50% prune) | ~3B active per token | BF16

| Aspect | Original | Pruned (this model) |
|---|---|---|
| Model | Qwen/Qwen3.5-35B-A3B | Qwen3.5-REAP-20B-A3B |
| Total Parameters | ~35B | ~20B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 256 | 128 |
| Experts Routed per Token | 8 | 8 |
| Shared Expert | 1 per layer | 1 per layer (preserved) |
| Hidden Layers | 40 | 40 |
| Layer Types | 30 linear_attention + 10 full_attention | 30 linear_attention + 10 full_attention |
| Hidden Size | 2048 | 2048 |
| MoE Intermediate Size | 512 | 512 |
| Context Length | 262,144 | 262,144 |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~35 GB |

Key Achievement: 35B → 20B total parameters | ~3B active per token | 50% expert pruning | Shared expert preserved


Important: dtype Must Be bfloat16

This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use torch_dtype=torch.bfloat16 (or --dtype bfloat16 in vLLM). The model will produce garbage output with float16.
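The overflow itself is easy to demonstrate outside the model. A minimal numpy sketch (an illustration of the dtype ranges, not the model's actual GDN code — numpy has no native bfloat16, so float32, which shares bfloat16's 8-bit exponent, stands in):

```python
import numpy as np

# float16's largest finite value is 65504; GDN linear-attention
# intermediates can exceed it, overflowing to inf (and then NaN
# after further arithmetic).
x = np.float32(1e5)          # a magnitude past float16's range

fp16 = np.float16(x)
print(fp16, np.isinf(fp16))  # inf True

# bfloat16 keeps float32's 8-bit exponent (max ~3.4e38), so the same
# value is representable; the exponent range, not precision, is the point.
print(np.finfo(np.float16).max)   # 65504.0
print(np.finfo(np.float32).max)   # ~3.4e38
```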


How to Use

Requirements

  • transformers >= 5.x (from main branch) — the qwen3_5_moe model type is not in any released version yet
  • torch >= 2.4 with CUDA support
  • ~35 GB disk + ~40 GB VRAM (or use device_map="auto" for CPU offloading)

Install transformers from main:

```bash
pip install git+https://github.com/huggingface/transformers.git@main
```

With transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-20B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Required — float16 causes NaN
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```

With vLLM

Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly). Requires --enforce-eager due to hybrid attention.

```bash
vllm serve atbender/Qwen3.5-REAP-20B-A3B \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 4096
```

| Parameter | Requirement | Why |
|---|---|---|
| `--dtype bfloat16` | Required | GDN linear attention overflows float16 |
| `--enforce-eager` | Recommended | Avoids torch.compile issues with hybrid linear/full attention |
| `--trust-remote-code` | Required | Qwen3.5 MoE uses custom model code |

For single GPU with limited VRAM:

```bash
vllm serve atbender/Qwen3.5-REAP-20B-A3B \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.85 \
  --cpu-offload-gb 10 \
  --max-model-len 2048
```

Docker

The pruning was done using this Docker image which has all dependencies pre-installed:

```dockerfile
FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3

RUN pip install --no-cache-dir \
    git+https://github.com/huggingface/transformers.git@main \
    git+https://github.com/huggingface/peft.git@main
```

Note: Both transformers and peft must be installed from main — the released versions don't support qwen3_5_moe and have a HybridCache import error respectively.


What Is REAP?

REAP (Router-weighted Expert Activation Pruning) is a pruning method from Cerebras Research that combines router gate statistics with expert activation norms to determine which experts to prune.

Saliency score per expert:

$$S_j = \frac{1}{|T_j|} \sum_{t \in T_j} g_{t,j} \cdot \|e_{t,j}\|_2$$

Where:

  • $T_j$ = tokens routed to expert $j$ across all calibration data
  • $g_{t,j}$ = softmax gate probability for token $t$, expert $j$
  • $e_{t,j}$ = expert $j$'s output for token $t$

Unlike frequency-only pruning, REAP weighs each expert's contribution by both how often it's selected and how much it actually activates.
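The score above can be accumulated in a few lines. A toy numpy sketch (random stand-ins for gate probabilities and expert outputs; shapes and names are illustrative, not the REAP repo's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, hidden, top_k = 64, 8, 16, 2

# Toy calibration pass: gate probabilities and per-expert outputs.
gate_probs = rng.random((num_tokens, num_experts))
gate_probs /= gate_probs.sum(axis=1, keepdims=True)      # softmax-like rows
topk_idx = np.argsort(-gate_probs, axis=1)[:, :top_k]    # experts each token routes to
expert_out = rng.standard_normal((num_tokens, num_experts, hidden))

# S_j = mean over tokens routed to j of g_{t,j} * ||e_{t,j}||_2
saliency = np.zeros(num_experts)
counts = np.zeros(num_experts)
for t in range(num_tokens):
    for j in topk_idx[t]:
        saliency[j] += gate_probs[t, j] * np.linalg.norm(expert_out[t, j])
        counts[j] += 1
saliency /= np.maximum(counts, 1)

# 50% prune: keep the top half of experts by saliency.
keep = np.sort(np.argsort(-saliency)[: num_experts // 2])
print(keep)
```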

Changes Made:

  • 50% of MoE experts pruned globally (256 → 128 per layer, across all 40 layers)
  • Fused expert tensors (gate_up_proj, down_proj) sliced along expert dimension
  • Router gate weights pruned accordingly (only retained expert rows kept)
  • Routing mechanism (num_experts_per_tok = 8) unchanged
  • Shared expert and shared expert gate untouched
  • Attention layers (both linear and full), embeddings, and all non-MoE components untouched
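The tensor slicing described above amounts to indexing each stacked weight along the expert dimension and dropping the matching router rows. A toy numpy sketch (real sizes are num_experts=256, hidden=2048, intermediate=512; `keep_indices` here is a placeholder for the saliency-chosen set):

```python
import numpy as np

# Toy sizes (real model: num_experts=256, hidden=2048, intermediate=512).
num_experts, hidden, intermediate = 8, 32, 16

# Stacked expert weights, as in the fused Qwen3_5MoeExperts layout.
gate_up_proj = np.zeros((num_experts, 2 * intermediate, hidden))
down_proj = np.zeros((num_experts, hidden, intermediate))
router_weight = np.zeros((num_experts, hidden))   # one gate row per expert

keep_indices = np.array([0, 2, 3, 5])   # placeholder: indices from the saliency pass

gate_up_proj = gate_up_proj[keep_indices]   # slice along the expert dimension
down_proj = down_proj[keep_indices]
router_weight = router_weight[keep_indices] # drop pruned experts' gate rows
num_experts = len(keep_indices)

print(gate_up_proj.shape)  # (4, 32, 32)
```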

Architecture Notes & Workarounds

Qwen3.5-35B-A3B uses a hybrid attention architecture distinct from previous Qwen models:

  • 30 linear attention layers using GDN (Gated Delta Network) with conv1d kernels
  • 10 full attention layers (standard multi-head attention with RoPE) at every 4th layer
  • Fused expert format: Experts are stored as Qwen3_5MoeExperts with stacked tensors (gate_up_proj: [num_experts, 2*intermediate, hidden], down_proj: [num_experts, hidden, intermediate]), not individual nn.ModuleList entries
  • Shared expert: Each MoE layer has a shared expert (always active) + shared expert gate
  • Custom router: Qwen3_5MoeTopKRouter returns (softmax_probs, normalized_topk_scores, topk_indices) — different from standard Qwen3 MoE
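The 30/10 split follows directly from the "full attention at every 4th layer" pattern. A small sanity check (assuming 0-indexed layers with full attention at indices 3, 7, 11, …; the exact offset is an assumption, the 30/10 count is not):

```python
# Full attention at every 4th layer, linear (GDN) attention everywhere else.
num_layers = 40
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(num_layers)
]
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))
# 30 10
```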

Workarounds applied during pruning

  1. Fused expert detection: Standard REAP code assumes experts are nn.ModuleList. Qwen3.5 uses Qwen3_5MoeExperts with stacked tensors. Detection changed to check hasattr(module.experts, "gate_up_proj") instead of isinstance(module.experts, nn.ModuleList).

  2. Manual expert computation for saliency: Since experts are fused, saliency collection manually computes each expert's output — F.linear(tokens, experts.gate_up_proj[j]), then the gated activation, then F.linear(intermediate, experts.down_proj[j]) — instead of calling individual expert modules.

  3. Pruning via tensor slicing: Instead of removing modules from a list, pruning slices experts.gate_up_proj.data[keep_indices] and experts.down_proj.data[keep_indices] along the expert dimension, then updates experts.num_experts.

  4. Config path: This is Qwen3_5MoeForCausalLM (text-only), so num_experts is at model.config.num_experts, not model.config.text_config.num_experts (which applies to the multimodal ConditionalGeneration variant).
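Workaround 2's manual forward pass can be sketched in numpy, assuming the SwiGLU-style gate_up/down structure typical of Qwen MoE experts (the SiLU activation and the gate/up split order are assumptions here, not read from the repo; shapes are toy sizes):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

# Toy shapes (real model: hidden=2048, intermediate=512).
hidden, intermediate, n_tok = 16, 8, 4
rng = np.random.default_rng(0)

gate_up_j = rng.standard_normal((2 * intermediate, hidden))  # experts.gate_up_proj[j]
down_j = rng.standard_normal((hidden, intermediate))         # experts.down_proj[j]
tokens = rng.standard_normal((n_tok, hidden))

# Equivalent of F.linear(tokens, gate_up_j), then split into gate/up halves.
gate, up = np.split(tokens @ gate_up_j.T, 2, axis=-1)
intermediate_act = silu(gate) * up       # SwiGLU-style activation (assumed)
e_tj = intermediate_act @ down_j.T       # equivalent of F.linear(..., down_j)

print(e_tj.shape)  # (4, 16) — expert j's output for each token
```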


Calibration Data

256 samples of 2,048 tokens from NeelNanda/pile-10k:

| Property | Value |
|---|---|
| Dataset | NeelNanda/pile-10k |
| Samples | 256 |
| Sequence Length | 2,048 |
| Seed | 42 |

Hardware & Runtime

| Stage | Time | Hardware |
|---|---|---|
| Saliency collection (256 samples) | ~8 min | 1× NVIDIA RTX PRO 6000 Blackwell 98GB |
| Expert pruning | <1 sec | CPU |
| Model save | ~2 min | SSD |
| Total pipeline | ~10 min | |

Intended Use

  • Research on MoE pruning and compression techniques for Qwen3.5 hybrid attention models
  • Base model for further compression (W4A16 quantization → atbender/Qwen3.5-REAP-20B-A3B-W4A16)
  • Exploring sparsity-performance trade-offs in dense-expert MoE architectures (256 experts is unusually high)
  • Local deployment of a smaller Qwen3.5 MoE variant

Limitations

  • No post-pruning fine-tuning — raw prune, quality degradation expected on tail tasks
  • Aggressive compression — 50% expert removal is significant; some capabilities will be lost
  • Calibration bias — General text (pile-10k) calibration; code-heavy or domain-specific tasks may be disproportionately affected
  • Not benchmarked — no formal evals run yet; contributions welcome
  • Text-only — This is the CausalLM variant, not the multimodal ConditionalGeneration model
  • Requires transformers from main — the qwen3_5_moe model type is not in any released version yet

Citation

@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

Acknowledgments

  • Qwen team — Qwen3.5-35B-A3B base model
  • Cerebras Research — REAP pruning method

License

Apache License 2.0 — same as the base Qwen3.5-35B-A3B model.
