# Qwen3.5-REAP-20B-A3B
Qwen3.5-REAP-20B-A3B is a Mixture of Experts (MoE) model created by applying Router-weighted Expert Activation Pruning (REAP) to Qwen/Qwen3.5-35B-A3B.
35B → 20B (REAP 50% prune) | ~3B active per token | BF16
| Aspect | Original | Pruned (this model) |
|---|---|---|
| Model | Qwen/Qwen3.5-35B-A3B | Qwen3.5-REAP-20B-A3B |
| Total Parameters | ~35B | ~20B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 256 | 128 |
| Experts Routed per Token | 8 | 8 |
| Shared Expert | 1 per layer | 1 per layer (preserved) |
| Hidden Layers | 40 | 40 |
| Layer Types | 30 linear_attention + 10 full_attention | 30 linear_attention + 10 full_attention |
| Hidden Size | 2048 | 2048 |
| MoE Intermediate Size | 512 | 512 |
| Context Length | 262,144 | 262,144 |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~35 GB |
Key Achievement: 35B → 20B total parameters | ~3B active per token | 50% expert pruning | Shared expert preserved
## Important: dtype Must Be bfloat16
This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use `torch_dtype=torch.bfloat16` (or `--dtype bfloat16` in vLLM). The model will produce garbage output with float16.
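A quick numpy check makes the range issue concrete (numpy has no bfloat16 type, so bfloat16's much larger range is stated in a comment rather than computed):

```python
import numpy as np

# float16 has only 5 exponent bits, so its largest finite value is 65504.
print(np.finfo(np.float16).max)  # 65504.0

# Any intermediate activation above that overflows to inf (and inf - inf
# or 0 * inf downstream becomes NaN):
print(np.float16(1e5))           # inf

# bfloat16 keeps float32's 8 exponent bits (max ~3.4e38), so the large
# intermediate values produced by the GDN linear attention layers still fit.
```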
## How to Use

### Requirements
- transformers >= 5.x (from the `main` branch) — the `qwen3_5_moe` model type is not in any released version yet
- torch >= 2.4 with CUDA support
- ~35 GB disk + ~40 GB VRAM (or use `device_map="auto"` for CPU offloading)
Install transformers from main:

```shell
pip install git+https://github.com/huggingface/transformers.git@main
```
### With transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-20B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Required — float16 causes NaN
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to compute fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
### With vLLM

Tested with vLLM v0.17.0rc1 (`vllm/vllm-openai:cu130-nightly`). Requires `--enforce-eager` due to hybrid attention.
```shell
vllm serve atbender/Qwen3.5-REAP-20B-A3B \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 4096
```
| Parameter | Value | Why |
|---|---|---|
| `--dtype bfloat16` | Required | GDN linear attention overflows float16 |
| `--enforce-eager` | Recommended | Avoids torch.compile issues with hybrid linear/full attention |
| `--trust-remote-code` | Required | Qwen3.5 MoE uses custom model code |
For single GPU with limited VRAM:
```shell
vllm serve atbender/Qwen3.5-REAP-20B-A3B \
  --dtype bfloat16 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.85 \
  --cpu-offload-gb 10 \
  --max-model-len 2048
```
### Docker
The pruning was done using this Docker image which has all dependencies pre-installed:
```dockerfile
FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3

RUN pip install --no-cache-dir \
    git+https://github.com/huggingface/transformers.git@main \
    git+https://github.com/huggingface/peft.git@main
```
Note: Both `transformers` and `peft` must be installed from `main` — the released versions don't support `qwen3_5_moe` and have a `HybridCache` import error, respectively.
## What Is REAP?
REAP (Router-weighted Expert Activation Pruning) is a pruning method from Cerebras Research that combines router gate statistics with expert activation norms to determine which experts to prune.
Saliency score per expert:

$$S_j = \frac{1}{|T_j|} \sum_{t \in T_j} g_{t,j} \, \lVert e_{t,j} \rVert_2$$
Where:
- $T_j$ = tokens routed to expert $j$ across all calibration data
- $g_{t,j}$ = softmax gate probability for token $t$, expert $j$
- $e_{t,j}$ = expert $j$'s output for token $t$
Unlike frequency-only pruning, REAP weighs each expert's contribution by both how often it's selected and how much it actually activates.
Changes Made:
- 50% of MoE experts pruned globally (256 → 128 per layer, across all 40 layers)
- Fused expert tensors (`gate_up_proj`, `down_proj`) sliced along the expert dimension
- Router gate weights pruned accordingly (only the rows for retained experts kept)
- Routing mechanism (`num_experts_per_tok = 8`) unchanged
- Shared expert and shared expert gate untouched
- Attention layers (both linear and full), embeddings, and all non-MoE components untouched
## Architecture Notes & Workarounds
Qwen3.5-35B-A3B uses a hybrid attention architecture distinct from previous Qwen models:
- 30 linear attention layers using GDN (Gated Delta Network) with conv1d kernels
- 10 full attention layers (standard multi-head attention with RoPE) at every 4th layer
- Fused expert format: Experts are stored as `Qwen3_5MoeExperts` with stacked tensors (`gate_up_proj: [num_experts, 2*intermediate, hidden]`, `down_proj: [num_experts, hidden, intermediate]`), not individual `nn.ModuleList` entries
- Shared expert: Each MoE layer has a shared expert (always active) + shared expert gate
- Custom router: `Qwen3_5MoeTopKRouter` returns `(softmax_probs, normalized_topk_scores, topk_indices)` — different from standard Qwen3 MoE
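A schematic numpy version of what such a top-k router computes (the return triple mirrors the description above; this is not the actual `Qwen3_5MoeTopKRouter` code):

```python
import numpy as np

def topk_route(router_logits, k=8):
    """Schematic top-k MoE routing: full softmax over experts, then the
    top-k experts per token with their scores renormalized to sum to 1."""
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)  # softmax_probs
    topk_indices = np.argsort(probs, axis=-1)[:, ::-1][:, :k]              # topk_indices
    topk_scores = np.take_along_axis(probs, topk_indices, axis=-1)
    normalized = topk_scores / topk_scores.sum(axis=-1, keepdims=True)     # normalized_topk_scores
    return probs, normalized, topk_indices

# one token, 4 experts, route to the top 2
probs, scores, idx = topk_route(np.array([[1.0, 2.0, 0.5, 3.0]]), k=2)
print(idx[0], scores[0])  # experts 3 and 1, renormalized scores summing to 1
```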
### Workarounds applied during pruning
1. Fused expert detection: Standard REAP code assumes experts are an `nn.ModuleList`. Qwen3.5 uses `Qwen3_5MoeExperts` with stacked tensors. Detection changed to check `hasattr(module.experts, "gate_up_proj")` instead of `isinstance(module.experts, nn.ModuleList)`.
2. Manual expert computation for saliency: Since experts are fused, saliency collection manually computes each expert's output using `F.linear(tokens, experts.gate_up_proj[j])` + `F.linear(intermediate, experts.down_proj[j])` instead of calling individual expert modules.
3. Pruning via tensor slicing: Instead of removing modules from a list, pruning slices `experts.gate_up_proj.data[keep_indices]` and `experts.down_proj.data[keep_indices]` along the expert dimension, then updates `experts.num_experts`.
4. Config path: This is `Qwen3_5MoeForCausalLM` (text-only), so `num_experts` is at `model.config.num_experts`, not `model.config.text_config.num_experts` (which applies to the multimodal `ConditionalGeneration` variant).
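In numpy terms (toy stand-in shapes; the real code slices torch parameter `.data` the same way), the slicing step looks like:

```python
import numpy as np

# Stand-in fused expert tensors with toy shapes (real model: 256 experts,
# hidden size 2048, MoE intermediate size 512).
num_experts, hidden, inter = 8, 16, 4
gate_up_proj = np.random.rand(num_experts, 2 * inter, hidden)
down_proj = np.random.rand(num_experts, hidden, inter)
router_weight = np.random.rand(num_experts, hidden)  # one row per expert

# Suppose saliency scoring selected these experts to keep (top 50%):
keep_indices = np.array([0, 2, 3, 5])

# Pruning is just slicing every expert-indexed tensor with the same indices:
gate_up_proj = gate_up_proj[keep_indices]
down_proj = down_proj[keep_indices]
router_weight = router_weight[keep_indices]
num_experts = len(keep_indices)

print(gate_up_proj.shape, down_proj.shape, router_weight.shape, num_experts)
```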
## Calibration Data
256 samples of 2,048 tokens from NeelNanda/pile-10k:
| Property | Value |
|---|---|
| Dataset | NeelNanda/pile-10k |
| Samples | 256 |
| Sequence Length | 2,048 |
| Seed | 42 |
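The sampling can be sketched as follows (the corpus here is a random stand-in for tokenized pile-10k text; the real pipeline tokenizes the dataset first):

```python
import numpy as np

rng = np.random.default_rng(seed=42)              # seed from the table above
corpus = rng.integers(0, 32000, size=1_000_000)   # fake token ids as a stand-in

seq_len, n_samples = 2048, 256
n_windows = len(corpus) // seq_len
windows = corpus[: n_windows * seq_len].reshape(n_windows, seq_len)

# Draw 256 distinct 2,048-token windows for saliency collection
idx = rng.choice(n_windows, size=n_samples, replace=False)
calib = windows[idx]
print(calib.shape)  # (256, 2048)
```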
## Hardware & Runtime
| Stage | Time | Hardware |
|---|---|---|
| Saliency collection (256 samples) | ~8 min | 1× NVIDIA RTX PRO 6000 Blackwell 98GB |
| Expert pruning | <1 sec | CPU |
| Model save | ~2 min | SSD |
| Total pipeline | ~10 min | |
## Intended Use
- Research on MoE pruning and compression techniques for Qwen3.5 hybrid attention models
- Base model for further compression (W4A16 quantization → atbender/Qwen3.5-REAP-20B-A3B-W4A16)
- Exploring sparsity-performance trade-offs in dense-expert MoE architectures (256 experts is unusually high)
- Local deployment of a smaller Qwen3.5 MoE variant
## Limitations
- No post-pruning fine-tuning — raw prune, quality degradation expected on tail tasks
- Aggressive compression — 50% expert removal is significant; some capabilities will be lost
- Calibration bias — General text (pile-10k) calibration; code-heavy or domain-specific tasks may be disproportionately affected
- Not benchmarked — no formal evals run yet; contributions welcome
- Text-only — This is the CausalLM variant, not the multimodal ConditionalGeneration model
- Requires transformers from main — the `qwen3_5_moe` model type is not in any released version yet
## Citation
```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}
```
## Acknowledgments
- Qwen team — Qwen3.5-35B-A3B base model
- Cerebras Research — REAP pruning method
## License
Apache License 2.0 — same as the base Qwen3.5-35B-A3B model.