# Qwen3-Coder-Next 56B REAP

A 30% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | This Model |
|---|---|---|
| Total params | ~80B | 56.56B |
| Experts | 512 | 359 |
| Active params/tok | ~4.2B | ~4.2B |
| Experts/tok | 10 | 10 |
| Format | BF16 | BF16 |
| Disk size | ~149 GB | ~113 GB |
REAP removes 30% of the MoE experts in each layer (153 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged, since the router still selects 10 experts per token from the remaining pool. This yields a ~24% reduction in total disk/memory footprint at the cost of moderate quality degradation, concentrated in math tasks.
## Method

REAP (ICLR 2026) prunes Mixture-of-Experts models in one shot, scoring each expert's importance and then removing the least salient ones. Its components:

- Router gate values -- how often and how strongly the router selects each expert
- Expert activation norms -- the magnitude of each expert's output contribution
- Frequency-weighted saliency -- routing frequency combined with activation importance
- Router logit renormalization -- preserves the output distribution after experts are removed
- Layerwise application -- independent per-layer pruning decisions for stability
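The scoring idea can be illustrated with a toy sketch. This is not the actual REAP implementation: the token count, expert count, and the random gate/norm statistics are invented here purely to show how gate-weighted activation norms turn into a per-expert saliency and a pruning decision.

```python
import random

random.seed(42)
n_tokens, n_experts, top_k = 1000, 8, 2

# Toy calibration statistics: per token, router gate values for each expert and
# the norm of each expert's output (both invented for illustration).
gates = [[random.random() for _ in range(n_experts)] for _ in range(n_tokens)]
norms = [[0.5 + random.random() for _ in range(n_experts)] for _ in range(n_tokens)]

# Accumulate gate-weighted activation norms, counting only tokens where the
# expert was actually selected (top-k routing).
score_sum = [0.0] * n_experts
hit_count = [0] * n_experts
for g_row, n_row in zip(gates, norms):
    topk = sorted(range(n_experts), key=lambda j: -g_row[j])[:top_k]
    for j in topk:
        score_sum[j] += g_row[j] * n_row[j]
        hit_count[j] += 1

# REAP-style saliency: average contribution over the tokens routed to the expert.
saliency = [s / max(c, 1) for s, c in zip(score_sum, hit_count)]

# Prune the 30% of experts with the lowest saliency in this layer.
n_prune = int(0.30 * n_experts)
pruned = sorted(range(n_experts), key=lambda j: saliency[j])[:n_prune]
kept = [j for j in range(n_experts) if j not in pruned]
print(f"kept {len(kept)}/{n_experts} experts; pruned {sorted(pruned)}")
```

At the real scale this runs per layer over the calibration set, dropping 153 of 512 experts per layer; the router then renormalizes its weights over the 359 survivors.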
## Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:
| Category | Samples | Source |
|---|---|---|
| Coding (general) | 4,096 | theblackcat102/evol-codealpaca-v1 |
| Reasoning (code) | ~2,680 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning (math) | ~2,778 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning (science) | ~2,776 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 4,096 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 4,096 | SWE-bench/SWE-smith-trajectories |
| + extended domains | ~1,478 | Scientific, CUDA kernels, browser, advanced math, code correctness |
Total tokens observed: ~90.5M across 6,391 packed sequences.
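The packing step above can be sketched as a greedy concatenation of tokenized samples into fixed-length sequences. This is a minimal sketch under assumed behavior (truncate overlong samples, start a new sequence when the next sample would not fit); the actual packing logic in the REAP pipeline may differ.

```python
def pack(sample_lens, seq_len=16_384):
    """Greedily pack tokenized-sample lengths into sequences of <= seq_len tokens.

    Returns the token count of each packed sequence.
    """
    sequences, current = [], 0
    for n in sample_lens:
        n = min(n, seq_len)            # truncate overlong samples
        if current + n > seq_len:      # next sample would overflow: flush
            sequences.append(current)
            current = 0
        current += n
    if current:
        sequences.append(current)
    return sequences

print(pack([6000, 9000, 4000, 16000, 2000]))  # [15000, 4000, 16000, 2000]
```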
## Pruning Configuration
| Parameter | Value |
|---|---|
| Compression ratio | 0.30 (30% expert removal) |
| Original experts per layer | 512 |
| Remaining experts per layer | 359 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
| Observation batch size | 8 |
| Calibration batches | 128 per category |
## Benchmark Results

10-task lm-eval suite, 200 samples per task, `tensor_parallel_size=4`, vLLM eager mode:
| Task | Metric | Original | REAP 0.30 | Delta |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 58.5% | 61.0% | +2.5 |
| BoolQ | acc | 93.0% | 90.0% | -3.0 |
| CommonsenseQA | acc | 89.0% | 85.5% | -3.5 |
| GSM8K | flexible_extract | 35.0% | 17.5% | -17.5 |
| HellaSwag | acc_norm | 72.0% | 63.5% | -8.5 |
| MathQA | acc_norm | 60.5% | 51.5% | -9.0 |
| OpenBookQA | acc_norm | 48.5% | 49.5% | +1.0 |
| PIQA | acc_norm | 80.0% | 79.0% | -1.0 |
| TruthfulQA MC2 | acc | 60.2% | 55.5% | -4.7 |
| WinoGrande | acc | 70.0% | 66.0% | -4.0 |
Aggregate:
- Overall average: 66.7% -> 61.9% (-4.8 pts)
- Reasoning average: 71.4% -> 68.8% (-2.6 pts)
- Math average: 47.8% -> 34.5% (-13.3 pts)
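The aggregates can be reproduced from the per-task table, assuming "math" covers GSM8K and MathQA and "reasoning" covers the remaining eight tasks (an inference that matches the reported values):

```python
# Per-task scores copied from the benchmark table above.
orig = {"arc": 58.5, "boolq": 93.0, "csqa": 89.0, "gsm8k": 35.0, "hellaswag": 72.0,
        "mathqa": 60.5, "obqa": 48.5, "piqa": 80.0, "truthfulqa": 60.2, "winogrande": 70.0}
reap = {"arc": 61.0, "boolq": 90.0, "csqa": 85.5, "gsm8k": 17.5, "hellaswag": 63.5,
        "mathqa": 51.5, "obqa": 49.5, "piqa": 79.0, "truthfulqa": 55.5, "winogrande": 66.0}

math_tasks = {"gsm8k", "mathqa"}
reasoning_tasks = set(orig) - math_tasks

def avg(scores, keys):
    return sum(scores[k] for k in keys) / len(keys)

# Matches the aggregates reported above: 66.7 -> 61.9 overall,
# 71.4 -> 68.8 reasoning, 47.8 -> 34.5 math.
print(f"overall:   {avg(orig, orig):.1f} -> {avg(reap, reap):.1f}")
print(f"reasoning: {avg(orig, reasoning_tasks):.1f} -> {avg(reap, reasoning_tasks):.1f}")
print(f"math:      {avg(orig, math_tasks):.1f} -> {avg(reap, math_tasks):.1f}")
```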
Note: GSM8K strict-match reports 0% for all variants due to an output formatting issue; flexible-extract scores are shown instead.
## Architecture
Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:
- Full attention every 4th layer (12 layers)
- Linear attention for remaining layers (36 layers)
- MoE FFN with 359 remaining experts per layer, 10 active per token
- Shared expert (intermediate size 512) in every layer
- Context window: 262,144 tokens
- Vocab size: 151,936
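The hybrid layout can be sanity-checked with a one-liner, assuming "every 4th layer" means layers 4, 8, 12, ... (1-indexed) use full attention; the exact placement within each block of four is an assumption:

```python
n_layers = 48

# Full attention every 4th layer, linear attention elsewhere (assumed indexing).
layer_types = ["full" if (i + 1) % 4 == 0 else "linear" for i in range(n_layers)]

print(layer_types.count("full"), layer_types.count("linear"))  # 12 36
```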
## Usage

### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/qwen3-coder-next-56b-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### vLLM

```shell
vllm serve 0xSero/qwen3-coder-next-56b-REAP \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
```
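Once the server is up, it exposes the standard OpenAI-compatible API. A minimal stdlib-only client sketch follows; the host, port, and sampling parameters are assumptions to adjust for your deployment.

```python
import json
import os
import urllib.request

# Assumed default endpoint for a local `vllm serve`; override via VLLM_BASE_URL.
BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")

payload = {
    "model": "0xSero/qwen3-coder-next-56b-REAP",
    "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
    "max_tokens": 512,
    "temperature": 0.2,
}

def chat(body: dict) -> dict:
    """POST a chat-completion request to the vLLM OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only hit the network when a server address is explicitly configured.
if os.environ.get("VLLM_BASE_URL"):
    print(chat(payload)["choices"][0]["message"]["content"])
```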
## Reproducing

```shell
git clone https://github.com/cerebras/reap
cd reap
python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.30 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128
```
## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
## Links

- REAP paper: https://arxiv.org/abs/2510.13999
- REAP code: https://github.com/cerebras/reap
- Cerebras REAP collection: https://huggingface.co/collections/cerebras/cerebras-reap
- Base model: Qwen/Qwen3-Coder-Next
- 20% pruned variant: 0xSero/qwen3-coder-next-64b-REAP