language:
- en
license: apache-2.0
tags:
- safetensors
- qwen3_next
- moe
- pruning
- reap
- cerebras
- expert-pruning
- qwen3-coder
base_model:
- Qwen/Qwen3-Coder-Next
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: qwen3-coder-next-64b-REAP
results:
- task:
type: text-generation
dataset:
name: ARC-Challenge
type: arc_challenge
metrics:
- name: acc_norm
type: acc_norm
value: 64
- task:
type: text-generation
dataset:
name: BoolQ
type: boolq
metrics:
- name: accuracy
type: acc
value: 91
- task:
type: text-generation
dataset:
name: HellaSwag
type: hellaswag
metrics:
- name: acc_norm
type: acc_norm
value: 66
- task:
type: text-generation
dataset:
name: WinoGrande
type: winogrande
metrics:
- name: accuracy
type: acc
value: 70
- task:
type: text-generation
dataset:
name: PIQA
type: piqa
metrics:
- name: acc_norm
type: acc_norm
value: 80.5
- task:
type: text-generation
dataset:
name: CommonsenseQA
type: commonsense_qa
metrics:
- name: accuracy
type: acc
value: 88
- task:
type: text-generation
dataset:
name: TruthfulQA MC2
type: truthfulqa_mc2
metrics:
- name: accuracy
type: acc
value: 55.2
- task:
type: text-generation
dataset:
name: OpenBookQA
type: openbookqa
metrics:
- name: acc_norm
type: acc_norm
value: 49
- task:
type: text-generation
dataset:
name: MathQA
type: mathqa
metrics:
- name: acc_norm
type: acc_norm
value: 53.5
- task:
type: text-generation
dataset:
name: GSM8K
type: gsm8k
metrics:
- name: flexible_extract
type: exact_match
value: 28.5
Support this work: donate.sybilsolutions.ai

REAP resources: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection

# Qwen3-Coder-Next 64B REAP
A 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | This Model |
|---|---|---|
| Total params | ~80B | 64.26B |
| Experts | 512 | 410 |
| Active params/tok | ~4.2B | ~4.2B |
| Experts/tok | 10 | 10 |
| Format | BF16 | BF16 |
| Disk size | ~149 GB | ~129 GB |
REAP removes 20% of MoE experts (102 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged since the router still selects 10 experts per token from the remaining pool. This yields a ~14% reduction in total disk/memory footprint with minimal quality loss.
## Method

REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring expert importance using:
- Router gate values -- how often and how strongly the router selects each expert
- Expert activation norms -- magnitude of each expert's output contribution
- Frequency-weighted saliency -- combining routing frequency with activation importance
- Router logit renormalization -- maintains output distribution after expert removal
- Layerwise application -- independent per-layer pruning decisions for stability
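The scoring signals above can be combined into a minimal, self-contained sketch. This is an illustration of the idea, not the Cerebras implementation; the function names and the exact saliency formula are assumptions:

```python
import numpy as np

def reap_saliency(gate_probs, expert_out_norms):
    """Router-weighted expert saliency: mean over tokens of
    (gate weight x expert output magnitude). Tokens not routed to an
    expert contribute zero gate weight, so routing frequency is baked in."""
    return (gate_probs * expert_out_norms).mean(axis=0)

def prune_experts(gate_probs, expert_out_norms, compression_ratio=0.20):
    """Drop the lowest-saliency fraction of experts in one layer
    (layerwise application: each layer is pruned independently)."""
    saliency = reap_saliency(gate_probs, expert_out_norms)
    n_experts = saliency.shape[0]
    n_drop = int(round(n_experts * compression_ratio))
    drop = np.argsort(saliency)[:n_drop]          # least salient experts
    keep = np.setdiff1d(np.arange(n_experts), drop)
    return keep, drop

def renormalize_router(gate_probs, keep):
    """Renormalize the surviving gate weights per token so the routing
    distribution over the remaining experts still sums to 1."""
    kept = gate_probs[:, keep]
    return kept / np.clip(kept.sum(axis=1, keepdims=True), 1e-9, None)
```

At a 0.20 ratio on 512 experts this drops 102 experts per layer, matching the counts reported below.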
## Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:
| Category | Samples | Source |
|---|---|---|
| Coding (general) | 4,096 | theblackcat102/evol-codealpaca-v1 |
| Reasoning (code) | ~2,680 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning (math) | ~2,778 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning (science) | ~2,776 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 4,096 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 4,096 | SWE-bench/SWE-smith-trajectories |
| + extended domains | ~1,478 | Scientific, CUDA kernels, browser, advanced math, code correctness |
Total tokens observed: ~90.5M across 6,391 packed sequences.
## Pruning Configuration
| Parameter | Value |
|---|---|
| Compression ratio | 0.20 (20% expert removal) |
| Original experts per layer | 512 |
| Remaining experts per layer | 410 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
| Observation batch size | 8 |
| Calibration batches | 128 per category |
## Benchmark Results

10-task lm-eval suite, 200 samples per task, `tensor_parallel_size=4`, vLLM eager mode:
| Task | Metric | Original | REAP 0.20 | Delta |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 58.5% | 64.0% | +5.5 |
| BoolQ | acc | 93.0% | 91.0% | -2.0 |
| CommonsenseQA | acc | 89.0% | 88.0% | -1.0 |
| GSM8K | flexible_extract | 35.0% | 28.5% | -6.5 |
| HellaSwag | acc_norm | 72.0% | 66.0% | -6.0 |
| MathQA | acc_norm | 60.5% | 53.5% | -7.0 |
| OpenBookQA | acc_norm | 48.5% | 49.0% | +0.5 |
| PIQA | acc_norm | 80.0% | 80.5% | +0.5 |
| TruthfulQA MC2 | acc | 60.2% | 55.2% | -5.0 |
| WinoGrande | acc | 70.0% | 70.0% | +0.0 |
Aggregate:
- Overall average: 66.7% -> 64.6% (-2.1 pts)
- Reasoning average: 71.4% -> 70.5% (-0.9 pts)
- Math average: 47.8% -> 41.0% (-6.8 pts)
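The aggregate figures can be reproduced from the per-task table. The sketch below assumes the grouping math = {GSM8K, MathQA} and reasoning = the remaining eight tasks, which is the split that matches the reported numbers:

```python
original = {"arc_challenge": 58.5, "boolq": 93.0, "commonsense_qa": 89.0,
            "gsm8k": 35.0, "hellaswag": 72.0, "mathqa": 60.5,
            "openbookqa": 48.5, "piqa": 80.0, "truthfulqa_mc2": 60.2,
            "winogrande": 70.0}
reap = {"arc_challenge": 64.0, "boolq": 91.0, "commonsense_qa": 88.0,
        "gsm8k": 28.5, "hellaswag": 66.0, "mathqa": 53.5,
        "openbookqa": 49.0, "piqa": 80.5, "truthfulqa_mc2": 55.2,
        "winogrande": 70.0}

def avg(scores, tasks):
    """Unweighted mean over the given task names."""
    return sum(scores[t] for t in tasks) / len(tasks)

math_tasks = ["gsm8k", "mathqa"]
reasoning_tasks = [t for t in original if t not in math_tasks]

print(f"overall:   {avg(original, original):.1f} -> {avg(reap, reap):.1f}")
print(f"reasoning: {avg(original, reasoning_tasks):.1f} -> {avg(reap, reasoning_tasks):.1f}")
print(f"math:      {avg(original, math_tasks):.1f} -> {avg(reap, math_tasks):.1f}")
```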
## Architecture

Qwen3-Coder-Next uses a hybrid linear/full-attention architecture with 48 layers:
- Full attention every 4th layer (12 layers)
- Linear attention for remaining layers (36 layers)
- MoE FFN with 410 remaining experts per layer, 10 active per token
- Shared expert (intermediate size 512) in every layer
- Context window: 262,144 tokens
- Vocab size: 151,936
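As a quick sanity check of the layer layout and expert counts (the exact layer-index offset for full attention is an assumption; the released config may place it differently):

```python
NUM_LAYERS = 48
ORIGINAL_EXPERTS = 512
REMAINING_EXPERTS = 410
TOP_K = 10  # experts routed per token, unchanged by pruning

# Assumed pattern: full attention on every 4th layer, linear attention elsewhere.
full_attn = [i for i in range(NUM_LAYERS) if (i + 1) % 4 == 0]
linear_attn = [i for i in range(NUM_LAYERS) if (i + 1) % 4 != 0]

print(len(full_attn), len(linear_attn))      # 12 36
print(ORIGINAL_EXPERTS - REMAINING_EXPERTS)  # 102 experts pruned per layer
print(TOP_K / REMAINING_EXPERTS)             # fraction of experts active per token
```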
## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/qwen3-coder-next-64b-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### vLLM

```shell
vllm serve 0xSero/qwen3-coder-next-64b-REAP \
  --tensor-parallel-size 4 \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```
## Reproducing

```shell
git clone https://github.com/cerebras/reap
cd reap
python -m reap.layerwise_prune \
  --model-name Qwen/Qwen3-Coder-Next \
  --dataset-name combined \
  --compression-ratio 0.20 \
  --prune-method reap \
  --seed 42 \
  --renormalize_router_weights true \
  --batch_size 8 \
  --batches_per_category 128
```
## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
## Links
- REAP paper: arxiv.org/abs/2510.13999
- REAP code: github.com/cerebras/reap
- Cerebras REAP collection: huggingface.co/collections/cerebras/cerebras-reap
- Base model: Qwen/Qwen3-Coder-Next
- 30% pruned variant: 0xSero/qwen3-coder-next-56b-REAP
## Sponsors

Thanks to our kind sponsors; this work wouldn't be possible without them:
- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle