# Qwen3-Coder-Next 56B REAP

A 30% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

|                   | Original | This Model |
|-------------------|----------|------------|
| Total params      | ~80B     | 56.56B     |
| Experts           | 512      | 359        |
| Active params/tok | ~4.2B    | ~4.2B      |
| Experts/tok       | 10       | 10         |
| Format            | BF16     | BF16       |
| Disk size         | ~149 GB  | ~113 GB    |

REAP removes 30% of MoE experts (153 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged since the router still selects 10 experts per token from the remaining pool. This yields a ~24% reduction in total disk/memory footprint at the cost of moderate quality degradation, primarily in math tasks.
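The figures above can be sanity-checked with a few lines of arithmetic (illustrative only, not part of the release):

```python
# Sanity-check the pruning arithmetic quoted above.
total_experts = 512
remaining_experts = 359
pruned = total_experts - remaining_experts        # 153 experts removed per layer

disk_original_gb = 149
disk_pruned_gb = 113

print(f"experts pruned: {pruned} ({pruned / total_experts:.1%})")       # 153 (29.9%)
print(f"disk reduction: {1 - disk_pruned_gb / disk_original_gb:.1%}")   # 24.2%
```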

## Method

REAP (ICLR 2026) prunes Mixture-of-Experts models in one shot, without retraining, by scoring each expert's importance and removing the lowest-scoring experts. Key components:

1. Router gate values -- how often and how strongly the router selects each expert
2. Expert activation norms -- the magnitude of each expert's output contribution
3. Frequency-weighted saliency -- combines routing frequency with activation importance
4. Router logit renormalization -- maintains the output distribution after expert removal
5. Layerwise application -- independent per-layer pruning decisions for stability
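A minimal sketch of the frequency-weighted saliency idea (a simplified illustration; the exact scoring follows the paper, and all names here are made up for the example): each expert is scored by its routing frequency times its mean gate-weighted output norm on calibration tokens, and the lowest-scoring fraction is dropped per layer.

```python
import numpy as np

def reap_saliency(gate_weights, expert_out_norms):
    """Score experts by routing frequency times mean gate-weighted
    output norm over the tokens routed to each expert.

    gate_weights:     (tokens, experts), 0 where the expert was not selected
    expert_out_norms: (tokens, experts), 0 where the expert was not selected
    """
    selected = gate_weights > 0
    freq = selected.mean(axis=0)                         # routing frequency
    contrib = (gate_weights * expert_out_norms).sum(axis=0)
    contrib /= np.maximum(selected.sum(axis=0), 1)       # mean over routed tokens
    return freq * contrib

def prune_experts(saliency, compression_ratio=0.30):
    """Keep the highest-saliency experts; drop the bottom fraction."""
    n = len(saliency)
    n_keep = n - int(n * compression_ratio)
    return np.argsort(saliency)[::-1][:n_keep]           # indices of kept experts

# toy calibration pass: 10 of 512 experts active per token
rng = np.random.default_rng(42)
tokens, experts, k = 1024, 512, 10
gates = np.zeros((tokens, experts))
norms = np.zeros((tokens, experts))
for t in range(tokens):
    idx = rng.choice(experts, size=k, replace=False)
    w = rng.random(k)
    gates[t, idx] = w / w.sum()
    norms[t, idx] = rng.random(k)

kept = prune_experts(reap_saliency(gates, norms))
print(len(kept))  # 359 experts remain at 30% compression
```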

## Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:

| Category            | Samples | Source                                                         |
|---------------------|---------|----------------------------------------------------------------|
| Coding (general)    | 4,096   | theblackcat102/evol-codealpaca-v1                              |
| Reasoning (code)    | ~2,680  | open-r1/Mixture-of-Thoughts[code]                              |
| Reasoning (math)    | ~2,778  | open-r1/Mixture-of-Thoughts[math]                              |
| Reasoning (science) | ~2,776  | open-r1/Mixture-of-Thoughts[science]                           |
| Tool calling        | 4,096   | Salesforce/xlam-function-calling-60k                           |
| Agentic coding      | 4,096   | SWE-bench/SWE-smith-trajectories                               |
| + extended domains  | ~1,478  | Scientific, CUDA kernels, browser, advanced math, code correctness |

Total tokens observed: ~90.5M across 6,391 packed sequences.
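The packing step can be sketched as a greedy concatenator (an illustration, not the exact script used; note that ~90.5M tokens over 6,391 sequences averages ~14.2K tokens per sequence, which suggests per-category packing and/or padding -- an assumption, not a documented detail):

```python
def pack_sequences(token_lists, seq_len=16_384):
    """Greedily concatenate tokenized samples into fixed-length sequences;
    any trailing partial buffer is dropped."""
    packed, buf = [], []
    for toks in token_lists:
        buf.extend(toks)
        while len(buf) >= seq_len:
            packed.append(buf[:seq_len])
            buf = buf[seq_len:]
    return packed

# toy example: 22 tokens packed into 8-token sequences -> 2 full sequences
print(len(pack_sequences([[1] * 10, [2] * 12], seq_len=8)))  # 2
```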

## Pruning Configuration

| Parameter                     | Value                    |
|-------------------------------|--------------------------|
| Compression ratio             | 0.30 (30% expert removal) |
| Original experts per layer    | 512                      |
| Remaining experts per layer   | 359                      |
| Pruning method                | REAP                     |
| Distance measure              | Angular (cosine)         |
| Router weight renormalization | Yes                      |
| Seed                          | 42                       |
| Observation batch size        | 8                        |
| Calibration batches           | 128 per category         |
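Router weight renormalization keeps the selected gate weights summing to one after experts are removed. A sketch, assuming softmax-style top-k gating (the function and variable names are illustrative):

```python
import numpy as np

def route_with_renorm(router_logits, kept_experts, k=10):
    """Top-k routing restricted to surviving experts, with gate weights
    renormalized to sum to 1 after pruning."""
    logits = np.full_like(router_logits, -np.inf)
    logits[kept_experts] = router_logits[kept_experts]   # mask pruned experts
    topk = np.argsort(logits)[-k:]                       # top-k survivors
    w = np.exp(logits[topk] - logits[topk].max())        # stable softmax
    return topk, w / w.sum()                             # renormalized gates

rng = np.random.default_rng(0)
logits = rng.normal(size=512)
kept = np.arange(359)                 # e.g. the 359 surviving expert indices
experts, gates = route_with_renorm(logits, kept)
print(len(experts), round(float(gates.sum()), 6))  # 10 1.0
```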

## Benchmark Results

10-task lm-eval suite, 200 samples per task, `tensor_parallel_size=4`, vLLM eager mode:

| Task          | Metric           | Original | REAP 0.30 | Delta |
|---------------|------------------|----------|-----------|-------|
| ARC-Challenge | acc_norm         | 58.5%    | 61.0%     | +2.5  |
| BoolQ         | acc              | 93.0%    | 90.0%     | -3.0  |
| CommonsenseQA | acc              | 89.0%    | 85.5%     | -3.5  |
| GSM8K         | flexible_extract | 35.0%    | 17.5%     | -17.5 |
| HellaSwag     | acc_norm         | 72.0%    | 63.5%     | -8.5  |
| MathQA        | acc_norm         | 60.5%    | 51.5%     | -9.0  |
| OpenBookQA    | acc_norm         | 48.5%    | 49.5%     | +1.0  |
| PIQA          | acc_norm         | 80.0%    | 79.0%     | -1.0  |
| TruthfulQA MC2| acc              | 60.2%    | 55.5%     | -4.7  |
| WinoGrande    | acc              | 70.0%    | 66.0%     | -4.0  |

Aggregate:

- Overall average: 66.7% -> 61.9% (-4.8 pts)
- Reasoning average: 71.4% -> 68.8% (-2.6 pts)
- Math average: 47.8% -> 34.5% (-13.3 pts)
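The aggregates can be recomputed from the per-task table; the reported numbers are consistent with "math" meaning GSM8K + MathQA and "reasoning" meaning the other eight tasks:

```python
original = {"arc": 58.5, "boolq": 93.0, "csqa": 89.0, "gsm8k": 35.0,
            "hellaswag": 72.0, "mathqa": 60.5, "obqa": 48.5,
            "piqa": 80.0, "truthfulqa": 60.2, "winogrande": 70.0}
reap = {"arc": 61.0, "boolq": 90.0, "csqa": 85.5, "gsm8k": 17.5,
        "hellaswag": 63.5, "mathqa": 51.5, "obqa": 49.5,
        "piqa": 79.0, "truthfulqa": 55.5, "winogrande": 66.0}

math_tasks = {"gsm8k", "mathqa"}

def avg(scores, tasks):
    return sum(scores[t] for t in tasks) / len(tasks)

for name, s in [("original", original), ("REAP 0.30", reap)]:
    print(f"{name}: overall {avg(s, s):.1f}, "
          f"reasoning {avg(s, set(s) - math_tasks):.1f}, "
          f"math {avg(s, math_tasks):.1f}")
# original: overall 66.7, reasoning 71.4, math 47.8
# REAP 0.30: overall 61.9, reasoning 68.8, math 34.5
```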

Note: GSM8K strict-match reports 0% for all variants due to an output formatting issue; flexible-extract scores are shown instead.
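Flexible-extract scoring typically accepts the last number-like token in the generation instead of requiring the strict `#### <answer>` format. A rough stand-in (not lm-eval's exact filter or regex):

```python
import re

def flexible_extract(generation):
    """Return the last number-like token in a model generation,
    ignoring surrounding formatting."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    return matches[-1].replace(",", "") if matches else None

print(flexible_extract("...so the total is 1,234 apples."))  # 1234
```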

## Architecture

Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:

- Full attention every 4th layer (12 layers)
- Linear attention in the remaining layers (36 layers)
- MoE FFN with 359 remaining experts per layer, 10 active per token
- Shared expert (intermediate size 512) in every layer
- Context window: 262,144 tokens
- Vocab size: 151,936
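The layer layout implies a repeating 4-layer block; a sketch of the pattern, assuming the full-attention layers fall at 0-indexed positions 3, 7, ..., 47 (the exact indices are an assumption, not confirmed by the config):

```python
NUM_LAYERS = 48

def layer_type(i):
    # full attention on every 4th layer (assumed 0-indexed: 3, 7, ..., 47)
    return "full_attention" if (i + 1) % 4 == 0 else "linear_attention"

layout = [layer_type(i) for i in range(NUM_LAYERS)]
print(layout.count("full_attention"), layout.count("linear_attention"))  # 12 36
```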

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/qwen3-coder-next-56b-REAP"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve 0xSero/qwen3-coder-next-56b-REAP \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
```

## Reproducing

```bash
git clone https://github.com/cerebras/reap
cd reap

python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.30 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
