---
language:
  - en
license: apache-2.0
tags:
  - safetensors
  - qwen3_next
  - moe
  - pruning
  - reap
  - cerebras
  - expert-pruning
  - qwen3-coder
base_model:
  - Qwen/Qwen3-Coder-Next
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: qwen3-coder-next-64b-REAP
    results:
      - task:
          type: text-generation
        dataset:
          name: ARC-Challenge
          type: arc_challenge
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 64
      - task:
          type: text-generation
        dataset:
          name: BoolQ
          type: boolq
        metrics:
          - name: accuracy
            type: acc
            value: 91
      - task:
          type: text-generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 66
      - task:
          type: text-generation
        dataset:
          name: WinoGrande
          type: winogrande
        metrics:
          - name: accuracy
            type: acc
            value: 70
      - task:
          type: text-generation
        dataset:
          name: PIQA
          type: piqa
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 80.5
      - task:
          type: text-generation
        dataset:
          name: CommonsenseQA
          type: commonsense_qa
        metrics:
          - name: accuracy
            type: acc
            value: 88
      - task:
          type: text-generation
        dataset:
          name: TruthfulQA MC2
          type: truthfulqa_mc2
        metrics:
          - name: accuracy
            type: acc
            value: 55.2
      - task:
          type: text-generation
        dataset:
          name: OpenBookQA
          type: openbookqa
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 49
      - task:
          type: text-generation
        dataset:
          name: MathQA
          type: mathqa
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 53.5
      - task:
          type: text-generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - name: flexible_extract
            type: exact_match
            value: 28.5
---

Support this work: donate.sybilsolutions.ai

REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection

# Qwen3-Coder-Next 64B REAP

A 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

|                   | Original | This Model |
|-------------------|----------|------------|
| Total params      | ~80B     | 64.26B     |
| Experts           | 512      | 410        |
| Active params/tok | ~4.2B    | ~4.2B      |
| Experts/tok       | 10       | 10         |
| Format            | BF16     | BF16       |
| Disk size         | ~149 GB  | ~129 GB    |

REAP removes 20% of MoE experts (102 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged since the router still selects 10 experts per token from the remaining pool. This yields a ~14% reduction in total disk/memory footprint with minimal quality loss.
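The headline numbers above follow from a few lines of arithmetic (a throwaway sketch; the constants are copied from the table):

```python
# Illustrative arithmetic for the pruning figures quoted above.
TOTAL_EXPERTS = 512
COMPRESSION_RATIO = 0.20
ACTIVE_EXPERTS_PER_TOKEN = 10  # unchanged by pruning

pruned = round(TOTAL_EXPERTS * COMPRESSION_RATIO)  # 102 experts removed
remaining = TOTAL_EXPERTS - pruned                 # 410 experts kept

# The router still selects 10 experts per token, so active params/token stay
# the same; only the total expert pool (and so disk/memory) shrinks.
disk_original_gb, disk_pruned_gb = 149, 129
reduction = 1 - disk_pruned_gb / disk_original_gb  # roughly 13-14%

print(pruned, remaining, f"{reduction:.1%}")
```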

## Method

REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring expert importance using:

  1. Router gate values -- how often and how strongly the router selects each expert
  2. Expert activation norms -- magnitude of each expert's output contribution
  3. Frequency-weighted saliency -- combining routing frequency with activation importance
  4. Router logit renormalization -- maintains output distribution after expert removal
  5. Layerwise application -- independent per-layer pruning decisions for stability
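The scoring and renormalization steps above can be sketched in a few lines of NumPy. This is a toy illustration in the spirit of REAP, not the paper's implementation; the exact saliency formula and all variable names here are assumptions.

```python
import numpy as np

# Toy single-layer example: score experts by router-weighted activation
# magnitude, prune the least salient 20%, then renormalize router weights.
rng = np.random.default_rng(42)
n_tokens, n_experts, top_k = 1000, 16, 2

gate = rng.random((n_tokens, n_experts))      # router gate values per token
out_norm = rng.random((n_tokens, n_experts))  # ||expert output|| per token

# Top-k routing mask: which experts each token actually activates.
topk_idx = np.argsort(gate, axis=1)[:, -top_k:]
mask = np.zeros_like(gate, dtype=bool)
np.put_along_axis(mask, topk_idx, True, axis=1)

# Frequency-weighted saliency: mean gate-weighted output magnitude over the
# tokens routed to each expert (guard against division by zero).
saliency = (gate * out_norm * mask).sum(axis=0) / np.maximum(mask.sum(axis=0), 1)

# Remove the 20% least salient experts in this layer.
n_prune = int(n_experts * 0.20)
pruned = np.argsort(saliency)[:n_prune]
keep = np.setdiff1d(np.arange(n_experts), pruned)

# Router renormalization: rescale gates over the kept experts so each
# token's routing weights still sum to one after removal.
kept_gate = gate[:, keep]
kept_gate = kept_gate / kept_gate.sum(axis=1, keepdims=True)
print(len(keep), kept_gate.sum(axis=1)[:3])
```

In the real method this is applied layerwise, so each layer drops its own least-salient experts independently.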

## Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:

| Category            | Samples | Source                                                          |
|---------------------|---------|-----------------------------------------------------------------|
| Coding (general)    | 4,096   | theblackcat102/evol-codealpaca-v1                               |
| Reasoning (code)    | ~2,680  | open-r1/Mixture-of-Thoughts [code]                              |
| Reasoning (math)    | ~2,778  | open-r1/Mixture-of-Thoughts [math]                              |
| Reasoning (science) | ~2,776  | open-r1/Mixture-of-Thoughts [science]                           |
| Tool calling        | 4,096   | Salesforce/xlam-function-calling-60k                            |
| Agentic coding      | 4,096   | SWE-bench/SWE-smith-trajectories                                |
| + extended domains  | ~1,478  | Scientific, CUDA kernels, browser, advanced math, code correctness |

Total tokens observed: ~90.5M across 6,391 packed sequences.
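Packing to a fixed sequence length works by concatenating tokenized samples and cutting the stream into full-length chunks. A minimal sketch (hypothetical helper, not the actual REAP data pipeline):

```python
# Pack tokenized calibration samples into fixed-length sequences.
SEQ_LEN = 16_384

def pack(samples, seq_len=SEQ_LEN):
    """Concatenate token-id lists and cut the stream into seq_len chunks,
    dropping the trailing remainder so every packed sequence is full."""
    flat = [tok for sample in samples for tok in sample]
    n_full = len(flat) // seq_len
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Toy example: three "tokenized" samples totalling 40,000 tokens pack into
# two full 16,384-token sequences (the remainder is discarded).
toy = [[1] * 10_000, [2] * 10_000, [3] * 20_000]
packed = pack(toy)
print(len(packed), all(len(s) == SEQ_LEN for s in packed))
```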

## Pruning Configuration

| Parameter                     | Value                    |
|-------------------------------|--------------------------|
| Compression ratio             | 0.20 (20% expert removal) |
| Original experts per layer    | 512                      |
| Remaining experts per layer   | 410                      |
| Pruning method                | REAP                     |
| Distance measure              | Angular (cosine)         |
| Router weight renormalization | Yes                      |
| Seed                          | 42                       |
| Observation batch size        | 8                        |
| Calibration batches           | 128 per category         |

## Benchmark Results

10-task lm-eval suite, 200 samples per task, `tensor_parallel_size=4`, vLLM eager mode:

| Task           | Metric           | Original | REAP 0.20 | Delta |
|----------------|------------------|----------|-----------|-------|
| ARC-Challenge  | acc_norm         | 58.5%    | 64.0%     | +5.5  |
| BoolQ          | acc              | 93.0%    | 91.0%     | -2.0  |
| CommonsenseQA  | acc              | 89.0%    | 88.0%     | -1.0  |
| GSM8K          | flexible_extract | 35.0%    | 28.5%     | -6.5  |
| HellaSwag      | acc_norm         | 72.0%    | 66.0%     | -6.0  |
| MathQA         | acc_norm         | 60.5%    | 53.5%     | -7.0  |
| OpenBookQA     | acc_norm         | 48.5%    | 49.0%     | +0.5  |
| PIQA           | acc_norm         | 80.0%    | 80.5%     | +0.5  |
| TruthfulQA MC2 | acc              | 60.2%    | 55.2%     | -5.0  |
| WinoGrande     | acc              | 70.0%    | 70.0%     | +0.0  |

Aggregate:

  • Overall average: 66.7% -> 64.6% (-2.1 pts)
  • Reasoning average: 71.4% -> 70.5% (-0.9 pts)
  • Math average: 47.8% -> 41.0% (-6.8 pts)
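The aggregates can be recomputed from the per-task table; grouping GSM8K and MathQA as "math" and the other eight tasks as "reasoning" reproduces the stated numbers (the grouping itself is an inference from the table, not stated in the source):

```python
# Recompute the aggregate averages from the per-task benchmark scores.
scores = {                      # task: (original, REAP 0.20)
    "arc_challenge": (58.5, 64.0), "boolq": (93.0, 91.0),
    "commonsense_qa": (89.0, 88.0), "gsm8k": (35.0, 28.5),
    "hellaswag": (72.0, 66.0), "mathqa": (60.5, 53.5),
    "openbookqa": (48.5, 49.0), "piqa": (80.0, 80.5),
    "truthfulqa_mc2": (60.2, 55.2), "winogrande": (70.0, 70.0),
}
math_tasks = {"gsm8k", "mathqa"}

def avg(values):
    values = list(values)
    return sum(values) / len(values)

overall = tuple(avg(v[i] for v in scores.values()) for i in (0, 1))
math_avg = tuple(avg(scores[t][i] for t in math_tasks) for i in (0, 1))
reasoning = tuple(avg(v[i] for t, v in scores.items() if t not in math_tasks)
                  for i in (0, 1))
print(f"overall {overall[0]:.1f} -> {overall[1]:.1f}, "
      f"reasoning {reasoning[0]:.1f} -> {reasoning[1]:.1f}, "
      f"math {math_avg[0]:.1f} -> {math_avg[1]:.1f}")
```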

## Architecture

Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:

  • Full attention every 4th layer (12 layers)
  • Linear attention for remaining layers (36 layers)
  • MoE FFN with 410 remaining experts per layer, 10 active per token
  • Shared expert (intermediate size 512) in every layer
  • Context window: 262,144 tokens
  • Vocab size: 151,936
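The attention layout above is easy to verify: with full attention on every 4th layer of a 48-layer stack, 12 layers are full attention and 36 are linear. A quick sketch:

```python
# Check the hybrid attention layout: full attention every 4th layer.
n_layers = 48
full = [i for i in range(n_layers) if (i + 1) % 4 == 0]    # layers 4, 8, ...
linear = [i for i in range(n_layers) if (i + 1) % 4 != 0]  # the rest
print(len(full), len(linear))
```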

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/qwen3-coder-next-64b-REAP"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve 0xSero/qwen3-coder-next-64b-REAP \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
```

## Reproducing

```bash
git clone https://github.com/cerebras/reap
cd reap

python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.20 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```

## Sponsors

Thanks to the kind sponsors who made this work possible:

  • Nvidia
  • TNG Technology
  • Lambda
  • Prime Intellect
  • HotAisle