---
language:
  - en
license: apache-2.0
tags:
  - safetensors
  - qwen3_next
  - moe
  - pruning
  - reap
  - cerebras
  - expert-pruning
  - qwen3-coder
base_model:
  - Qwen/Qwen3-Coder-Next
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: qwen3-coder-next-64b-REAP
    results:
      - task:
          type: text-generation
        dataset:
          name: ARC-Challenge
          type: arc_challenge
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 64
      - task:
          type: text-generation
        dataset:
          name: BoolQ
          type: boolq
        metrics:
          - name: accuracy
            type: acc
            value: 91
      - task:
          type: text-generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 66
      - task:
          type: text-generation
        dataset:
          name: WinoGrande
          type: winogrande
        metrics:
          - name: accuracy
            type: acc
            value: 70
      - task:
          type: text-generation
        dataset:
          name: PIQA
          type: piqa
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 80.5
      - task:
          type: text-generation
        dataset:
          name: CommonsenseQA
          type: commonsense_qa
        metrics:
          - name: accuracy
            type: acc
            value: 88
      - task:
          type: text-generation
        dataset:
          name: TruthfulQA MC2
          type: truthfulqa_mc2
        metrics:
          - name: accuracy
            type: acc
            value: 55.2
      - task:
          type: text-generation
        dataset:
          name: OpenBookQA
          type: openbookqa
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 49
      - task:
          type: text-generation
        dataset:
          name: MathQA
          type: mathqa
        metrics:
          - name: acc_norm
            type: acc_norm
            value: 53.5
      - task:
          type: text-generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - name: flexible_extract
            type: exact_match
            value: 28.5
---

Support this work: donate.sybilsolutions.ai

REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection

# Qwen3-Coder-Next 64B REAP

A 20% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).

|                   | Original | This Model |
|-------------------|----------|------------|
| Total params      | ~80B     | 64.26B     |
| Experts           | 512      | 410        |
| Active params/tok | ~4.2B    | ~4.2B      |
| Experts/tok       | 10       | 10         |
| Format            | BF16     | BF16       |
| Disk size         | ~149 GB  | ~129 GB    |

REAP removes 20% of MoE experts (102 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged since the router still selects 10 experts per token from the remaining pool. This yields a ~14% reduction in total disk/memory footprint with minimal quality loss.
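The headline numbers above follow from a few lines of arithmetic (a throwaway sketch; the constants are copied from the table):

```python
# Illustrative arithmetic for the pruning figures quoted above.
TOTAL_EXPERTS = 512
COMPRESSION_RATIO = 0.20
ACTIVE_EXPERTS_PER_TOKEN = 10  # unchanged by pruning

pruned = round(TOTAL_EXPERTS * COMPRESSION_RATIO)  # 102 experts removed
remaining = TOTAL_EXPERTS - pruned                 # 410 experts kept

# The router still selects 10 experts per token, so active params/token stay
# the same; only the total expert pool (and so disk/memory) shrinks.
disk_original_gb, disk_pruned_gb = 149, 129
reduction = 1 - disk_pruned_gb / disk_original_gb  # roughly 13-14%

print(pruned, remaining, f"{reduction:.1%}")
```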

## Method

REAP (ICLR 2026) prunes Mixture-of-Experts models by scoring expert importance using:

  1. Router gate values -- how often and how strongly the router selects each expert
  2. Expert activation norms -- magnitude of each expert's output contribution
  3. Frequency-weighted saliency -- combining routing frequency with activation importance
  4. Router logit renormalization -- maintains output distribution after expert removal
  5. Layerwise application -- independent per-layer pruning decisions for stability
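The scoring and renormalization steps above can be sketched in a few lines of NumPy. This is a toy illustration in the spirit of REAP, not the paper's implementation; the exact saliency formula and all variable names here are assumptions.

```python
import numpy as np

# Toy single-layer example: score experts by router-weighted activation
# magnitude, prune the least salient 20%, then renormalize router weights.
rng = np.random.default_rng(42)
n_tokens, n_experts, top_k = 1000, 16, 2

gate = rng.random((n_tokens, n_experts))      # router gate values per token
out_norm = rng.random((n_tokens, n_experts))  # ||expert output|| per token

# Top-k routing mask: which experts each token actually activates.
topk_idx = np.argsort(gate, axis=1)[:, -top_k:]
mask = np.zeros_like(gate, dtype=bool)
np.put_along_axis(mask, topk_idx, True, axis=1)

# Frequency-weighted saliency: mean gate-weighted output magnitude over the
# tokens routed to each expert (guard against division by zero).
saliency = (gate * out_norm * mask).sum(axis=0) / np.maximum(mask.sum(axis=0), 1)

# Remove the 20% least salient experts in this layer.
n_prune = int(n_experts * 0.20)
pruned = np.argsort(saliency)[:n_prune]
keep = np.setdiff1d(np.arange(n_experts), pruned)

# Router renormalization: rescale gates over the kept experts so each
# token's routing weights still sum to one after removal.
kept_gate = gate[:, keep]
kept_gate = kept_gate / kept_gate.sum(axis=1, keepdims=True)
print(len(keep), kept_gate.sum(axis=1)[:3])
```

In the real method this is applied layerwise, so each layer drops its own least-salient experts independently.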

## Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:

| Category            | Samples | Source                                                          |
|---------------------|---------|-----------------------------------------------------------------|
| Coding (general)    | 4,096   | theblackcat102/evol-codealpaca-v1                               |
| Reasoning (code)    | ~2,680  | open-r1/Mixture-of-Thoughts [code]                              |
| Reasoning (math)    | ~2,778  | open-r1/Mixture-of-Thoughts [math]                              |
| Reasoning (science) | ~2,776  | open-r1/Mixture-of-Thoughts [science]                           |
| Tool calling        | 4,096   | Salesforce/xlam-function-calling-60k                            |
| Agentic coding      | 4,096   | SWE-bench/SWE-smith-trajectories                                |
| + extended domains  | ~1,478  | Scientific, CUDA kernels, browser, advanced math, code correctness |

Total tokens observed: ~90.5M across 6,391 packed sequences.
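Packing to a fixed sequence length works by concatenating tokenized samples and cutting the stream into full-length chunks. A minimal sketch (hypothetical helper, not the actual REAP data pipeline):

```python
# Pack tokenized calibration samples into fixed-length sequences.
SEQ_LEN = 16_384

def pack(samples, seq_len=SEQ_LEN):
    """Concatenate token-id lists and cut the stream into seq_len chunks,
    dropping the trailing remainder so every packed sequence is full."""
    flat = [tok for sample in samples for tok in sample]
    n_full = len(flat) // seq_len
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Toy example: three "tokenized" samples totalling 40,000 tokens pack into
# two full 16,384-token sequences (the remainder is discarded).
toy = [[1] * 10_000, [2] * 10_000, [3] * 20_000]
packed = pack(toy)
print(len(packed), all(len(s) == SEQ_LEN for s in packed))
```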

## Pruning Configuration

| Parameter                     | Value                    |
|-------------------------------|--------------------------|
| Compression ratio             | 0.20 (20% expert removal) |
| Original experts per layer    | 512                      |
| Remaining experts per layer   | 410                      |
| Pruning method                | REAP                     |
| Distance measure              | Angular (cosine)         |
| Router weight renormalization | Yes                      |
| Seed                          | 42                       |
| Observation batch size        | 8                        |
| Calibration batches           | 128 per category         |

## Benchmark Results

10-task lm-eval suite, 200 samples per task, `tensor_parallel_size=4`, vLLM eager mode:

| Task           | Metric           | Original | REAP 0.20 | Delta |
|----------------|------------------|----------|-----------|-------|
| ARC-Challenge  | acc_norm         | 58.5%    | 64.0%     | +5.5  |
| BoolQ          | acc              | 93.0%    | 91.0%     | -2.0  |
| CommonsenseQA  | acc              | 89.0%    | 88.0%     | -1.0  |
| GSM8K          | flexible_extract | 35.0%    | 28.5%     | -6.5  |
| HellaSwag      | acc_norm         | 72.0%    | 66.0%     | -6.0  |
| MathQA         | acc_norm         | 60.5%    | 53.5%     | -7.0  |
| OpenBookQA     | acc_norm         | 48.5%    | 49.0%     | +0.5  |
| PIQA           | acc_norm         | 80.0%    | 80.5%     | +0.5  |
| TruthfulQA MC2 | acc              | 60.2%    | 55.2%     | -5.0  |
| WinoGrande     | acc              | 70.0%    | 70.0%     | +0.0  |

Aggregate:

  • Overall average: 66.7% -> 64.6% (-2.1 pts)
  • Reasoning average: 71.4% -> 70.5% (-0.9 pts)
  • Math average: 47.8% -> 41.0% (-6.8 pts)
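The aggregates can be recomputed from the per-task table; grouping GSM8K and MathQA as "math" and the other eight tasks as "reasoning" reproduces the stated numbers (the grouping itself is an inference from the table, not stated in the source):

```python
# Recompute the aggregate averages from the per-task benchmark scores.
scores = {                      # task: (original, REAP 0.20)
    "arc_challenge": (58.5, 64.0), "boolq": (93.0, 91.0),
    "commonsense_qa": (89.0, 88.0), "gsm8k": (35.0, 28.5),
    "hellaswag": (72.0, 66.0), "mathqa": (60.5, 53.5),
    "openbookqa": (48.5, 49.0), "piqa": (80.0, 80.5),
    "truthfulqa_mc2": (60.2, 55.2), "winogrande": (70.0, 70.0),
}
math_tasks = {"gsm8k", "mathqa"}

def avg(values):
    values = list(values)
    return sum(values) / len(values)

overall = tuple(avg(v[i] for v in scores.values()) for i in (0, 1))
math_avg = tuple(avg(scores[t][i] for t in math_tasks) for i in (0, 1))
reasoning = tuple(avg(v[i] for t, v in scores.items() if t not in math_tasks)
                  for i in (0, 1))
print(f"overall {overall[0]:.1f} -> {overall[1]:.1f}, "
      f"reasoning {reasoning[0]:.1f} -> {reasoning[1]:.1f}, "
      f"math {math_avg[0]:.1f} -> {math_avg[1]:.1f}")
```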

## Architecture

Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:

  • Full attention every 4th layer (12 layers)
  • Linear attention for remaining layers (36 layers)
  • MoE FFN with 410 remaining experts per layer, 10 active per token
  • Shared expert (intermediate size 512) in every layer
  • Context window: 262,144 tokens
  • Vocab size: 151,936
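The attention layout above is easy to verify: with full attention on every 4th layer of a 48-layer stack, 12 layers are full attention and 36 are linear. A quick sketch:

```python
# Check the hybrid attention layout: full attention every 4th layer.
n_layers = 48
full = [i for i in range(n_layers) if (i + 1) % 4 == 0]    # layers 4, 8, ...
linear = [i for i in range(n_layers) if (i + 1) % 4 != 0]  # the rest
print(len(full), len(linear))
```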

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/qwen3-coder-next-64b-REAP"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve 0xSero/qwen3-coder-next-64b-REAP \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
```

## Reproducing

```bash
git clone https://github.com/cerebras/reap
cd reap

python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.20 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128
```

## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```

## Sponsors

Thanks to the kind sponsors who made this work possible:

  • Nvidia
  • TNG Technology
  • Lambda
  • Prime Intellect
  • HotAisle