Qwen3-Coder-Next-REAP-AWQ

Expert-pruned and AWQ-quantized Qwen3-Coder-Next. 20% of MoE experts removed via REAP saliency analysis across diverse calibration data, then quantized to W4A16. The result is a model that runs ~7-12% faster at the token level, uses significantly less VRAM, and frees up memory for larger KV caches and higher concurrency -- at the cost of occasional quality regressions on certain tasks.

Status: Research / Experimental. This model produces usable output across code, math, reasoning, and general tasks, but some outputs may be less polished than the unpruned baseline -- particularly around structured output formatting and multi-step logic chains. It works. It's faster. It's not perfect. See Limitations for specifics.

Why This Exists

Qwen3-Coder-Next is a large Mixture-of-Experts model with 512 experts per layer, but only 10 are active per token. That means ~98% of expert parameters sit idle for any given input. This creates an opportunity: measure which experts matter least across a diverse workload and remove them.

Fewer experts means:

  • Smaller model (~32 GB vs ~37 GB for unpruned AWQ) -- 5 GB freed for KV cache
  • Faster inference -- less memory bandwidth pressure, fewer expert weights to page
  • Higher concurrency -- more VRAM headroom for batched requests

The trade-off is a small quality hit on some tasks, which we document below.

Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen/Qwen3-Coder-Next (BF16, 149 GB) |
| Architecture | Qwen3NextForCausalLM (MoE + Gated DeltaNet hybrid attention) |
| Layers | 48 (36 linear attention + 12 full attention) |
| Original Experts | 512 per layer, 10 active per token |
| After Pruning | 410 per layer (20% removed via REAP) |
| Quantization | AWQ W4A16, symmetric, group_size=128 |
| Format | compressed-tensors (compatible with vLLM, transformers) |
| Context Length | 262,144 tokens |
| Size on Disk | ~32 GB (7 shards) |

Evaluation Results

Evaluated on a custom benchmark suite covering code generation, reasoning, tool use, math, general knowledge, and writing. Each test is pass/fail with latency metrics.

REAP-AWQ (This Model, 410 experts)

| Run | Categories | Pass Rate | Avg tok/s | Notes |
|-----|------------|-----------|-----------|-------|
| Run 1 | code, reasoning, scaffold, chat | 13/17 (76%) | 141.4 | logic_puzzle, instruction_following, JSON formatting failed |
| Run 2 | code, math, general, writing | 17/17 (100%) | 136.0 | Clean sweep on expanded test set |

Baseline (Unpruned AWQ, 512 experts)

| Run | Categories | Pass Rate | Avg tok/s | Notes |
|-----|------------|-----------|-----------|-------|
| Run 1 | code, reasoning, scaffold, chat | 13/17 (76%) | 126.3 | Tool calling failed (vLLM config issue) |
| Run 2 | code, reasoning, scaffold, chat | 16/17 (94%) | 132.0 | Stable after vLLM restart |

Speed Comparison

Across comparable test categories, the REAP model is consistently faster:

| Category | Baseline (tok/s) | REAP (tok/s) | Speedup |
|----------|------------------|--------------|---------|
| Code | 126.2 | 135.1 | +7.1% |
| Reasoning | 131.5 | 140.9 | +7.2% |
| Scaffold / Tool Use | 142.3 | 152.1 | +6.9% |
| Chat | 128.7 | 138.2 | +7.4% |

These numbers reflect single-request latency. In practice, the VRAM savings compound at higher concurrency -- the freed memory supports larger batch sizes and longer contexts, where we've observed up to ~14% effective throughput gains in multi-request serving scenarios.

How It Was Made

The REAP Pipeline

REAP (Router-weighted Expert Activation Pruning) scores each expert by combining activation frequency with contribution magnitude:

REAP(expert) = sum(activation_norm * router_weight) / total_tokens

Experts that fire rarely, and contribute little when they do, receive the lowest scores and are pruned first.
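A minimal sketch of this scoring in NumPy, assuming per-token activation norms and router weights have already been collected by the observation hooks (the tensor shapes here are illustrative, not the pipeline's actual internals):

```python
import numpy as np

def reap_scores(activation_norms, router_weights):
    """Score each expert: sum of activation_norm * router_weight over all
    tokens, normalized by total token count.

    activation_norms: (tokens, experts) -- norm of each expert's output
    router_weights:   (tokens, experts) -- router gate weight (0 where not routed)
    """
    total_tokens = activation_norms.shape[0]
    return (activation_norms * router_weights).sum(axis=0) / total_tokens

def experts_to_prune(scores, ratio=0.20):
    """Return indices of the lowest-scoring `ratio` fraction of experts."""
    k = int(len(scores) * ratio)
    return np.argsort(scores)[:k]

# Toy example: 8 experts, 100 tokens, sparse routing
rng = np.random.default_rng(42)
norms = rng.random((100, 8))
weights = rng.random((100, 8)) * (rng.random((100, 8)) > 0.7)
scores = reap_scores(norms, weights)
pruned = experts_to_prune(scores, ratio=0.25)  # drops the 2 weakest experts
```

An expert that is routed to often but with near-zero gate weight, or routed to strongly but almost never, both end up with low scores under this metric.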

Diverse Calibration

A code model uses different experts for different tasks. Pruning based on code-only data risks removing experts critical for reasoning, creative writing, or math. We calibrated across 4 datasets:

| Dataset | Domain | Purpose |
|---------|--------|---------|
| evol-codealpaca-v1 | Code | Core competency |
| allenai/c4 | Web text | General language |
| WritingPrompts_curated | Creative writing | Long-form generation |
| tulu-3-sft-personas-math | Math | Reasoning chains |

256 samples per dataset, merged by summing accumulator metrics before computing derived scores.
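The merge order matters: raw accumulators are summed first, and the derived score is computed once on the totals. A sketch, with hypothetical accumulator field names:

```python
import numpy as np

def merge_observations(runs):
    """Merge per-dataset observation runs by summing raw accumulators,
    then compute the derived score once on the merged totals.

    Each run: {"weighted_norm_sum": array of shape (experts,), "token_count": int}.
    Summing before dividing keeps the merge exact; averaging per-run
    scores instead would bias toward smaller datasets.
    """
    total_sum = sum(r["weighted_norm_sum"] for r in runs)
    total_tokens = sum(r["token_count"] for r in runs)
    return total_sum / total_tokens

# Two toy runs over 2 experts
run_a = {"weighted_norm_sum": np.array([4.0, 1.0]), "token_count": 4}
run_b = {"weighted_norm_sum": np.array([2.0, 5.0]), "token_count": 6}
merged = merge_observations([run_a, run_b])  # (4+2)/10 and (1+5)/10
```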

Super-Expert Preservation

Some experts have extremely high peak activations -- they fire rarely but are critical when they do (e.g., handling rare syntax patterns or domain-specific tokens). These "super-experts" are protected from pruning regardless of their average REAP score, preventing catastrophic failures on rare inputs.
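One way to implement this protection, sketched below with an assumed quantile-based threshold on peak activations (the real pipeline's criterion may differ):

```python
import numpy as np

def prune_mask(scores, peak_activations, ratio=0.20, super_quantile=0.99):
    """Mark experts for pruning by ascending REAP score, but never prune
    'super-experts' whose peak activation exceeds the given quantile."""
    threshold = np.quantile(peak_activations, super_quantile)
    protected = peak_activations >= threshold
    n_prune = int(len(scores) * ratio)
    mask = np.zeros(len(scores), dtype=bool)
    for idx in np.argsort(scores):      # weakest first
        if mask.sum() >= n_prune:
            break
        if not protected[idx]:          # skip super-experts entirely
            mask[idx] = True
    return mask

# Toy example: expert 1 has the 2nd-lowest score but a huge peak activation,
# so pruning skips it and takes the next-weakest expert instead.
scores = np.arange(10, dtype=float)     # expert 0 is weakest
peaks = np.ones(10)
peaks[1] = 100.0                        # fires rarely but very hard
mask = prune_mask(scores, peaks, ratio=0.30)
```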

Quantization

After pruning, the remaining 410 experts are quantized using AWQ:

  • Scheme: W4A16, symmetric, group_size=128
  • Calibration: 256 samples from evol-codealpaca-v1
  • Format: compressed-tensors via llmcompressor

Layers kept at full precision (these are sensitive to quantization):

  • MoE router gates (mlp.gate, mlp.shared_expert_gate)
  • Gated DeltaNet internals (linear_attn.conv1d, in_proj_a, in_proj_b)
  • Output head (lm_head)
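Expressed as an llmcompressor recipe, this might look like the configuration sketch below. The module paths, regex-style ignore patterns, and `oneshot` arguments follow llmcompressor's documented AWQ interface, but treat this as an assumption-laden sketch rather than the exact script used:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    scheme="W4A16",        # 4-bit weights, 16-bit activations
    targets=["Linear"],
    ignore=[               # quantization-sensitive layers stay full precision
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.conv1d$",
        "re:.*linear_attn.in_proj_a$",
        "re:.*linear_attn.in_proj_b$",
    ],
)

oneshot(
    model="path/to/pruned-410-expert-checkpoint",  # hypothetical path
    dataset="evol-codealpaca-v1",
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
```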

Memory-Managed Execution

The full pipeline ran on 4x RTX 3090 (96 GB VRAM) with 128 GB system RAM. The 149 GB BF16 model doesn't fit in GPU memory, so each phase runs as a separate OS process with CPU offload:

  1. Observe (4 runs, one per dataset) -- hooks accumulate statistics to CPU RAM
  2. Merge -- CPU-only, combines observations
  3. Prune -- fresh model load, in-place expert removal
  4. AWQ -- max_memory caps at 20 GiB/GPU with 100 GiB CPU overflow

Process isolation guarantees clean GPU state between phases.
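The phase sequence can be sketched as a small driver script. The phase names and `pipeline.py` runner below are illustrative, not the actual tooling; the point is that each phase is a fresh OS process, so the driver fully reclaims VRAM between phases:

```python
import subprocess
import sys

# One subprocess per phase: GPU memory from one phase can never leak into
# the next, because the CUDA context dies with the process.
PHASES = [
    ["observe", "--dataset", "evol-codealpaca-v1"],
    ["observe", "--dataset", "c4"],
    ["observe", "--dataset", "WritingPrompts_curated"],
    ["observe", "--dataset", "tulu-3-sft-personas-math"],
    ["merge"],
    ["prune", "--ratio", "0.20"],
    ["quantize", "--scheme", "W4A16"],
]

def run_pipeline(runner="pipeline.py"):
    for args in PHASES:
        # check=True aborts the whole pipeline if any phase fails, so a
        # later phase never runs against stale artifacts.
        subprocess.run([sys.executable, runner, *args], check=True)
```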

Limitations

This is an experimental research model. Known issues:

  • Self-correction loops: On some prompts, the model second-guesses itself more than the baseline ("Wait -- let me re-check..."), producing verbose but ultimately correct answers. This appears to be an artifact of the pruning affecting confidence calibration.
  • Structured output: Occasional JSON formatting errors (e.g., missing closing brackets). The model understands the structure but sometimes truncates. Constrained decoding (e.g., vLLM's guided_json) mitigates this.
  • Logic puzzles: The model struggles with certain ordering/constraint satisfaction problems that the baseline also finds difficult. Pruning didn't help here.
  • Instruction following edge cases: Rarely drops minor formatting instructions (e.g., numbered lists vs. unnumbered). Core instruction comprehension is intact.

None of these are showstoppers for most use cases. The model handles code generation, mathematical reasoning, general Q&A, creative writing, and tool calling well.

Usage

Serving with vLLM

```shell
vllm serve mtecnic/Qwen3-Coder-Next-REAP-AWQ \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.93 \
    --trust-remote-code \
    --max-model-len 32768 \
    --max-num-seqs 16
```
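Once the server is up, requests go through the OpenAI-compatible API. The sketch below uses only the standard library and passes vLLM's `guided_json` extension, which constrains decoding to a JSON schema and mitigates the occasional bracket-truncation noted under Limitations; the schema itself is a made-up example:

```python
import json
import urllib.request

# Made-up schema for illustration; any JSON Schema object works.
schema = {
    "type": "object",
    "properties": {
        "function_name": {"type": "string"},
        "docstring": {"type": "string"},
    },
    "required": ["function_name", "docstring"],
}

payload = {
    "model": "mtecnic/Qwen3-Coder-Next-REAP-AWQ",
    "messages": [
        {"role": "user", "content": "Name and document a list-merging function as JSON."}
    ],
    "guided_json": schema,  # vLLM-specific extension to the OpenAI schema
}

def query(url="http://localhost:8000/v1/chat/completions"):
    """POST the request to a locally running vLLM server (see command above)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```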

Python (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mtecnic/Qwen3-Coder-Next-REAP-AWQ",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mtecnic/Qwen3-Coder-Next-REAP-AWQ")

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Pipeline Configuration

```yaml
# Observation
samples_per_dataset: 256
max_sequence_length: 512
distance_metric: cosine
datasets:
  - evol-codealpaca-v1
  - c4
  - WritingPrompts_curated
  - tulu-3-sft-personas-math

# Pruning
method: reap
compression_ratio: 0.20
preserve_super_experts: true
seed: 42

# Quantization
method: awq
scheme: W4A16
group_size: 128
calibration_samples: 256
```

Research Context

This model is the result of extensive experimentation with MoE expert pruning. Key learnings:

  • 40% compression was too aggressive -- an earlier attempt removing 205/512 experts per layer caused noticeable quality degradation across all categories.
  • 20% is the sweet spot for this architecture -- quality is largely preserved while delivering meaningful speed and memory improvements.
  • Diverse calibration is essential -- code-only calibration misidentifies experts that are critical for reasoning and general language tasks.
  • Super-expert preservation prevents catastrophic edge cases -- without it, rare but important patterns (unusual syntax, domain-specific tokens) break completely.
  • The Gated DeltaNet layers are fragile -- quantizing the linear attention internals (conv1d, in_proj_a/b) caused significant quality loss. Keeping them at full precision is non-negotiable.

Hardware Requirements

  • Minimum: 4x 24 GB GPUs (e.g., RTX 3090/4090) with tensor parallelism
  • Recommended: 2x 48 GB GPUs (e.g., A6000) or 1x 80 GB GPU (e.g., A100/H100)
  • The ~5 GB VRAM savings vs. unpruned AWQ is most impactful on memory-constrained setups


License

This model inherits the license of the base Qwen3-Coder-Next model. See the Qwen license for details.

Citation

```bibtex
@misc{wienandt2026reap_awq,
  title={REAP Expert Pruning of Qwen3-Coder-Next: 20\% Expert Reduction with AWQ Quantization},
  author={Nic Wienandt},
  year={2026},
  url={https://huggingface.co/mtecnic/Qwen3-Coder-Next-REAP-AWQ}
}
```