Qwen3-Coder-Next-REAP-AWQ
Expert-pruned and AWQ-quantized Qwen3-Coder-Next. 20% of MoE experts removed via REAP saliency analysis across diverse calibration data, then quantized to W4A16. The result is a model that runs ~7-12% faster at the token level, uses significantly less VRAM, and frees up memory for larger KV caches and higher concurrency -- at the cost of occasional quality regressions on certain tasks.
Status: Research / Experimental. This model produces usable output across code, math, reasoning, and general tasks, but some outputs may be less polished than the unpruned baseline -- particularly around structured output formatting and multi-step logic chains. It works. It's faster. It's not perfect. See Limitations for specifics.
Why This Exists
Qwen3-Coder-Next is a large Mixture-of-Experts model with 512 experts per layer, but only 10 are active per token. That means ~98% of expert parameters sit idle for any given input. This creates an opportunity: measure which experts matter least across a diverse workload and remove them.
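The arithmetic behind that claim is simple enough to check directly:

```python
# Back-of-envelope check on the "~98% idle" figure: with 512 experts per
# layer and top-10 routing, only ~2% of expert parameters participate in
# any single token's forward pass.
experts_per_layer = 512
active_per_token = 10
idle_fraction = 1 - active_per_token / experts_per_layer
print(f"{idle_fraction:.1%} of experts idle per token")
```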
Fewer experts means:
- Smaller model (~32 GB vs ~37 GB for unpruned AWQ) -- 5 GB freed for KV cache
- Faster inference -- less memory bandwidth pressure, fewer expert weights to page
- Higher concurrency -- more VRAM headroom for batched requests
The trade-off is a small quality hit on some tasks, which we document below.
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Coder-Next (BF16, 149 GB) |
| Architecture | Qwen3NextForCausalLM (MoE + Gated DeltaNet hybrid attention) |
| Layers | 48 (36 linear attention + 12 full attention) |
| Original Experts | 512 per layer, 10 active per token |
| After Pruning | 410 per layer (20% removed via REAP) |
| Quantization | AWQ W4A16, symmetric, group_size=128 |
| Format | compressed-tensors (compatible with vLLM, transformers) |
| Context Length | 262,144 tokens |
| Size on Disk | ~32 GB (7 shards) |
Evaluation Results
Evaluated on a custom benchmark suite covering code generation, reasoning, tool use, math, general knowledge, and writing. Each test is pass/fail with latency metrics.
REAP-AWQ (This Model, 410 experts)
| Run | Categories | Pass Rate | Avg tok/s | Notes |
|---|---|---|---|---|
| Run 1 | code, reasoning, scaffold, chat | 13/17 (76%) | 141.4 | logic_puzzle, instruction_following, JSON formatting failed |
| Run 2 | code, math, general, writing | 17/17 (100%) | 136.0 | Clean sweep on expanded test set |
Baseline (Unpruned AWQ, 512 experts)
| Run | Categories | Pass Rate | Avg tok/s | Notes |
|---|---|---|---|---|
| Run 1 | code, reasoning, scaffold, chat | 13/17 (76%) | 126.3 | Tool calling failed (vLLM config issue) |
| Run 2 | code, reasoning, scaffold, chat | 16/17 (94%) | 132.0 | Stable after vLLM restart |
Speed Comparison
Across comparable test categories, the REAP model is consistently faster:
| Category | Baseline (tok/s) | REAP (tok/s) | Speedup |
|---|---|---|---|
| Code | 126.2 | 135.1 | +7.1% |
| Reasoning | 131.5 | 140.9 | +7.2% |
| Scaffold / Tool Use | 142.3 | 152.1 | +6.9% |
| Chat | 128.7 | 138.2 | +7.4% |
These numbers reflect single-request latency. In practice, the VRAM savings compound at higher concurrency -- the freed memory supports larger batch sizes and longer contexts, where we've observed up to ~14% effective throughput gains in multi-request serving scenarios.
How It Was Made
The REAP Pipeline
REAP (Router-weighted Expert Activation Pruning) scores each expert by combining activation frequency with contribution magnitude:
REAP(expert) = sum(activation_norm * router_weight) / total_tokens
Experts that rarely fire and contribute little when they do get the lowest scores and are pruned first.
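As a sketch of that scoring rule (the trace format and function name here are illustrative, not the actual pipeline API):

```python
def reap_scores(records, num_experts):
    """Illustrative REAP-style saliency. Each record is
    (token_id, expert_id, activation_norm, router_weight) for one routed
    slot; an expert's score is its accumulated norm * gate contribution,
    normalized by the total number of tokens observed."""
    scores = [0.0] * num_experts
    tokens = set()
    for token_id, expert_id, norm, gate in records:
        scores[expert_id] += norm * gate
        tokens.add(token_id)
    total = max(len(tokens), 1)
    return [s / total for s in scores]

# Toy trace: 3 tokens routed through a 4-expert layer with top-2 routing
trace = [
    (0, 1, 2.0, 0.7), (0, 3, 1.0, 0.3),
    (1, 1, 1.5, 0.6), (1, 2, 0.5, 0.4),
    (2, 3, 1.2, 0.8), (2, 0, 0.2, 0.2),
]
scores = reap_scores(trace, num_experts=4)
# Lowest-scoring experts are pruned first
order = sorted(range(4), key=lambda e: scores[e])
```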
Diverse Calibration
A code model uses different experts for different tasks. Pruning based on code-only data risks removing experts critical for reasoning, creative writing, or math. We calibrated across 4 datasets:
| Dataset | Domain | Purpose |
|---|---|---|
| `evol-codealpaca-v1` | Code | Core competency |
| `allenai/c4` | Web text | General language |
| `WritingPrompts_curated` | Creative writing | Long-form generation |
| `tulu-3-sft-personas-math` | Math | Reasoning chains |
256 samples per dataset, merged by summing accumulator metrics before computing derived scores.
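A sketch of why the merge sums raw accumulators rather than averaging finished scores (the data layout here is an assumption about the pipeline, not its actual format):

```python
def merge_observations(runs):
    """Hypothetical merge step: each observation run stores raw
    accumulators (per-expert summed contribution, token count).
    Summing accumulators first and dividing once weights every
    dataset's tokens equally; averaging per-dataset scores instead
    would over-weight short runs."""
    total_contrib = {}
    total_tokens = 0
    for contrib, tokens in runs:
        for expert, value in contrib.items():
            total_contrib[expert] = total_contrib.get(expert, 0.0) + value
        total_tokens += tokens
    return {e: v / total_tokens for e, v in total_contrib.items()}

# Two toy runs (e.g., code and math) over the same 2-expert layer
code_run = ({0: 8.0, 1: 2.0}, 100)
math_run = ({0: 1.0, 1: 6.0}, 50)
merged = merge_observations([code_run, math_run])
# merged == {0: 9.0 / 150, 1: 8.0 / 150}
```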
Super-Expert Preservation
Some experts have extremely high peak activations -- they fire rarely but are critical when they do (e.g., handling rare syntax patterns or domain-specific tokens). These "super-experts" are protected from pruning regardless of their average REAP score, preventing catastrophic failures on rare inputs.
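A minimal sketch of that protection logic (the threshold value and names are illustrative):

```python
def select_pruned(scores, peak_activations, prune_count, peak_threshold):
    """Sketch of super-expert preservation: experts whose peak activation
    exceeds the threshold are exempt from pruning regardless of their
    average REAP score; the remaining lowest-scoring experts are cut."""
    protected = {e for e, p in peak_activations.items() if p > peak_threshold}
    candidates = sorted(
        (e for e in scores if e not in protected), key=lambda e: scores[e]
    )
    return candidates[:prune_count]

scores = {0: 0.01, 1: 0.50, 2: 0.02, 3: 0.40}
peaks = {0: 9.5, 1: 1.2, 2: 0.8, 3: 1.1}  # expert 0 fires rarely but hard
pruned = select_pruned(scores, peaks, prune_count=1, peak_threshold=5.0)
# Expert 0 has the lowest average score but is protected as a
# super-expert; expert 2 is pruned instead
```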
Quantization
After pruning, the remaining 410 experts are quantized using AWQ:
- Scheme: W4A16, symmetric, group_size=128
- Calibration: 256 samples from `evol-codealpaca-v1`
- Format: `compressed-tensors` via `llmcompressor`
Layers kept at full precision (these are sensitive to quantization):
- MoE router gates (`mlp.gate`, `mlp.shared_expert_gate`)
- Gated DeltaNet internals (`linear_attn.conv1d`, `in_proj_a`, `in_proj_b`)
- Output head (`lm_head`)
Memory-Managed Execution
The full pipeline ran on 4x RTX 3090 (96 GB VRAM) with 128 GB system RAM. The 149 GB BF16 model doesn't fit in GPU memory, so each phase runs as a separate OS process with CPU offload:
1. Observe (4 runs, one per dataset) -- hooks accumulate statistics to CPU RAM
2. Merge -- CPU-only, combines observations
3. Prune -- fresh model load, in-place expert removal
4. AWQ -- `max_memory` caps at 20 GiB/GPU with 100 GiB CPU overflow
Process isolation guarantees clean GPU state between phases.
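A sketch of what that phase isolation looks like as a launcher. The script names and flags are hypothetical, not the actual pipeline files; the point is that each phase is its own OS process, so CUDA allocator state is fully torn down between phases:

```python
import sys

def phase_commands(python=sys.executable):
    """Build one subprocess command per pipeline phase (illustrative
    script names). Running each with subprocess.run(cmd, check=True)
    guarantees clean GPU state between phases and stops the pipeline
    on the first failure."""
    datasets = ["evol-codealpaca-v1", "c4", "WritingPrompts_curated",
                "tulu-3-sft-personas-math"]
    cmds = [[python, "observe.py", "--dataset", d] for d in datasets]
    cmds += [
        [python, "merge.py"],
        [python, "prune.py", "--compression-ratio", "0.20"],
        [python, "quantize_awq.py", "--scheme", "W4A16"],
    ]
    return cmds

commands = phase_commands()
```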
Limitations
This is an experimental research model. Known issues:
- Self-correction loops: On some prompts, the model second-guesses itself more than the baseline ("Wait -- let me re-check..."), producing verbose but ultimately correct answers. This appears to be an artifact of the pruning affecting confidence calibration.
- Structured output: Occasional JSON formatting errors (e.g., missing closing brackets). The model understands the structure but sometimes truncates. Constrained decoding (e.g., vLLM's `guided_json`) mitigates this.
- Logic puzzles: The model struggles with certain ordering/constraint-satisfaction problems that the baseline also finds difficult. Pruning didn't help here.
- Instruction following edge cases: Rarely drops minor formatting instructions (e.g., numbered vs. unnumbered lists). Core instruction comprehension is intact.
None of these are showstoppers for most use cases. The model handles code generation, mathematical reasoning, general Q&A, creative writing, and tool calling well.
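For the structured-output issue specifically, constrained decoding is the practical workaround. This sketch builds a request for vLLM's OpenAI-compatible server using its `guided_json` extension field, which restricts sampling to tokens that keep the output valid against a JSON Schema; the schema and prompt here are illustrative:

```python
import json

# Example JSON Schema the server will enforce during decoding
schema = {
    "type": "object",
    "properties": {
        "function_name": {"type": "string"},
        "arguments": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["function_name", "arguments"],
}

payload = {
    "model": "mtecnic/Qwen3-Coder-Next-REAP-AWQ",
    "messages": [{"role": "user", "content": "Emit a tool call as JSON."}],
    "guided_json": schema,
    "max_tokens": 256,
}
body = json.dumps(payload)
# POST this body to http://localhost:8000/v1/chat/completions (e.g., with
# requests.post); the returned content will parse against the schema,
# sidestepping the truncated-bracket failure mode.
```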
Usage
Serving with vLLM
```bash
vllm serve mtecnic/Qwen3-Coder-Next-REAP-AWQ \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.93 \
  --trust-remote-code \
  --max-model-len 32768 \
  --max-num-seqs 16
```
Python (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mtecnic/Qwen3-Coder-Next-REAP-AWQ",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mtecnic/Qwen3-Coder-Next-REAP-AWQ")

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Pipeline Configuration
```yaml
# Observation
samples_per_dataset: 256
max_sequence_length: 512
distance_metric: cosine
datasets:
  - evol-codealpaca-v1
  - c4
  - WritingPrompts_curated
  - tulu-3-sft-personas-math

# Pruning
method: reap
compression_ratio: 0.20
preserve_super_experts: true
seed: 42

# Quantization
method: awq
scheme: W4A16
group_size: 128
calibration_samples: 256
```
Research Context
This model is the result of extensive experimentation with MoE expert pruning. Key learnings:
- 40% compression was too aggressive -- an earlier attempt removing 205/512 experts per layer caused noticeable quality degradation across all categories.
- 20% is the sweet spot for this architecture -- quality is largely preserved while delivering meaningful speed and memory improvements.
- Diverse calibration is essential -- code-only calibration misidentifies experts that are critical for reasoning and general language tasks.
- Super-expert preservation prevents catastrophic edge cases -- without it, rare but important patterns (unusual syntax, domain-specific tokens) break completely.
- The Gated DeltaNet layers are fragile -- quantizing the linear attention internals (`conv1d`, `in_proj_a`/`in_proj_b`) caused significant quality loss. Keeping them at full precision is non-negotiable.
Hardware Requirements
- Minimum: 4x 24 GB GPUs (e.g., RTX 3090/4090) with tensor parallelism
- Recommended: 2x 48 GB GPUs (e.g., A6000) or 1x 80 GB GPU (e.g., A100/H100)
- The ~5 GB VRAM savings vs. unpruned AWQ is most impactful on memory-constrained setups
Acknowledgments
- Qwen for Qwen3-Coder-Next
- Cerebras for the REAP pruning methodology
- MIT HAN Lab for AWQ
- Neural Magic / vLLM for `llmcompressor` and efficient MoE serving
License
This model inherits the license of the base Qwen3-Coder-Next model. See the Qwen license for details.
Citation
```bibtex
@misc{wienandt2026reap_awq,
  title={REAP Expert Pruning of Qwen3-Coder-Next: 20\% Expert Reduction with AWQ Quantization},
  author={Nic Wienandt},
  year={2026},
  url={https://huggingface.co/mtecnic/Qwen3-Coder-Next-REAP-AWQ}
}
```