Qwen3-Coder-Next: RCO pruned variants
Expert-pruned checkpoints of Qwen/Qwen3-Coder-Next produced by Riemannian Constrained Optimization (RCO).
- Paper: Model Compression with Exact Budget Constraints via Riemannian Manifolds (Helcig & Alistarh, 2026)
- Code: github.com/IST-DASLab/RCO
Eight variants, one per (sparsity × calibration × allocation) combination:
| Sparsity | Calibration | Allocation | Folder |
|---|---|---|---|
| 25% | coding | uniform | coding-25-uniform/ |
| 25% | coding | nonuniform | coding-25-nonuniform/ |
| 50% | coding | uniform | coding-50-uniform/ |
| 50% | coding | nonuniform | coding-50-nonuniform/ |
| 25% | general | uniform | general-25-uniform/ |
| 25% | general | nonuniform | general-25-nonuniform/ |
| 50% | general | uniform | general-50-uniform/ |
| 50% | general | nonuniform | general-50-nonuniform/ |
What is expert pruning?
Qwen3-Coder-Next is a Mixture-of-Experts (MoE) model with 512 routed experts per layer (top-10 active per token). Most experts are rarely used. Expert pruning permanently removes low-impact experts, shrinking the checkpoint and reducing memory at inference time.
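Conceptually, pruning an MoE layer shrinks both the expert weights and the router's output dimension. The toy sketch below (illustrative only, not the RCO algorithm; all numbers and function names are made up) shows how routing behaves once a subset of experts is removed:

```python
# Sketch: how expert pruning changes MoE routing (toy example, not RCO code).
# A layer with N experts routes each token to the top_k highest-scoring ones.
# Pruning deletes a set of expert indices: their weights leave the checkpoint
# and the router's logit vector shrinks to the surviving experts.

def route(router_logits, top_k):
    """Return the indices of the top_k experts for one token."""
    return sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:top_k]

def prune_router(router_logits, kept_experts):
    """Keep only the logits of surviving experts (the checkpoint shrinks the same way)."""
    return [router_logits[i] for i in kept_experts]

# Toy layer: 8 experts, top-2 routing; experts 1, 4, and 6 are pruned away.
logits = [0.1, 2.0, 0.3, 1.5, -0.2, 0.9, 0.0, 1.1]
kept = [0, 2, 3, 5, 7]                       # 5 of 8 experts survive
pruned_logits = prune_router(logits, kept)
print(route(pruned_logits, top_k=2))         # -> [2, 4]: top-2 among survivors only
```

In the real model the same idea applies per layer with 512 experts and top-10 routing; uniform and nonuniform variants differ only in how many experts each layer keeps.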
Uniform vs nonuniform allocation
Each variant prunes a fixed fraction (25% or 50%) of the total experts across all 48 layers. The key design choice is how to distribute that budget across layers:
- Uniform: every layer keeps the same number of experts (384 per layer at 25% sparsity, 256 at 50%). This is simple and compatible with stock inference frameworks: `config.num_experts` is a single integer, so vLLM, HuggingFace, and SGLang load the checkpoint without any code changes. However, forcing the same budget on every layer is suboptimal because some layers are more sensitive to pruning than others.
- Nonuniform: the optimizer distributes the pruning budget across layers based on calibration loss. Critical layers keep more experts; redundant layers are pruned more aggressively. At the same total sparsity, this recovers more of the base model's quality (e.g., 97% HumanEval recovery vs 55% at 50% sparsity). The trade-off: each layer has a different expert count, which stock frameworks don't support out of the box. Nonuniform variants include a bundled `vllm_pruned_patch.py` that monkey-patches vLLM to handle per-layer expert counts (setup in the "How to run" section below, and a one-page reference in each variant's `LOAD_VLLM.md`).
The gap grows with sparsity. At 25%, uniform is about 8 points behind nonuniform on HumanEval. At 50%, the gap is 42 points (0.409 vs 0.720). For general benchmarks, uniform and nonuniform perform comparably at both sparsity levels (within 1 point on MC-8).
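The budget accounting can be sketched in a few lines. The nonuniform per-layer counts below are hypothetical (the real ones come from the RCO optimizer); the point is that both schemes keep exactly the same total number of experts:

```python
# Sketch of the two allocation schemes at 25% sparsity. The nonuniform split
# shown here is made up for illustration; RCO chooses the real per-layer counts
# from calibration loss. Both schemes land exactly on the same total budget.
NUM_LAYERS, NUM_EXPERTS = 48, 512
budget = int(NUM_LAYERS * NUM_EXPERTS * 0.75)     # total experts kept at 25% sparsity

uniform = [NUM_EXPERTS * 3 // 4] * NUM_LAYERS     # 384 experts in every layer

# Nonuniform: sensitive layers keep more, redundant layers keep fewer,
# while the sum stays exactly on budget.
nonuniform = [448] * 16 + [384] * 16 + [320] * 16

assert sum(uniform) == sum(nonuniform) == budget
print(budget, uniform[0], max(nonuniform), min(nonuniform))  # -> 18432 384 448 320
```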
Coding vs general calibration
RCO optimizes which experts to prune by minimizing KL divergence on a calibration dataset. The choice of calibration data determines what the pruned model preserves:
- Coding (evol-codealpaca): preserves code generation ability (HumanEval, MBPP, LiveCodeBench) at the cost of general knowledge (MC-8).
- General (FineWeb-Edu): preserves general reasoning and knowledge benchmarks (ARC, HellaSwag, MMLU, etc.) but loses coding ability almost entirely.
This is not a limitation of the method; it reflects how specialized the base model's experts are. Pick the calibration that matches your deployment use case.
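The calibration objective can be illustrated in miniature. The sketch below computes KL divergence between a "full model" next-token distribution and two "pruned" ones; the distributions are toy numbers, not real model outputs, but they show why the expert selection that minimizes KL on coding data tracks coding behavior:

```python
import math

# Sketch: the calibration objective RCO minimizes, in miniature. For each
# calibration token, compare the full model's next-token distribution p with
# the pruned model's q via KL(p || q); expert choices that keep the average
# KL low on the calibration set preserve that domain's behavior.
# (Toy distributions for illustration, not real model outputs.)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p       = [0.70, 0.20, 0.10]   # full model on a coding token
q_match = [0.65, 0.25, 0.10]   # pruned with matching (coding) calibration
q_mismatch = [0.30, 0.40, 0.30]  # pruned with mismatched calibration

print(kl_divergence(p, q_match) < kl_divergence(p, q_mismatch))  # -> True
```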
Which variant should I pick?
| Use case | Recommended variant |
|---|---|
| Coding, easy deployment | coding-25-uniform (92% HE, stock vLLM) |
| Coding, best quality | coding-25-nonuniform (100% HE, needs patch) |
| Coding, max compression | coding-50-nonuniform (97% HE, needs patch) |
| General, easy deployment | general-25-uniform (99% MC-8, stock vLLM) |
| General, best quality | general-25-nonuniform (100% MC-8, needs patch) |
| General, max compression | general-50-uniform (92% MC-8, stock vLLM) |
How to run
Uniform variants (stock vLLM / HuggingFace)
Uniform variants have a single `config.num_experts` value. They load with zero code changes:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="./coding-25-uniform", tensor_parallel_size=4,
          dtype="bfloat16", trust_remote_code=True)
out = llm.generate(["def fib(n):"], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
```
Or with HuggingFace transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./coding-25-uniform", torch_dtype="bfloat16",
    device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./coding-25-uniform")
```
Nonuniform variants (needs monkey-patch)
Nonuniform variants have different expert counts per layer. Stock vLLM builds every layer with `config.num_experts` (a single integer), which causes a shape mismatch on load. The repo provides three files to handle this. Two live inside each nonuniform variant folder:
- `vllm_pruned_patch.py`: overrides `Qwen3NextSparseMoeBlock.__init__` to read per-layer counts from `config.per_layer_num_experts`, and `Qwen3NextForCausalLM.get_expert_mapping` to use the max kept count
- `sitecustomize.py`: auto-applies the patch in every Python process, including vLLM worker subprocesses spawned via `multiprocessing.spawn`
And one at the repo root:
- `run_vllm_nonuniform.py`: wrapper script that sets up `PYTHONPATH` and applies the patch
Note: `enforce_eager=True` is required when loading nonuniform variants. CUDA-graph capture currently does not support the heterogeneous expert layout.
Option 1: Use the bundled script
```bash
# From the repo root
python run_vllm_nonuniform.py --model ./coding-25-nonuniform --tp 4

# Custom prompt
python run_vllm_nonuniform.py --model ./coding-50-nonuniform --tp 4 \
    --prompt "Write a Python function to merge two sorted lists."
```
Option 2: Set PYTHONPATH manually
The key requirement is that the variant folder (which contains `sitecustomize.py`) is on `PYTHONPATH` so that vLLM worker subprocesses pick up the patch automatically:

```bash
export PYTHONPATH=/path/to/coding-25-nonuniform:${PYTHONPATH:-}
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='/path/to/coding-25-nonuniform', tensor_parallel_size=4,
          dtype='bfloat16', trust_remote_code=True, enforce_eager=True)
out = llm.generate(['def fib(n):'], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)
"
```
Option 3: From Python (library use)
```python
import sys, os

# Add the variant folder to the path BEFORE importing vllm, and to PYTHONPATH
# so spawned worker subprocesses inherit it.
sys.path.insert(0, "/path/to/coding-25-nonuniform")
os.environ["PYTHONPATH"] = "/path/to/coding-25-nonuniform:" + os.environ.get("PYTHONPATH", "")

import vllm_pruned_patch
vllm_pruned_patch.apply()

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/coding-25-nonuniform",
          tensor_parallel_size=4, dtype="bfloat16",
          trust_remote_code=True, enforce_eager=True)
```
Why `PYTHONPATH`? vLLM uses `multiprocessing.spawn` for worker processes (required with CUDA). Spawned workers re-import all modules from scratch, so a monkey-patch applied only in the parent process is lost. Python's `sitecustomize.py` mechanism runs automatically in every interpreter that has the relevant directory on `sys.path`. Putting the variant folder on `PYTHONPATH` is the simplest way to ensure all workers get the patch.
Note on tensor parallelism: TP works fine with nonuniform variants (TP shards hidden dimensions inside each expert, not across experts). Expert parallelism (EP) does NOT work with heterogeneous counts; keep `--enable-eplb` off (the default).
Evaluation results
All evaluations run with vLLM (bf16, greedy decoding). Coding benchmarks: HumanEval (pass@1), MBPP (pass@1). General benchmarks: ARC-Challenge, ARC-Easy, BoolQ, HellaSwag, MMLU, OpenBookQA, RTE, WinoGrande (accuracy / acc_norm). MC-8 is the unweighted average of the eight general benchmarks. Recovery is relative to the full (unpruned) model.
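For clarity, the "rec." columns are simply the pruned score as a percentage of the full model's score, rounded to the nearest point:

```python
# How the "rec." columns in the tables below are computed:
# pruned score / full-model score, as a rounded percentage.
def recovery(pruned_score, full_score):
    return round(100 * pruned_score / full_score)

print(recovery(0.683, 0.744))  # coding-25-uniform, HumanEval    -> 92
print(recovery(0.720, 0.744))  # coding-50-nonuniform, HumanEval -> 97
print(recovery(0.409, 0.744))  # coding-50-uniform, HumanEval    -> 55
```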
Coding benchmarks
| Variant | Size | HumanEval | rec. | MBPP | rec. |
|---|---|---|---|---|---|
| Full model | 159 GB | 0.744 | n/a | 0.764 | n/a |
| coding-25-uniform | 121 GB | 0.683 | 92% | 0.688 | 90% |
| coding-25-nonuniform | 121 GB | 0.744 | 100% | 0.678 | 89% |
| coding-50-uniform | 82 GB | 0.409 | 55% | 0.534 | 70% |
| coding-50-nonuniform | 82 GB | 0.720 | 97% | 0.690 | 90% |
| general-25-uniform | 121 GB | 0.043 | 6% | 0.046 | 6% |
| general-25-nonuniform | 121 GB | 0.061 | 8% | 0.058 | 8% |
| general-50-uniform | 82 GB | 0.000 | 0% | 0.018 | 2% |
| general-50-nonuniform | 82 GB | 0.012 | 2% | 0.010 | 1% |
General benchmarks (MC-8)
| Variant | MC-8 avg | rec. | ARC-C | ARC-E | BoolQ | HSwag | MMLU | OBQA | RTE | WinoG |
|---|---|---|---|---|---|---|---|---|---|---|
| Full model | 0.714 | n/a | 0.606 | 0.821 | 0.885 | 0.775 | 0.767 | 0.430 | 0.765 | 0.666 |
| coding-25-uniform | 0.656 | 92% | 0.501 | 0.722 | 0.864 | 0.690 | 0.710 | 0.380 | 0.729 | 0.655 |
| coding-25-nonuniform | 0.638 | 89% | 0.462 | 0.662 | 0.851 | 0.665 | 0.680 | 0.362 | 0.776 | 0.642 |
| coding-50-uniform | 0.577 | 81% | 0.403 | 0.641 | 0.789 | 0.578 | 0.564 | 0.350 | 0.671 | 0.616 |
| coding-50-nonuniform | 0.546 | 76% | 0.356 | 0.555 | 0.776 | 0.548 | 0.543 | 0.340 | 0.646 | 0.603 |
| general-25-uniform | 0.707 | 99% | 0.600 | 0.807 | 0.876 | 0.785 | 0.704 | 0.452 | 0.751 | 0.677 |
| general-25-nonuniform | 0.714 | 100% | 0.618 | 0.822 | 0.882 | 0.776 | 0.712 | 0.442 | 0.762 | 0.699 |
| general-50-uniform | 0.654 | 92% | 0.541 | 0.771 | 0.839 | 0.709 | 0.610 | 0.428 | 0.675 | 0.658 |
| general-50-nonuniform | 0.644 | 90% | 0.526 | 0.762 | 0.842 | 0.708 | 0.595 | 0.414 | 0.675 | 0.635 |
Key takeaways
- Calibration domain determines the trade-off. Coding-calibrated variants preserve code generation (up to 100% HumanEval recovery) but lose general knowledge. General-calibrated variants preserve MC-8 (up to 100% recovery) but lose coding ability entirely.
- Nonuniform allocation matters most at high sparsity. At 50% sparsity, nonuniform recovers 97% of HumanEval vs 55% for uniform, a 42-point gap. At 25%, the gap is smaller (100% vs 92%).
- 25% sparsity is nearly lossless for the target domain. Both coding-25-nonuniform (100% HE) and general-25-nonuniform (100% MC-8) match the full model within noise.
- Uniform variants load in stock vLLM/HF with no patches. Nonuniform variants require the bundled `vllm_pruned_patch.py` (see "How to run" above).
Citation
```bibtex
@misc{helcig2026rco,
  title={Model Compression with Exact Budget Constraints via Riemannian Manifolds},
  author={Helcig, Michael and Alistarh, Dan},
  year={2026},
  eprint={2605.00649},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.00649},
}
```