Qwen3-Coder-Next-REAP-AWQ

Expert-pruned and AWQ-quantized Qwen3-Coder-Next. 20% of MoE experts removed via REAP saliency analysis across diverse calibration data, then quantized to W4A16. The result is a model that runs ~7-12% faster at the token level, uses significantly less VRAM, and frees up memory for larger KV caches and higher concurrency -- at the cost of occasional quality regressions on certain tasks.

Status: Research / Experimental. This model produces usable output across code, math, reasoning, and general tasks, but some outputs may be less polished than the unpruned baseline -- particularly around structured output formatting and multi-step logic chains. It works. It's faster. It's not perfect. See Limitations for specifics.

Why This Exists

Qwen3-Coder-Next is a large Mixture-of-Experts model with 512 experts per layer, but only 10 are active per token. That means ~98% of expert parameters sit idle for any given input. This creates an opportunity: measure which experts matter least across a diverse workload and remove them.

Fewer experts means:

  • Smaller model (~32 GB vs ~37 GB for unpruned AWQ) -- 5 GB freed for KV cache
  • Faster inference -- less memory bandwidth pressure, fewer expert weights to page
  • Higher concurrency -- more VRAM headroom for batched requests

The trade-off is a small quality hit on some tasks, which we document below.

Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen/Qwen3-Coder-Next (BF16, 149 GB) |
| Architecture | Qwen3NextForCausalLM (MoE + Gated DeltaNet hybrid attention) |
| Layers | 48 (36 linear attention + 12 full attention) |
| Original Experts | 512 per layer, 10 active per token |
| After Pruning | 410 per layer (20% removed via REAP) |
| Quantization | AWQ W4A16, symmetric, group_size=128 |
| Format | compressed-tensors (compatible with vLLM, transformers) |
| Context Length | 262,144 tokens |
| Size on Disk | ~32 GB (7 shards) |

Evaluation Results

Evaluated on a custom benchmark suite covering code generation, reasoning, tool use, math, general knowledge, and writing. Each test is pass/fail with latency metrics.

REAP-AWQ (This Model, 410 experts)

| Run | Categories | Pass Rate | Avg tok/s | Notes |
|-----|------------|-----------|-----------|-------|
| Run 1 | code, reasoning, scaffold, chat | 13/17 (76%) | 141.4 | logic_puzzle, instruction_following, JSON formatting failed |
| Run 2 | code, math, general, writing | 17/17 (100%) | 136.0 | Clean sweep on expanded test set |

Baseline (Unpruned AWQ, 512 experts)

| Run | Categories | Pass Rate | Avg tok/s | Notes |
|-----|------------|-----------|-----------|-------|
| Run 1 | code, reasoning, scaffold, chat | 13/17 (76%) | 126.3 | Tool calling failed (vLLM config issue) |
| Run 2 | code, reasoning, scaffold, chat | 16/17 (94%) | 132.0 | Stable after vLLM restart |

Speed Comparison

Across comparable test categories, the REAP model is consistently faster:

| Category | Baseline (tok/s) | REAP (tok/s) | Speedup |
|----------|------------------|--------------|---------|
| Code | 126.2 | 135.1 | +7.1% |
| Reasoning | 131.5 | 140.9 | +7.2% |
| Scaffold / Tool Use | 142.3 | 152.1 | +6.9% |
| Chat | 128.7 | 138.2 | +7.4% |

These numbers reflect single-request latency. In practice, the VRAM savings compound at higher concurrency -- the freed memory supports larger batch sizes and longer contexts, where we've observed up to ~14% effective throughput gains in multi-request serving scenarios.

How It Was Made

The REAP Pipeline

REAP (Router-weighted Expert Activation Pruning) scores each expert by combining activation frequency with contribution magnitude:

REAP(expert) = sum(activation_norm * router_weight) / total_tokens

Experts that fire rarely, and contribute little when they do, receive the lowest scores and are pruned first.
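A minimal sketch of this scoring in NumPy, assuming per-token activation norms and router weights have already been collected by the observation hooks (the tensor shapes here are illustrative, not the pipeline's actual internals):

```python
import numpy as np

def reap_scores(activation_norms, router_weights):
    """Score each expert: sum of activation_norm * router_weight over all
    tokens, normalized by total token count.

    activation_norms: (tokens, experts) -- norm of each expert's output
    router_weights:   (tokens, experts) -- router gate weight (0 where not routed)
    """
    total_tokens = activation_norms.shape[0]
    return (activation_norms * router_weights).sum(axis=0) / total_tokens

def experts_to_prune(scores, ratio=0.20):
    """Return indices of the lowest-scoring `ratio` fraction of experts."""
    k = int(len(scores) * ratio)
    return np.argsort(scores)[:k]

# Toy example: 8 experts, 100 tokens, sparse routing
rng = np.random.default_rng(42)
norms = rng.random((100, 8))
weights = rng.random((100, 8)) * (rng.random((100, 8)) > 0.7)
scores = reap_scores(norms, weights)
pruned = experts_to_prune(scores, ratio=0.25)  # drops the 2 weakest experts
```

An expert that is routed to often but with near-zero gate weight, or routed to strongly but almost never, both end up with low scores under this metric.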

Diverse Calibration

A code model uses different experts for different tasks. Pruning based on code-only data risks removing experts critical for reasoning, creative writing, or math. We calibrated across 4 datasets:

| Dataset | Domain | Purpose |
|---------|--------|---------|
| evol-codealpaca-v1 | Code | Core competency |
| allenai/c4 | Web text | General language |
| WritingPrompts_curated | Creative writing | Long-form generation |
| tulu-3-sft-personas-math | Math | Reasoning chains |

256 samples per dataset, merged by summing accumulator metrics before computing derived scores.
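The merge order matters: raw accumulators are summed first, and the derived score is computed once on the totals. A sketch, with hypothetical accumulator field names:

```python
import numpy as np

def merge_observations(runs):
    """Merge per-dataset observation runs by summing raw accumulators,
    then compute the derived score once on the merged totals.

    Each run: {"weighted_norm_sum": array of shape (experts,), "token_count": int}.
    Summing before dividing keeps the merge exact; averaging per-run
    scores instead would bias toward smaller datasets.
    """
    total_sum = sum(r["weighted_norm_sum"] for r in runs)
    total_tokens = sum(r["token_count"] for r in runs)
    return total_sum / total_tokens

# Two toy runs over 2 experts
run_a = {"weighted_norm_sum": np.array([4.0, 1.0]), "token_count": 4}
run_b = {"weighted_norm_sum": np.array([2.0, 5.0]), "token_count": 6}
merged = merge_observations([run_a, run_b])  # (4+2)/10 and (1+5)/10
```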

Super-Expert Preservation

Some experts have extremely high peak activations -- they fire rarely but are critical when they do (e.g., handling rare syntax patterns or domain-specific tokens). These "super-experts" are protected from pruning regardless of their average REAP score, preventing catastrophic failures on rare inputs.
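One way to implement this protection, sketched below with an assumed quantile-based threshold on peak activations (the real pipeline's criterion may differ):

```python
import numpy as np

def prune_mask(scores, peak_activations, ratio=0.20, super_quantile=0.99):
    """Mark experts for pruning by ascending REAP score, but never prune
    'super-experts' whose peak activation exceeds the given quantile."""
    threshold = np.quantile(peak_activations, super_quantile)
    protected = peak_activations >= threshold
    n_prune = int(len(scores) * ratio)
    mask = np.zeros(len(scores), dtype=bool)
    for idx in np.argsort(scores):      # weakest first
        if mask.sum() >= n_prune:
            break
        if not protected[idx]:          # skip super-experts entirely
            mask[idx] = True
    return mask

# Toy example: expert 1 has the 2nd-lowest score but a huge peak activation,
# so pruning skips it and takes the next-weakest expert instead.
scores = np.arange(10, dtype=float)     # expert 0 is weakest
peaks = np.ones(10)
peaks[1] = 100.0                        # fires rarely but very hard
mask = prune_mask(scores, peaks, ratio=0.30)
```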

Quantization

After pruning, the remaining 410 experts are quantized using AWQ:

  • Scheme: W4A16, symmetric, group_size=128
  • Calibration: 256 samples from evol-codealpaca-v1
  • Format: compressed-tensors via llmcompressor

Layers kept at full precision (these are sensitive to quantization):

  • MoE router gates (mlp.gate, mlp.shared_expert_gate)
  • Gated DeltaNet internals (linear_attn.conv1d, in_proj_a, in_proj_b)
  • Output head (lm_head)
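Expressed as an llmcompressor recipe, this might look like the configuration sketch below. The module paths, regex-style ignore patterns, and `oneshot` arguments follow llmcompressor's documented AWQ interface, but treat this as an assumption-laden sketch rather than the exact script used:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    scheme="W4A16",        # 4-bit weights, 16-bit activations
    targets=["Linear"],
    ignore=[               # quantization-sensitive layers stay full precision
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.conv1d$",
        "re:.*linear_attn.in_proj_a$",
        "re:.*linear_attn.in_proj_b$",
    ],
)

oneshot(
    model="path/to/pruned-410-expert-checkpoint",  # hypothetical path
    dataset="evol-codealpaca-v1",
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=256,
)
```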

Memory-Managed Execution

The full pipeline ran on 4x RTX 3090 (96 GB VRAM) with 128 GB system RAM. The 149 GB BF16 model doesn't fit in GPU memory, so each phase runs as a separate OS process with CPU offload:

  1. Observe (4 runs, one per dataset) -- hooks accumulate statistics to CPU RAM
  2. Merge -- CPU-only, combines observations
  3. Prune -- fresh model load, in-place expert removal
  4. AWQ -- max_memory caps at 20 GiB/GPU with 100 GiB CPU overflow

Process isolation guarantees clean GPU state between phases.
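The phase sequence can be sketched as a small driver script. The phase names and `pipeline.py` runner below are illustrative, not the actual tooling; the point is that each phase is a fresh OS process, so the driver fully reclaims VRAM between phases:

```python
import subprocess
import sys

# One subprocess per phase: GPU memory from one phase can never leak into
# the next, because the CUDA context dies with the process.
PHASES = [
    ["observe", "--dataset", "evol-codealpaca-v1"],
    ["observe", "--dataset", "c4"],
    ["observe", "--dataset", "WritingPrompts_curated"],
    ["observe", "--dataset", "tulu-3-sft-personas-math"],
    ["merge"],
    ["prune", "--ratio", "0.20"],
    ["quantize", "--scheme", "W4A16"],
]

def run_pipeline(runner="pipeline.py"):
    for args in PHASES:
        # check=True aborts the whole pipeline if any phase fails, so a
        # later phase never runs against stale artifacts.
        subprocess.run([sys.executable, runner, *args], check=True)
```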

Limitations

This is an experimental research model. Known issues:

  • Self-correction loops: On some prompts, the model second-guesses itself more than the baseline ("Wait -- let me re-check..."), producing verbose but ultimately correct answers. This appears to be an artifact of the pruning affecting confidence calibration.
  • Structured output: Occasional JSON formatting errors (e.g., missing closing brackets). The model understands the structure but sometimes truncates. Constrained decoding (e.g., vLLM's guided_json) mitigates this.
  • Logic puzzles: The model struggles with certain ordering/constraint satisfaction problems that the baseline also finds difficult. Pruning didn't help here.
  • Instruction following edge cases: Rarely drops minor formatting instructions (e.g., numbered lists vs. unnumbered). Core instruction comprehension is intact.

None of these are showstoppers for most use cases. The model handles code generation, mathematical reasoning, general Q&A, creative writing, and tool calling well.

Usage

Serving with vLLM

```shell
vllm serve mtecnic/Qwen3-Coder-Next-REAP-AWQ \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.93 \
    --trust-remote-code \
    --max-model-len 32768 \
    --max-num-seqs 16
```
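Once the server is up, requests go through the OpenAI-compatible API. The sketch below uses only the standard library and passes vLLM's `guided_json` extension, which constrains decoding to a JSON schema and mitigates the occasional bracket-truncation noted under Limitations; the schema itself is a made-up example:

```python
import json
import urllib.request

# Made-up schema for illustration; any JSON Schema object works.
schema = {
    "type": "object",
    "properties": {
        "function_name": {"type": "string"},
        "docstring": {"type": "string"},
    },
    "required": ["function_name", "docstring"],
}

payload = {
    "model": "mtecnic/Qwen3-Coder-Next-REAP-AWQ",
    "messages": [
        {"role": "user", "content": "Name and document a list-merging function as JSON."}
    ],
    "guided_json": schema,  # vLLM-specific extension to the OpenAI schema
}

def query(url="http://localhost:8000/v1/chat/completions"):
    """POST the request to a locally running vLLM server (see command above)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```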

Python (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mtecnic/Qwen3-Coder-Next-REAP-AWQ",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mtecnic/Qwen3-Coder-Next-REAP-AWQ")

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Pipeline Configuration

```yaml
# Observation
samples_per_dataset: 256
max_sequence_length: 512
distance_metric: cosine
datasets:
  - evol-codealpaca-v1
  - c4
  - WritingPrompts_curated
  - tulu-3-sft-personas-math

# Pruning
method: reap
compression_ratio: 0.20
preserve_super_experts: true
seed: 42

# Quantization
method: awq
scheme: W4A16
group_size: 128
calibration_samples: 256
```

Research Context

This model is the result of extensive experimentation with MoE expert pruning. Key learnings:

  • 40% compression was too aggressive -- an earlier attempt removing 205/512 experts per layer caused noticeable quality degradation across all categories.
  • 20% is the sweet spot for this architecture -- quality is largely preserved while delivering meaningful speed and memory improvements.
  • Diverse calibration is essential -- code-only calibration misidentifies experts that are critical for reasoning and general language tasks.
  • Super-expert preservation prevents catastrophic edge cases -- without it, rare but important patterns (unusual syntax, domain-specific tokens) break completely.
  • The Gated DeltaNet layers are fragile -- quantizing the linear attention internals (conv1d, in_proj_a/b) caused significant quality loss. Keeping them at full precision is non-negotiable.

Hardware Requirements

  • Minimum: 4x 24 GB GPUs (e.g., RTX 3090/4090) with tensor parallelism
  • Recommended: 2x 48 GB GPUs (e.g., A6000) or 1x 80 GB GPU (e.g., A100/H100)
  • The ~5 GB VRAM savings vs. unpruned AWQ is most impactful on memory-constrained setups


License

This model inherits the license of the base Qwen3-Coder-Next model. See the Qwen license for details.

Citation

```bibtex
@misc{wienandt2026reap_awq,
  title={REAP Expert Pruning of Qwen3-Coder-Next: 20\% Expert Reduction with AWQ Quantization},
  author={Nic Wienandt},
  year={2026},
  url={https://huggingface.co/mtecnic/Qwen3-Coder-Next-REAP-AWQ}
}
```