# Qwen3-Coder-Next 56B REAP

A 30% expert-pruned version of Qwen/Qwen3-Coder-Next, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).
| | Original | This Model |
|---|---|---|
| Total params | ~80B | 56.56B |
| Experts | 512 | 359 |
| Active params/tok | ~4.2B | ~4.2B |
| Experts/tok | 10 | 10 |
| Format | BF16 | BF16 |
| Disk size | ~149 GB | ~113 GB |
REAP removes 30% of the MoE experts in each layer (153 of 512) while preserving the model's routing behavior and output quality. The active parameter count per token is unchanged, since the router still selects 10 experts per token from the remaining pool. This yields a ~24% reduction in total disk/memory footprint at the cost of moderate quality degradation, concentrated in math tasks.
## Method

REAP (ICLR 2026) prunes Mixture-of-Experts models in one shot, scoring each expert's importance and then removing the least salient ones. Its components:

- Router gate values -- how often and how strongly the router selects each expert
- Expert activation norms -- the magnitude of each expert's output contribution
- Frequency-weighted saliency -- routing frequency combined with activation importance
- Router logit renormalization -- preserves the output distribution after experts are removed
- Layerwise application -- independent per-layer pruning decisions for stability
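The scoring idea can be illustrated with a toy sketch. This is not the actual REAP implementation: the token count, expert count, and the random gate/norm statistics are invented here purely to show how gate-weighted activation norms turn into a per-expert saliency and a pruning decision.

```python
import random

random.seed(42)
n_tokens, n_experts, top_k = 1000, 8, 2

# Toy calibration statistics: per token, router gate values for each expert and
# the norm of each expert's output (both invented for illustration).
gates = [[random.random() for _ in range(n_experts)] for _ in range(n_tokens)]
norms = [[0.5 + random.random() for _ in range(n_experts)] for _ in range(n_tokens)]

# Accumulate gate-weighted activation norms, counting only tokens where the
# expert was actually selected (top-k routing).
score_sum = [0.0] * n_experts
hit_count = [0] * n_experts
for g_row, n_row in zip(gates, norms):
    topk = sorted(range(n_experts), key=lambda j: -g_row[j])[:top_k]
    for j in topk:
        score_sum[j] += g_row[j] * n_row[j]
        hit_count[j] += 1

# REAP-style saliency: average contribution over the tokens routed to the expert.
saliency = [s / max(c, 1) for s, c in zip(score_sum, hit_count)]

# Prune the 30% of experts with the lowest saliency in this layer.
n_prune = int(0.30 * n_experts)
pruned = sorted(range(n_experts), key=lambda j: saliency[j])[:n_prune]
kept = [j for j in range(n_experts) if j not in pruned]
print(f"kept {len(kept)}/{n_experts} experts; pruned {sorted(pruned)}")
```

At the real scale this runs per layer over the calibration set, dropping 153 of 512 experts per layer; the router then renormalizes its weights over the 359 survivors.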
## Calibration Dataset

22,000 samples (no-refusal subset: 21,000), packed into 16,384-token sequences:
| Category | Samples | Source |
|---|---|---|
| Coding (general) | 4,096 | theblackcat102/evol-codealpaca-v1 |
| Reasoning (code) | ~2,680 | open-r1/Mixture-of-Thoughts[code] |
| Reasoning (math) | ~2,778 | open-r1/Mixture-of-Thoughts[math] |
| Reasoning (science) | ~2,776 | open-r1/Mixture-of-Thoughts[science] |
| Tool calling | 4,096 | Salesforce/xlam-function-calling-60k |
| Agentic coding | 4,096 | SWE-bench/SWE-smith-trajectories |
| + extended domains | ~1,478 | Scientific, CUDA kernels, browser, advanced math, code correctness |
Total tokens observed: ~90.5M across 6,391 packed sequences.
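The packing step above can be sketched as a greedy concatenation of tokenized samples into fixed-length sequences. This is a minimal sketch under assumed behavior (truncate overlong samples, start a new sequence when the next sample would not fit); the actual packing logic in the REAP pipeline may differ.

```python
def pack(sample_lens, seq_len=16_384):
    """Greedily pack tokenized-sample lengths into sequences of <= seq_len tokens.

    Returns the token count of each packed sequence.
    """
    sequences, current = [], 0
    for n in sample_lens:
        n = min(n, seq_len)            # truncate overlong samples
        if current + n > seq_len:      # next sample would overflow: flush
            sequences.append(current)
            current = 0
        current += n
    if current:
        sequences.append(current)
    return sequences

print(pack([6000, 9000, 4000, 16000, 2000]))  # [15000, 4000, 16000, 2000]
```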
## Pruning Configuration
| Parameter | Value |
|---|---|
| Compression ratio | 0.30 (30% expert removal) |
| Original experts per layer | 512 |
| Remaining experts per layer | 359 |
| Pruning method | REAP |
| Distance measure | Angular (cosine) |
| Router weight renormalization | Yes |
| Seed | 42 |
| Observation batch size | 8 |
| Calibration batches | 128 per category |
## Benchmark Results

10-task lm-eval suite, 200 samples per task, `tensor_parallel_size=4`, vLLM eager mode:
| Task | Metric | Original | REAP 0.30 | Delta |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 58.5% | 61.0% | +2.5 |
| BoolQ | acc | 93.0% | 90.0% | -3.0 |
| CommonsenseQA | acc | 89.0% | 85.5% | -3.5 |
| GSM8K | flexible_extract | 35.0% | 17.5% | -17.5 |
| HellaSwag | acc_norm | 72.0% | 63.5% | -8.5 |
| MathQA | acc_norm | 60.5% | 51.5% | -9.0 |
| OpenBookQA | acc_norm | 48.5% | 49.5% | +1.0 |
| PIQA | acc_norm | 80.0% | 79.0% | -1.0 |
| TruthfulQA MC2 | acc | 60.2% | 55.5% | -4.7 |
| WinoGrande | acc | 70.0% | 66.0% | -4.0 |
Aggregate:
- Overall average: 66.7% -> 61.9% (-4.8 pts)
- Reasoning average: 71.4% -> 68.8% (-2.6 pts)
- Math average: 47.8% -> 34.5% (-13.3 pts)
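The aggregates can be reproduced from the per-task table, assuming "math" covers GSM8K and MathQA and "reasoning" covers the remaining eight tasks (an inference that matches the reported values):

```python
# Per-task scores copied from the benchmark table above.
orig = {"arc": 58.5, "boolq": 93.0, "csqa": 89.0, "gsm8k": 35.0, "hellaswag": 72.0,
        "mathqa": 60.5, "obqa": 48.5, "piqa": 80.0, "truthfulqa": 60.2, "winogrande": 70.0}
reap = {"arc": 61.0, "boolq": 90.0, "csqa": 85.5, "gsm8k": 17.5, "hellaswag": 63.5,
        "mathqa": 51.5, "obqa": 49.5, "piqa": 79.0, "truthfulqa": 55.5, "winogrande": 66.0}

math_tasks = {"gsm8k", "mathqa"}
reasoning_tasks = set(orig) - math_tasks

def avg(scores, keys):
    return sum(scores[k] for k in keys) / len(keys)

# Matches the aggregates reported above: 66.7 -> 61.9 overall,
# 71.4 -> 68.8 reasoning, 47.8 -> 34.5 math.
print(f"overall:   {avg(orig, orig):.1f} -> {avg(reap, reap):.1f}")
print(f"reasoning: {avg(orig, reasoning_tasks):.1f} -> {avg(reap, reasoning_tasks):.1f}")
print(f"math:      {avg(orig, math_tasks):.1f} -> {avg(reap, math_tasks):.1f}")
```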
Note: GSM8K strict-match reports 0% for all variants due to an output formatting issue; flexible-extract scores are shown instead.
## Architecture
Qwen3-Coder-Next uses a hybrid linear/full attention architecture with 48 layers:
- Full attention every 4th layer (12 layers)
- Linear attention for remaining layers (36 layers)
- MoE FFN with 359 remaining experts per layer, 10 active per token
- Shared expert (intermediate size 512) in every layer
- Context window: 262,144 tokens
- Vocab size: 151,936
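The hybrid layout can be sanity-checked with a one-liner, assuming "every 4th layer" means layers 4, 8, 12, ... (1-indexed) use full attention; the exact placement within each block of four is an assumption:

```python
n_layers = 48

# Full attention every 4th layer, linear attention elsewhere (assumed indexing).
layer_types = ["full" if (i + 1) % 4 == 0 else "linear" for i in range(n_layers)]

print(layer_types.count("full"), layer_types.count("linear"))  # 12 36
```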
## Usage

### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/qwen3-coder-next-56b-REAP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a quicksort in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
### vLLM

```shell
vllm serve 0xSero/qwen3-coder-next-56b-REAP \
    --tensor-parallel-size 4 \
    --enforce-eager \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768
```
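Once the server is up, it exposes the standard OpenAI-compatible API. A minimal stdlib-only client sketch follows; the host, port, and sampling parameters are assumptions to adjust for your deployment.

```python
import json
import os
import urllib.request

# Assumed default endpoint for a local `vllm serve`; override via VLLM_BASE_URL.
BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")

payload = {
    "model": "0xSero/qwen3-coder-next-56b-REAP",
    "messages": [{"role": "user", "content": "Write a quicksort in Python."}],
    "max_tokens": 512,
    "temperature": 0.2,
}

def chat(body: dict) -> dict:
    """POST a chat-completion request to the vLLM OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only hit the network when a server address is explicitly configured.
if os.environ.get("VLLM_BASE_URL"):
    print(chat(payload)["choices"][0]["message"]["content"])
```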
## Reproducing

```shell
git clone https://github.com/cerebras/reap
cd reap
python -m reap.layerwise_prune \
    --model-name Qwen/Qwen3-Coder-Next \
    --dataset-name combined \
    --compression-ratio 0.30 \
    --prune-method reap \
    --seed 42 \
    --renormalize_router_weights true \
    --batch_size 8 \
    --batches_per_category 128
```
## Citation

```bibtex
@inproceedings{lasby2025reap,
  title={{REAP} the Experts: Why Pruning Prevails for One-Shot {MoE} Compression},
  author={Lasby, Mike and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.13999}
}
```
## Links

- REAP paper: https://arxiv.org/abs/2510.13999
- REAP code: https://github.com/cerebras/reap
- Cerebras REAP collection: https://huggingface.co/collections/cerebras/cerebras-reap
- Base model: Qwen/Qwen3-Coder-Next
- 20% pruned variant: 0xSero/qwen3-coder-next-64b-REAP