# Model Card for Qwen3-Coder-Next-3bit-g128

Quantized from Qwen/Qwen3-Coder-Next using mlx-lm to 3-bit, with group_size 128 for the main weights and a finer group_size of 64 for the MoE weights, aiming for maximum accuracy at 3-bit quantization.
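A mixed scheme like this can be produced with `mlx_lm.convert` and a quantization predicate. The sketch below is an assumption-laden illustration: it guesses that MoE expert weights can be recognized by `"experts"` or `"switch_mlp"` in the module path (the exact names depend on the Qwen3-Coder-Next implementation in mlx-lm), and the conversion call itself is wrapped in a function that is not invoked here, since running it downloads the full bf16 checkpoint.

```python
def mixed_group_size(path: str, module, config: dict):
    """Per-layer quantization settings: fine-grained g64 for MoE expert
    weights, g128 for everything else.

    Assumption: expert weights are identifiable by "experts" or
    "switch_mlp" in the module path; verify against the actual
    model code before converting.
    """
    if "experts" in path or "switch_mlp" in path:
        return {"bits": 3, "group_size": 64}
    return {"bits": 3, "group_size": 128}


def run_conversion():
    # Not called in this sketch: this downloads the full checkpoint.
    from mlx_lm import convert

    convert(
        "Qwen/Qwen3-Coder-Next",
        mlx_path="qwen3-coder-next-3bit-g128",
        quantize=True,
        q_bits=3,
        q_group_size=128,
        quant_predicate=mixed_group_size,
    )
```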

## Evaluation Results

Evaluated with `mlx_lm.evaluate` on MMLU-Pro (50 questions per topic), comparing the 3-bit g128 quant against the 3-bit g64 quant:

| Domain | g64 | g128 | Change |
|---|---|---|---|
| Math | 0.94 | 0.90 | -4% |
| Computer Science | 0.82 | 0.84 | +2% |
| Engineering | 0.70 | 0.64 | -6% |
| Physics | 0.94 | 0.92 | -2% |
| Chemistry | 0.86 | 0.88 | +2% |
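A run along these lines should reproduce the numbers above; the exact flags and task name are an assumption (mlx_lm's evaluate entry point delegates task definitions to lm-evaluation-harness), so check `mlx_lm.evaluate --help` before running.

```shell
# Hedged sketch of the evaluation command; verify flag names locally.
mlx_lm.evaluate \
    --model petergilani/qwen3-coder-next-3bit-g128 \
    --tasks mmlu_pro \
    --limit 50 \
    --output-dir results/
```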

### Average Performance

- 3-bit g64 (default): 85.2% average
- 3-bit g128: 83.6% average
- Difference: -1.6 percentage points for g128 (main weights) with fine-grained g64 (MoE weights)
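These averages follow directly from the five domain scores above; a quick check:

```python
# Per-domain MMLU-Pro accuracies from the table above.
scores = {
    "Math":             {"g64": 0.94, "g128": 0.90},
    "Computer Science": {"g64": 0.82, "g128": 0.84},
    "Engineering":      {"g64": 0.70, "g128": 0.64},
    "Physics":          {"g64": 0.94, "g128": 0.92},
    "Chemistry":        {"g64": 0.86, "g128": 0.88},
}

def average(variant: str) -> float:
    vals = [s[variant] for s in scores.values()]
    return sum(vals) / len(vals)

print(f"g64 average:  {average('g64'):.1%}")   # 85.2%
print(f"g128 average: {average('g128'):.1%}")  # 83.6%
```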

### Key Benefits

- With g128 on the main weights (and fine-grained g64 on the MoE weights), accuracy beats the plain g64 quant in the Computer Science and Chemistry domains (+2% each)
- Lower memory footprint than higher-bit quantizations (~38-45 GB)
- Consistent performance in the Math and Physics domains (90-94%)

## Model Details

- Base Model: Qwen/Qwen3-Coder-Next
- Parameters: 80B
- Library: mlx-lm
- Quantization: 3-bit with group_size 128 for main weights and group_size 64 for MoE weights
- License: apache-2.0
- Pipeline Tag: text-generation
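A back-of-the-envelope size estimate can be derived from the parameter count, assuming every weight is affine-quantized and each group stores one fp16 scale and one fp16 bias (MLX's quantization layout); real checkpoints are somewhat larger because of unquantized layers and metadata.

```python
def estimated_size_gb(n_params: float, bits: int, group_size: int,
                      scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Rough size of an affine-quantized model: quantized bits per weight
    plus the per-group scale/bias overhead amortized over the group."""
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

print(f"3-bit g128: ~{estimated_size_gb(80e9, 3, 128):.1f} GB")  # ~32.5 GB
print(f"3-bit g64:  ~{estimated_size_gb(80e9, 3, 64):.1f} GB")   # ~35.0 GB
```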

## Full Quantization Spectrum Comparison

| Domain | 3-bit default | 3-bit g128 | 4-bit default | 4-bit g128 | 6-bit default | 6-bit g128 | 8-bit default | 8-bit g128 |
|---|---|---|---|---|---|---|---|---|
| Math | 0.94 | 0.90 | 0.92 | 0.92 | 0.90 | 0.94 | 0.92 | 0.94 |
| Computer Science | 0.82 | 0.84 | 0.80 | 0.84 | 0.82 | 0.86 | 0.84 | 0.90 |
| Engineering | 0.70 | 0.64 | 0.70 | 0.76 | 0.74 | 0.72 | 0.80 | 0.80 |
| Physics | 0.94 | 0.92 | 0.94 | 0.96 | 0.96 | 0.94 | 0.96 | 0.96 |
| Chemistry | 0.86 | 0.88 | 0.90 | 0.90 | 0.94 | 0.92 | 0.90 | 0.94 |
| Average | 0.852 | 0.836 | 0.835 | 0.865 | 0.872 | 0.876 | 0.868 | 0.888 |

## Key Observations

### Quantization Level Impact

- 3-bit:
  - g64/default: 85.2% average
  - g128: 83.6% average (-1.6% vs g64/default)
- 4-bit:
  - g64/default: 83.5% average
  - g128: 86.5% average (+3.0% vs g64/default)
- 6-bit:
  - g64/default: 87.2% average
  - g128: 87.6% average (+0.4% vs g64/default)
- 8-bit:
  - g64/default: 86.8% average
  - g128: 88.8% average (+2.0% vs g64/default)

### Group Size Impact Across Quantization Levels

| Quantization | g128 vs baseline | Performance Trend |
|---|---|---|
| 3-bit | -1.6% | g128 underperforms default |
| 4-bit | +3.0% | g128 significantly outperforms default |
| 6-bit | +0.4% | g128 slightly outperforms default |
| 8-bit | +2.0% | g128 outperforms default |

## Usage

```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and tokenizer from the Hugging Face Hub
model_path = "petergilani/qwen3-coder-next-3bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampler with the recommended decoding settings
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```