# Model Card for Qwen3-Coder-Next-8bit-g128

Qwen/Qwen3-Coder-Next quantized with mlx-lm to 8 bits, using group_size 128 for main weights and a finer group_size 64 for MoE expert weights, aiming for maximum accuracy at 8-bit quantization.
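A mixed-group-size quantization like this can be sketched with mlx-lm's `convert` API via a `quant_predicate` that returns per-layer settings. The layer-name patterns below (`switch_mlp`, `experts`) are assumptions for illustration, not the exact predicate used to produce this model:

```python
# Sketch: mixed group-size quantization with mlx-lm.
# The MoE layer-name patterns here are assumptions; inspect the model's
# module paths to confirm them before converting.

def quant_predicate(path, module, config=None):
    """Return per-layer quantization settings for mlx_lm.convert.

    MoE expert weights get the finer group_size 64; all other weights
    use group_size 128. Both are quantized to 8 bits.
    """
    if "switch_mlp" in path or "experts" in path:  # assumed MoE naming
        return {"bits": 8, "group_size": 64}
    return {"bits": 8, "group_size": 128}

# The predicate would then be passed to mlx_lm.convert, roughly:
# from mlx_lm import convert
# convert("Qwen/Qwen3-Coder-Next",
#         mlx_path="qwen3-coder-next-8bit-g128",
#         quantize=True, quant_predicate=quant_predicate)
```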

## Updated Evaluation Results (February 13, 2026)

Comprehensive results from evaluation with `mlx_lm.evaluate` on mmlu_pro (200 questions per domain, num_shots=1, temp=1.0, top_p=0.95, top_k=40, seed=123):

### Direct Comparison Summary (8-bit g64 vs. g128)

| Domain | 8-bit g64 | 8-bit g128 (this model) | Difference |
|---|---|---|---|
| Computer Science | 85.0% | 87.5% | +2.5% |
| Math | 93.0% | 92.5% | -0.5% |
| Physics | 92.0% | 88.5% | -3.5% |
| Engineering | 71.0% | 68.5% | -2.5% |
| Chemistry | 87.5% | 88.5% | +1.0% |

### Average Performance

- 8-bit g64: 85.7% average across the five domains
- 8-bit g128 (this model): 85.1% average
- Difference: 0.6 percentage points (virtually identical)

### Key Observations

- Both 8-bit models show very close average performance (85.7% vs. 85.1%)
- Performance differences are domain-specific rather than model-specific
- 8-bit g64 performs better in Math, Physics, and Engineering
- 8-bit g128 (this model) performs better in Computer Science and Chemistry, with Computer Science showing the largest gain (+2.5%)
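The per-domain differences can be reproduced directly from the comparison table above:

```python
# Per-domain accuracy (%) from the direct comparison table (200 Qs/domain).
g64 = {"Computer Science": 85.0, "Math": 93.0, "Physics": 92.0,
       "Engineering": 71.0, "Chemistry": 87.5}
g128 = {"Computer Science": 87.5, "Math": 92.5, "Physics": 88.5,
        "Engineering": 68.5, "Chemistry": 88.5}

# Difference of g128 (this model) relative to g64, per domain.
diff = {d: round(g128[d] - g64[d], 1) for d in g64}
print(diff)  # {'Computer Science': 2.5, 'Math': -0.5, ...}
```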

### Memory Usage Comparison

- 8-bit g64:
  - Peak memory usage: 95.374 GB (Math) to 100.740 GB (Engineering)
  - Average memory usage: ~98.6 GB
- 8-bit g128 (this model):
  - Peak memory usage: 97.702 GB (Computer Science) to 98.510 GB (Engineering)
  - Average memory usage: ~97.8 GB

## Full Quantization Spectrum Comparison (Updated)

| Domain | 3-bit default | 3-bit g128 | 4-bit default | 4-bit g128 | 6-bit default | 6-bit g128 | 8-bit g64 | 8-bit g128 (this model) |
|---|---|---|---|---|---|---|---|---|
| Math | 0.94 | 0.90 | 0.92 | 0.92 | 0.90 | 0.94 | 0.930 | 0.925 |
| Computer Science | 0.82 | 0.84 | 0.80 | 0.84 | 0.82 | 0.86 | 0.850 | 0.875 |
| Engineering | 0.70 | 0.64 | 0.70 | 0.76 | 0.74 | 0.72 | 0.710 | 0.685 |
| Physics | 0.94 | 0.92 | 0.94 | 0.96 | 0.96 | 0.94 | 0.920 | 0.885 |
| Chemistry | 0.86 | 0.88 | 0.90 | 0.90 | 0.94 | 0.92 | 0.875 | 0.885 |
| Average | 0.852 | 0.836 | 0.852 | 0.876 | 0.872 | 0.876 | 0.857 | 0.851 |

## Original Evaluation Results

Testing with `mlx_lm.evaluate` on mmlu_pro with 50 questions per topic, comparing this 8-bit g128 quant against the 8-bit g64 quant:

| Domain | g64 | g128 | Improvement |
|---|---|---|---|
| Math | 0.92 | 0.94 | +2% |
| Computer Science | 0.84 | 0.90 | +6% |
| Engineering | 0.80 | 0.80 | = |
| Physics | 0.96 | 0.96 | = |
| Chemistry | 0.90 | 0.94 | +4% |

### Average Performance

- 8-bit g64: 88.4% average
- 8-bit g128: 90.8% average
- Improvement: +2.4 percentage points with group_size 128 for main weights and fine-grained group_size 64 for MoE weights

### Key Benefits

- With g128 on main weights (and fine-grained g64 on MoE weights), this quant is also ~2 GB smaller than the g64 version
- At the time of upload, no other 8-bit g128 quants of this model were available on HF
- MXFP8 was also tried, but gave equal or lower accuracy than g64 across all topics (Engineering dropped to 0.76 and Chemistry to 0.88, i.e. -4% and -2% vs. g64)
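The ~2 GB size difference is consistent with the per-group metadata overhead of group-wise quantization. Assuming each quantization group stores one fp16 scale and one fp16 bias (32 extra bits per group, as in MLX's affine quantization layout; treated as an assumption here), the effective bits per weight work out as follows:

```python
# Effective bits per weight for group-wise affine quantization,
# assuming 32 bits (fp16 scale + fp16 bias) of metadata per group.
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 32) -> float:
    return bits + meta_bits / group_size

bpw_g64 = bits_per_weight(8, 64)    # 8.5 bpw
bpw_g128 = bits_per_weight(8, 128)  # 8.25 bpw

# Upper bound on the savings if all 80B params moved from g64 to g128
# (this model keeps MoE weights at g64, so the actual savings are smaller).
params = 80e9
savings_gb = params * (bpw_g64 - bpw_g128) / 8 / 1e9
print(f"{savings_gb:.1f} GB")  # 2.5 GB upper bound
```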

## Model Details

- Base Model: Qwen/Qwen3-Coder-Next
- Parameters: 80B
- Library: mlx-lm
- Quantization: 8-bit, group_size 128 for main weights and group_size 64 for MoE weights
- License: apache-2.0
- Pipeline Tag: text-generation

## Usage

```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and its tokenizer.
model_path = "petergilani/qwen3-coder-next-8bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampler settings matching those used in the evaluations above.
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```