# Model Card for Qwen3-Coder-Next-3bit-g128

Quantized from Qwen/Qwen3-Coder-Next using mlx-lm to 3-bit, with group_size 128 for the main weights and a finer group_size of 64 for the MoE weights, aiming for maximum accuracy at 3-bit quantization.
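A mixed scheme like this can be produced with `mlx_lm.convert` and a quantization predicate. The sketch below is an assumption-laden illustration: it guesses that MoE expert weights can be recognized by `"experts"` or `"switch_mlp"` in the module path (the exact names depend on the Qwen3-Coder-Next implementation in mlx-lm), and the conversion call itself is wrapped in a function that is not invoked here, since running it downloads the full bf16 checkpoint.

```python
def mixed_group_size(path: str, module, config: dict):
    """Per-layer quantization settings: fine-grained g64 for MoE expert
    weights, g128 for everything else.

    Assumption: expert weights are identifiable by "experts" or
    "switch_mlp" in the module path; verify against the actual
    model code before converting.
    """
    if "experts" in path or "switch_mlp" in path:
        return {"bits": 3, "group_size": 64}
    return {"bits": 3, "group_size": 128}


def run_conversion():
    # Not called in this sketch: this downloads the full checkpoint.
    from mlx_lm import convert

    convert(
        "Qwen/Qwen3-Coder-Next",
        mlx_path="qwen3-coder-next-3bit-g128",
        quantize=True,
        q_bits=3,
        q_group_size=128,
        quant_predicate=mixed_group_size,
    )
```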

## Evaluation Results

Evaluated with `mlx_lm.evaluate` on MMLU-Pro (50 questions per topic), comparing the 3-bit g128 quant against the 3-bit g64 quant:

| Domain | g64 | g128 | Change |
|---|---|---|---|
| Math | 0.94 | 0.90 | -4% |
| Computer Science | 0.82 | 0.84 | +2% |
| Engineering | 0.70 | 0.64 | -6% |
| Physics | 0.94 | 0.92 | -2% |
| Chemistry | 0.86 | 0.88 | +2% |
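A run along these lines should reproduce the numbers above; the exact flags and task name are an assumption (mlx_lm's evaluate entry point delegates task definitions to lm-evaluation-harness), so check `mlx_lm.evaluate --help` before running.

```shell
# Hedged sketch of the evaluation command; verify flag names locally.
mlx_lm.evaluate \
    --model petergilani/qwen3-coder-next-3bit-g128 \
    --tasks mmlu_pro \
    --limit 50 \
    --output-dir results/
```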

### Average Performance

- 3-bit g64 (default): 85.2% average
- 3-bit g128: 83.6% average
- Difference: -1.6 percentage points for g128 (main weights) with fine-grained g64 (MoE weights)
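These averages follow directly from the five domain scores above; a quick check:

```python
# Per-domain MMLU-Pro accuracies from the table above.
scores = {
    "Math":             {"g64": 0.94, "g128": 0.90},
    "Computer Science": {"g64": 0.82, "g128": 0.84},
    "Engineering":      {"g64": 0.70, "g128": 0.64},
    "Physics":          {"g64": 0.94, "g128": 0.92},
    "Chemistry":        {"g64": 0.86, "g128": 0.88},
}

def average(variant: str) -> float:
    vals = [s[variant] for s in scores.values()]
    return sum(vals) / len(vals)

print(f"g64 average:  {average('g64'):.1%}")   # 85.2%
print(f"g128 average: {average('g128'):.1%}")  # 83.6%
```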

### Key Benefits

- With g128 on the main weights (and fine-grained g64 on the MoE weights), accuracy beats the plain g64 quant in the Computer Science and Chemistry domains (+2% each)
- Lower memory footprint than higher-bit quantizations (~38-45 GB)
- Consistent performance in the Math and Physics domains (90-94%)

## Model Details

- Base Model: Qwen/Qwen3-Coder-Next
- Parameters: 80B
- Library: mlx-lm
- Quantization: 3-bit with group_size 128 for main weights and group_size 64 for MoE weights
- License: apache-2.0
- Pipeline Tag: text-generation
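A back-of-the-envelope size estimate can be derived from the parameter count, assuming every weight is affine-quantized and each group stores one fp16 scale and one fp16 bias (MLX's quantization layout); real checkpoints are somewhat larger because of unquantized layers and metadata.

```python
def estimated_size_gb(n_params: float, bits: int, group_size: int,
                      scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Rough size of an affine-quantized model: quantized bits per weight
    plus the per-group scale/bias overhead amortized over the group."""
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

print(f"3-bit g128: ~{estimated_size_gb(80e9, 3, 128):.1f} GB")  # ~32.5 GB
print(f"3-bit g64:  ~{estimated_size_gb(80e9, 3, 64):.1f} GB")   # ~35.0 GB
```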

## Full Quantization Spectrum Comparison

| Domain | 3-bit default | 3-bit g128 | 4-bit default | 4-bit g128 | 6-bit default | 6-bit g128 | 8-bit default | 8-bit g128 |
|---|---|---|---|---|---|---|---|---|
| Math | 0.94 | 0.90 | 0.92 | 0.92 | 0.90 | 0.94 | 0.92 | 0.94 |
| Computer Science | 0.82 | 0.84 | 0.80 | 0.84 | 0.82 | 0.86 | 0.84 | 0.90 |
| Engineering | 0.70 | 0.64 | 0.70 | 0.76 | 0.74 | 0.72 | 0.80 | 0.80 |
| Physics | 0.94 | 0.92 | 0.94 | 0.96 | 0.96 | 0.94 | 0.96 | 0.96 |
| Chemistry | 0.86 | 0.88 | 0.90 | 0.90 | 0.94 | 0.92 | 0.90 | 0.94 |
| Average | 0.852 | 0.836 | 0.835 | 0.865 | 0.872 | 0.876 | 0.868 | 0.888 |

## Key Observations

### Quantization Level Impact

- 3-bit:
  - g64/default: 85.2% average
  - g128: 83.6% average (-1.6% vs g64/default)
- 4-bit:
  - g64/default: 83.5% average
  - g128: 86.5% average (+3.0% vs g64/default)
- 6-bit:
  - g64/default: 87.2% average
  - g128: 87.6% average (+0.4% vs g64/default)
- 8-bit:
  - g64/default: 86.8% average
  - g128: 88.8% average (+2.0% vs g64/default)

### Group Size Impact Across Quantization Levels

| Quantization | g128 vs baseline | Performance Trend |
|---|---|---|
| 3-bit | -1.6% | g128 underperforms default |
| 4-bit | +3.0% | g128 significantly outperforms default |
| 6-bit | +0.4% | g128 slightly outperforms default |
| 8-bit | +2.0% | g128 outperforms default |

## Usage

```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and tokenizer from the Hugging Face Hub
model_path = "petergilani/qwen3-coder-next-3bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampler with the recommended decoding settings
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```