# Model Card for Qwen3-Coder-Next-8bit-g128

Qwen/Qwen3-Coder-Next quantized with mlx-lm to 8 bits, using group_size 128 for main weights and a finer group_size 64 for MoE expert weights, aiming for maximum accuracy at 8-bit quantization.
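A mixed-group-size quantization like this can be sketched with mlx-lm's `convert` API via a `quant_predicate` that returns per-layer settings. The layer-name patterns below (`switch_mlp`, `experts`) are assumptions for illustration, not the exact predicate used to produce this model:

```python
# Sketch: mixed group-size quantization with mlx-lm.
# The MoE layer-name patterns here are assumptions; inspect the model's
# module paths to confirm them before converting.

def quant_predicate(path, module, config=None):
    """Return per-layer quantization settings for mlx_lm.convert.

    MoE expert weights get the finer group_size 64; all other weights
    use group_size 128. Both are quantized to 8 bits.
    """
    if "switch_mlp" in path or "experts" in path:  # assumed MoE naming
        return {"bits": 8, "group_size": 64}
    return {"bits": 8, "group_size": 128}

# The predicate would then be passed to mlx_lm.convert, roughly:
# from mlx_lm import convert
# convert("Qwen/Qwen3-Coder-Next",
#         mlx_path="qwen3-coder-next-8bit-g128",
#         quantize=True, quant_predicate=quant_predicate)
```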

## Updated Evaluation Results (February 13, 2026)

Comprehensive results from evaluation with `mlx_lm.evaluate` on mmlu_pro (200 questions per domain, num_shots=1, temp=1.0, top_p=0.95, top_k=40, seed=123):

### Direct Comparison Summary (8-bit g64 vs. g128)

| Domain | 8-bit g64 | 8-bit g128 (this model) | Difference |
|---|---|---|---|
| Computer Science | 85.0% | 87.5% | +2.5% |
| Math | 93.0% | 92.5% | -0.5% |
| Physics | 92.0% | 88.5% | -3.5% |
| Engineering | 71.0% | 68.5% | -2.5% |
| Chemistry | 87.5% | 88.5% | +1.0% |

### Average Performance

- 8-bit g64: 85.7% average across the five domains
- 8-bit g128 (this model): 85.1% average
- Difference: 0.6 percentage points (virtually identical)

### Key Observations

- Both 8-bit models show very close average performance (85.7% vs. 85.1%)
- Performance differences are domain-specific rather than model-specific
- 8-bit g64 performs better in Math, Physics, and Engineering
- 8-bit g128 (this model) performs better in Computer Science and Chemistry, with Computer Science showing the largest gain (+2.5%)
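The per-domain differences can be reproduced directly from the comparison table above:

```python
# Per-domain accuracy (%) from the direct comparison table (200 Qs/domain).
g64 = {"Computer Science": 85.0, "Math": 93.0, "Physics": 92.0,
       "Engineering": 71.0, "Chemistry": 87.5}
g128 = {"Computer Science": 87.5, "Math": 92.5, "Physics": 88.5,
        "Engineering": 68.5, "Chemistry": 88.5}

# Difference of g128 (this model) relative to g64, per domain.
diff = {d: round(g128[d] - g64[d], 1) for d in g64}
print(diff)  # {'Computer Science': 2.5, 'Math': -0.5, ...}
```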

### Memory Usage Comparison

- 8-bit g64:
  - Peak memory usage: 95.374 GB (Math) to 100.740 GB (Engineering)
  - Average memory usage: ~98.6 GB
- 8-bit g128 (this model):
  - Peak memory usage: 97.702 GB (Computer Science) to 98.510 GB (Engineering)
  - Average memory usage: ~97.8 GB

## Full Quantization Spectrum Comparison (Updated)

| Domain | 3-bit default | 3-bit g128 | 4-bit default | 4-bit g128 | 6-bit default | 6-bit g128 | 8-bit g64 | 8-bit g128 (this model) |
|---|---|---|---|---|---|---|---|---|
| Math | 0.94 | 0.90 | 0.92 | 0.92 | 0.90 | 0.94 | 0.930 | 0.925 |
| Computer Science | 0.82 | 0.84 | 0.80 | 0.84 | 0.82 | 0.86 | 0.850 | 0.875 |
| Engineering | 0.70 | 0.64 | 0.70 | 0.76 | 0.74 | 0.72 | 0.710 | 0.685 |
| Physics | 0.94 | 0.92 | 0.94 | 0.96 | 0.96 | 0.94 | 0.920 | 0.885 |
| Chemistry | 0.86 | 0.88 | 0.90 | 0.90 | 0.94 | 0.92 | 0.875 | 0.885 |
| Average | 0.852 | 0.836 | 0.852 | 0.876 | 0.872 | 0.876 | 0.857 | 0.851 |

## Original Evaluation Results

Testing with `mlx_lm.evaluate` on mmlu_pro with 50 questions per topic, comparing this 8-bit g128 quant against the 8-bit g64 quant:

| Domain | g64 | g128 | Improvement |
|---|---|---|---|
| Math | 0.92 | 0.94 | +2% |
| Computer Science | 0.84 | 0.90 | +6% |
| Engineering | 0.80 | 0.80 | = |
| Physics | 0.96 | 0.96 | = |
| Chemistry | 0.90 | 0.94 | +4% |

### Average Performance

- 8-bit g64: 88.4% average
- 8-bit g128: 90.8% average
- Improvement: +2.4 percentage points with group_size 128 for main weights and fine-grained group_size 64 for MoE weights

### Key Benefits

- With g128 on main weights (and fine-grained g64 on MoE weights), this quant is also ~2 GB smaller than the g64 version
- At the time of upload, no other 8-bit g128 quants of this model were available on HF
- MXFP8 was also tried, but gave equal or lower accuracy than g64 across all topics (Engineering dropped to 0.76 and Chemistry to 0.88, i.e. -4% and -2% vs. g64)
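The ~2 GB size difference is consistent with the per-group metadata overhead of group-wise quantization. Assuming each quantization group stores one fp16 scale and one fp16 bias (32 extra bits per group, as in MLX's affine quantization layout; treated as an assumption here), the effective bits per weight work out as follows:

```python
# Effective bits per weight for group-wise affine quantization,
# assuming 32 bits (fp16 scale + fp16 bias) of metadata per group.
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 32) -> float:
    return bits + meta_bits / group_size

bpw_g64 = bits_per_weight(8, 64)    # 8.5 bpw
bpw_g128 = bits_per_weight(8, 128)  # 8.25 bpw

# Upper bound on the savings if all 80B params moved from g64 to g128
# (this model keeps MoE weights at g64, so the actual savings are smaller).
params = 80e9
savings_gb = params * (bpw_g64 - bpw_g128) / 8 / 1e9
print(f"{savings_gb:.1f} GB")  # 2.5 GB upper bound
```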

## Model Details

- Base Model: Qwen/Qwen3-Coder-Next
- Parameters: 80B
- Library: mlx-lm
- Quantization: 8-bit, group_size 128 for main weights and group_size 64 for MoE weights
- License: apache-2.0
- Pipeline Tag: text-generation

## Usage

```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and its tokenizer.
model_path = "petergilani/qwen3-coder-next-8bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampler settings matching those used in the evaluations above.
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```