Model Card for petergilani/Qwen3-Coder-Next-4bit-g128

Quantized Qwen/Qwen3-Coder-Next using mlx-lm to 4-bit with group_size 128 for main weights, with the aim of maximum efficiency for 4-bit quantization.
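This card does not list the exact conversion command; a quant with these settings can typically be reproduced with mlx-lm's standard convert entry point (the output path below is illustrative):

```shell
# Quantize the base model to 4-bit with group size 128 using mlx-lm
mlx_lm.convert \
    --hf-path Qwen/Qwen3-Coder-Next \
    --mlx-path Qwen3-Coder-Next-4bit-g128 \
    -q --q-bits 4 --q-group-size 128
```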

Updated Evaluation Results (February 13, 2026)

Evaluation results from mlx_lm.evaluate on mmlu_pro (200 questions per domain; num_shots=1, temp=1.0, top_p=0.95, top_k=40, seed=123):

Direct Comparison Summary (4-bit g64 vs g128)

| Domain           | 4-bit g64 | 4-bit g128 (this model) | Difference |
|------------------|-----------|-------------------------|------------|
| Computer Science | 85.5%     | 85.5%                   | 0.0%       |
| Math             | 92.0%     | 91.5%                   | -0.5%      |
| Physics          | 88.0%     | 91.0%                   | +3.0%      |
| Engineering      | 69.0%     | 71.0%                   | +2.0%      |
| Chemistry        | 86.0%     | 86.0%                   | 0.0%       |

Average Performance

  • 4-bit g64: 84.1% average
  • 4-bit g128 (this model): 84.2% average
  • Difference: 0.1 percentage points (virtually identical)

Key Observations

  • Both 4-bit models show nearly identical average performance (84.1-84.2%)
  • Performance differences are domain-specific rather than model-specific
  • 4-bit g128 (this model) performs better in Physics and Engineering
  • 4-bit g64 performs slightly better in Math and Chemistry

Memory Usage Comparison

  • 4-bit g64:
    • Peak memory usage: 56.355 GB (Math) to 61.010 GB (Physics)
    • Average memory usage: ~60.0 GB
  • 4-bit g128 (this model):
    • Peak memory usage: 51.989 GB (Math) to 59.084 GB (Engineering)
    • Average memory usage: ~57.3 GB
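The g64 vs g128 gap above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes MLX's affine quantization stores one fp16 scale and one fp16 bias per group (32 bits of overhead per group); the measured ~2.7 GB gap also includes runtime buffers, so an exact match isn't expected.

```python
# Sanity check of the g64 vs g128 memory gap (a sketch; assumes MLX affine
# quantization stores one fp16 scale and one fp16 bias per group of weights).
def bits_per_weight(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Payload bits per weight plus the per-group overhead amortized over the group."""
    return bits + overhead_bits / group_size

PARAMS = 80e9  # 80B parameters (from this card)

g64 = bits_per_weight(4, 64)    # 4.5 bits/weight
g128 = bits_per_weight(4, 128)  # 4.25 bits/weight

saved_gb = PARAMS * (g64 - g128) / 8 / 1e9
print(f"predicted saving: {saved_gb:.1f} GB")  # predicted saving: 2.5 GB
```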

Full Quantization Spectrum Comparison (Updated)

| Domain           | 3-bit default | 3-bit g128 | 4-bit g64 | 4-bit g128 (this model) | 6-bit default | 6-bit g128 | 8-bit g64 | 8-bit g128 |
|------------------|---------------|------------|-----------|-------------------------|---------------|------------|-----------|------------|
| Math             | 0.94          | 0.90       | 0.920     | 0.915                   | 0.90          | 0.94       | 0.930     | 0.925      |
| Computer Science | 0.82          | 0.84       | 0.855     | 0.855                   | 0.82          | 0.86       | 0.850     | 0.875      |
| Engineering      | 0.70          | 0.64       | 0.690     | 0.710                   | 0.74          | 0.72       | 0.710     | 0.685      |
| Physics          | 0.94          | 0.92       | 0.880     | 0.910                   | 0.96          | 0.94       | 0.920     | 0.885      |
| Chemistry        | 0.86          | 0.88       | 0.860     | 0.860                   | 0.94          | 0.92       | 0.875     | 0.885      |
| Average          | 0.852         | 0.836      | 0.841     | 0.842                   | 0.872         | 0.876      | 0.870     | 0.870      |

Original Evaluation Results

Initial testing with mlx_lm.evaluate on mmlu_pro (50 questions per topic), comparing this 4-bit g128 quant against the 4-bit g64 quant:

| Domain           | g64  | g128 | Improvement |
|------------------|------|------|-------------|
| Math             | 0.92 | 0.92 | 0%          |
| Computer Science | 0.84 | 0.84 | 0%          |
| Engineering      | 0.70 | 0.76 | +6%         |
| Physics          | 0.94 | 0.96 | +2%         |
| Chemistry        | 0.90 | 0.90 | 0%          |

Average Performance

  • 4-bit g64: 84.4% average
  • 4-bit g128: 86.0% average
  • Improvement: +1.6 percentage points with group size 128

Key Benefits

  • The g128 group size on main weights provides modest memory savings (~2.7 GB less on average than g64)
  • Performance differences between the two group sizes are minimal at 4-bit
  • 4-bit quantization itself is highly memory-efficient (~40% less than the 8-bit variants)
  • Lower memory bandwidth and power draw give favorable thermal behavior compared to higher-bit quantizations, making this model suitable for thermally constrained setups
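The ~40% figure can be framed with the same weight-storage arithmetic: on weights alone, 4-bit g128 is about half the size of 8-bit g64 (assuming the same per-group fp16 scale/bias overhead); unquantized layers and fixed runtime costs shrink the observed end-to-end savings toward the ~40% quoted above.

```python
# Weight-storage comparison: 4-bit g128 vs 8-bit g64 (a sketch; assumes
# 32 bits of fp16 scale+bias overhead per quantization group).
def bits_per_weight(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    return bits + overhead_bits / group_size

four_bit = bits_per_weight(4, 128)   # 4.25 bits/weight
eight_bit = bits_per_weight(8, 64)   # 8.5 bits/weight
reduction = 1 - four_bit / eight_bit
print(f"{reduction:.0%} smaller on weights alone")  # 50% smaller on weights alone
```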

Model Details

  • Base Model: Qwen/Qwen3-Coder-Next
  • Library: mlx-lm
  • Quantization: 4-bit with group_size 128 for main weights
  • Model Size: 80B parameters (BF16/U32 tensors in safetensors)
  • License: apache-2.0
  • Pipeline Tag: text-generation

Usage

```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and its tokenizer from the Hugging Face Hub
model_path = "petergilani/Qwen3-Coder-Next-4bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampler matching the settings used in the evaluations above
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```
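For instruction-style or multi-turn use, Qwen-family models expect ChatML-formatted prompts. With mlx-lm you would normally call tokenizer.apply_chat_template(messages, add_generation_prompt=True) and pass the result as prompt; the helper below only illustrates the shape of that template (an assumption based on Qwen's published ChatML format, not read from this repo's tokenizer config).

```python
# Illustrative ChatML prompt builder (assumed Qwen-style template; prefer
# tokenizer.apply_chat_template in real use).
def build_chatml_prompt(messages: list[dict]) -> str:
    """Render messages as ChatML and open an assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "user", "content": "Write a Python function to calculate the factorial of a number."},
])
```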