# Model Card for Qwen3-Coder-Next-8bit-g128
Qwen/Qwen3-Coder-Next quantized with mlx-lm to 8 bits, using group_size 128 for the main weights and fine-grained group_size 64 for the MoE weights, with the aim of maximum accuracy at 8-bit quantization.
## Updated Evaluation Results (February 13, 2026)
Evaluation results from testing with `mlx_lm.evaluate` on `mmlu_pro` (200 questions per domain; `num_shots=1`, `temp=1.0`, `top_p=0.95`, `top_k=40`, `seed=123`):
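A run with the settings above could be invoked roughly as follows. This is a sketch: the `--limit` flag for capping questions per domain is an assumption, and sampling parameters may need to be passed differently; verify against `mlx_lm.evaluate --help` for your mlx-lm version.

```shell
# Hedged sketch of the evaluation invocation described above.
mlx_lm.evaluate \
  --model petergilani/qwen3-coder-next-8bit-g128 \
  --tasks mmlu_pro \
  --num-shots 1 \
  --seed 123 \
  --limit 200
```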
### Direct Comparison Summary (8-bit g64 vs g128)
| Domain | 8-bit g64 | 8-bit g128 (this model) | Difference |
|---|---|---|---|
| Computer Science | 85.0% | 87.5% | +2.5% |
| Math | 93.0% | 92.5% | -0.5% |
| Physics | 92.0% | 88.5% | -3.5% |
| Engineering | 71.0% | 68.5% | -2.5% |
| Chemistry | 87.5% | 88.5% | +1.0% |
### Average Performance
- 8-bit g64: 87.0% average
- 8-bit g128 (this model): 87.0% average
- Difference: 0.0 percentage points (virtually identical)
### Key Observations
- Both 8-bit models show nearly identical average performance (87.0%)
- Performance differences are domain-specific rather than model-specific
- 8-bit g64 performs better in Math, Engineering, and Physics
- 8-bit g128 (this model) performs better in Computer Science and Chemistry, with Computer Science being the most significant improvement (+2.5%)
### Memory Usage Comparison
**8-bit g64:**
- Peak memory usage: 95.374 GB (Math) to 100.740 GB (Engineering)
- Average memory usage: ~98.6 GB
**8-bit g128 (this model):**
- Peak memory usage: 97.702 GB (Computer Science) to 98.510 GB (Engineering)
- Average memory usage: ~97.8 GB
## Full Quantization Spectrum Comparison (Updated)
| Domain | 3-bit default | 3-bit g128 | 4-bit default | 4-bit g128 | 6-bit default | 6-bit g128 | 8-bit g64 | 8-bit g128 (this model) |
|---|---|---|---|---|---|---|---|---|
| Math | 0.94 | 0.90 | 0.92 | 0.92 | 0.90 | 0.94 | 0.930 | 0.925 |
| Computer Science | 0.82 | 0.84 | 0.80 | 0.84 | 0.82 | 0.86 | 0.850 | 0.875 |
| Engineering | 0.70 | 0.64 | 0.70 | 0.76 | 0.74 | 0.72 | 0.710 | 0.685 |
| Physics | 0.94 | 0.92 | 0.94 | 0.96 | 0.96 | 0.94 | 0.920 | 0.885 |
| Chemistry | 0.86 | 0.88 | 0.90 | 0.90 | 0.94 | 0.92 | 0.875 | 0.885 |
| Average | 0.852 | 0.836 | 0.835 | 0.865 | 0.872 | 0.876 | 0.870 | 0.870 |
## Original Evaluation Results
Testing with `mlx_lm.evaluate` on `mmlu_pro` with 50 questions per topic, comparing this 8-bit g128 quant against the 8-bit g64 quant:
| Domain | g64 | g128 | Improvement |
|---|---|---|---|
| Math | 0.92 | 0.94 | +2% |
| Computer Science | 0.84 | 0.90 | +6% |
| Engineering | 0.80 | 0.80 | = |
| Physics | 0.96 | 0.96 | = |
| Chemistry | 0.90 | 0.94 | +4% |
### Average Performance
- 8-bit g64: 84.4% average
- 8-bit g128: 88.0% average
- Improvement: +3.6 percentage points with group_size 128 for the main weights and fine-grained group_size 64 for the MoE weights
### Key Benefits
- The g128 group_size applied to the main weights (with fine-grained g64 for the MoE weights) also makes this quant ~2 GB smaller than the g64 version
- At the time of uploading, no 8-bit g128 quants of this model were available on Hugging Face
- MXFP8 was also tried, but it gave equal or lower accuracy than g64 across all topics (Engineering dropped to 0.76 and Chemistry to 0.88, i.e. -4% and -2% vs. g64)
## Model Details
- Base Model: Qwen/Qwen3-Coder-Next
- Library: mlx-lm
- Quantization: 8-bit with group_size 128 for main weights and group_size 64 for MoE weights
- License: apache-2.0
- Pipeline Tag: text-generation
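The mixed group-size scheme above can be expressed with mlx-lm's per-layer quantization predicate, which `mlx_lm.convert` accepts to override the default settings layer by layer. The sketch below is illustrative, not the exact recipe used for this upload: the `"switch_mlp"`/`"experts"` path substrings are assumptions about how Qwen3's MoE expert weights are named, and the predicate's exact signature should be checked against your mlx-lm version.

```python
# Sketch of a per-layer quantization predicate for mlx_lm.convert:
# 8-bit everywhere, but fine-grained group_size 64 for MoE expert
# weights and group_size 128 for all other (main) weights.
# ASSUMPTION: MoE expert modules have "switch_mlp" or "experts" in
# their path; verify against the actual Qwen3-Coder-Next module names.

def quant_predicate(path: str, module=None, config=None):
    """Return per-layer quantization overrides for a module path."""
    if "switch_mlp" in path or "experts" in path:
        return {"bits": 8, "group_size": 64}   # fine-grained MoE weights
    return {"bits": 8, "group_size": 128}      # main weights
```

A predicate like this would be passed as `quant_predicate=quant_predicate` when calling `mlx_lm.convert(...)` with `quantize=True`.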
## Usage
```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

model_path = "petergilani/qwen3-coder-next-8bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampling settings matching those used in the evaluations above
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```