# Model Card for petergilani/Qwen3-Coder-Next-4bit-g128
A 4-bit quantization of Qwen/Qwen3-Coder-Next produced with mlx-lm, using group_size 128 for the main weights, with the aim of maximizing memory efficiency at 4-bit.
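This quantization can be reproduced with mlx-lm's convert utility; a minimal sketch, where the output directory name is just illustrative:

```python
from mlx_lm import convert

# Quantize the base model to 4-bit, grouping the main weights in blocks of 128
convert(
    "Qwen/Qwen3-Coder-Next",
    mlx_path="Qwen3-Coder-Next-4bit-g128",  # illustrative output directory
    quantize=True,
    q_bits=4,
    q_group_size=128,
)
```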
## Updated Evaluation Results (February 13, 2026)
Evaluation results from mlx_lm.evaluate on mmlu_pro (200 questions per domain; num_shots=1, temp=1.0, top_p=0.95, top_k=40, seed=123). A sketch of the invocation is shown below; the results follow.
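The runs can be reproduced roughly as follows; this uses the mlx_lm.evaluate CLI named above, but the exact task name and flags are assumptions to verify against your installed mlx-lm version:

```python
import subprocess

# Approximate reproduction of the evaluation setup described above.
# Flag names follow the mlx_lm.evaluate CLI; verify against your version.
subprocess.run(
    [
        "mlx_lm.evaluate",
        "--model", "petergilani/Qwen3-Coder-Next-4bit-g128",
        "--tasks", "mmlu_pro",  # per-domain scores are reported by subtask
        "--limit", "200",       # questions per domain
        "--num-shots", "1",
        "--seed", "123",
    ],
    check=True,
)
```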
### Direct Comparison Summary (4-bit g64 vs g128)
| Domain | 4-bit g64 | 4-bit g128 (this model) | Difference |
|---|---|---|---|
| Computer Science | 85.5% | 85.5% | 0.0% |
| Math | 92.0% | 91.5% | -0.5% |
| Physics | 88.0% | 91.0% | +3.0% |
| Engineering | 69.0% | 71.0% | +2.0% |
| Chemistry | 86.0% | 86.0% | 0.0% |
### Average Performance
- 4-bit g64: 84.1% average
- 4-bit g128 (this model): 85.0% average
- Difference: +0.9 percentage points in favor of g128
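These averages are plain means over the five listed domains, which is easy to sanity-check:

```python
# Per-domain scores from the comparison table above (in %)
g64 = [85.5, 92.0, 88.0, 69.0, 86.0]
g128 = [85.5, 91.5, 91.0, 71.0, 86.0]

print(sum(g64) / len(g64))    # 84.1
print(sum(g128) / len(g128))  # 85.0
```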
### Key Observations
- Both 4-bit models show similar average performance (84.1% vs 85.0% over the five domains)
- Differences are domain-specific; neither model dominates across the board
- 4-bit g128 (this model) performs better in Physics and Engineering
- 4-bit g64 performs slightly better in Math; Computer Science and Chemistry are tied
### Memory Usage Comparison

**4-bit g64:**
- Peak memory usage: 56.355 GB (Math) to 61.010 GB (Physics)
- Average memory usage: ~60.0 GB

**4-bit g128 (this model):**
- Peak memory usage: 51.989 GB (Math) to 59.084 GB (Engineering)
- Average memory usage: ~57.3 GB
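Peak memory during generation can be checked directly with MLX; a minimal sketch (note the accessor has lived at mx.metal.get_peak_memory in older MLX releases and mx.get_peak_memory in newer ones):

```python
import mlx.core as mx
import mlx_lm
from mlx_lm.sample_utils import make_sampler

model, tokenizer = mlx_lm.load("petergilani/Qwen3-Coder-Next-4bit-g128")
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

# Run a short generation so weights and KV cache are actually materialized
mlx_lm.generate(model, tokenizer, prompt="Hello", sampler=sampler, max_tokens=64)

# Peak memory in GB; use mx.metal.get_peak_memory() on older MLX versions
print(f"peak: {mx.get_peak_memory() / 1e9:.3f} GB")
```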
## Full Quantization Spectrum Comparison (Updated)
| Domain | 3-bit default | 3-bit g128 | 4-bit g64 | 4-bit g128 (this model) | 6-bit default | 6-bit g128 | 8-bit g64 | 8-bit g128 |
|---|---|---|---|---|---|---|---|---|
| Math | 0.94 | 0.90 | 0.920 | 0.915 | 0.90 | 0.94 | 0.930 | 0.925 |
| Computer Science | 0.82 | 0.84 | 0.855 | 0.855 | 0.82 | 0.86 | 0.850 | 0.875 |
| Engineering | 0.70 | 0.64 | 0.690 | 0.710 | 0.74 | 0.72 | 0.710 | 0.685 |
| Physics | 0.94 | 0.92 | 0.880 | 0.910 | 0.96 | 0.94 | 0.920 | 0.885 |
| Chemistry | 0.86 | 0.88 | 0.860 | 0.860 | 0.94 | 0.92 | 0.875 | 0.885 |
| Average | 0.852 | 0.836 | 0.841 | 0.850 | 0.872 | 0.876 | 0.857 | 0.851 |
## Original Evaluation Results
Initial testing with mlx_lm.evaluate on mmlu_pro (50 questions per domain), comparing this 4-bit g128 quant against the 4-bit g64 quant:
| Domain | 4-bit g64 | 4-bit g128 (this model) | Difference |
|---|---|---|---|
| Math | 92.0% | 92.0% | 0.0% |
| Computer Science | 84.0% | 84.0% | 0.0% |
| Engineering | 70.0% | 76.0% | +6.0% |
| Physics | 94.0% | 96.0% | +2.0% |
| Chemistry | 90.0% | 90.0% | 0.0% |
### Average Performance
- 4-bit g64: 86.0% average
- 4-bit g128: 87.6% average
- Improvement: +1.6 percentage points with group size 128
### Key Benefits
- The g128 group_size on the main weights gives modest memory savings over g64 (~2.7 GB less on average in the runs above)
- Performance differences between the two group sizes are minimal at 4-bit
- 4-bit quantization is markedly more memory-efficient than 8-bit quantization (~40% lower usage)
- Lower memory traffic per token also means less sustained power draw and heat, making this quant well suited to thermally constrained machines
## Model Details
- Base Model: Qwen/Qwen3-Coder-Next
- Library: mlx-lm
- Quantization: 4-bit with group_size 128 for main weights
- License: apache-2.0
- Pipeline Tag: text-generation
## Usage
```python
import mlx_lm
from mlx_lm.sample_utils import make_sampler

# Load the quantized model and tokenizer from the Hub (or a local path)
model_path = "petergilani/Qwen3-Coder-Next-4bit-g128"
model, tokenizer = mlx_lm.load(model_path)

# Sampler matching the settings used in the evaluations above
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)

prompt = "Write a Python function to calculate the factorial of a number."
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    max_tokens=512,
)
print(response)
```
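For interactive use, mlx-lm also supports token-by-token streaming via stream_generate; a minimal sketch reusing the model, tokenizer, sampler, and prompt from above (response attribute names may vary slightly between mlx-lm versions):

```python
from mlx_lm import stream_generate

# Print tokens as they are produced instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt=prompt, sampler=sampler, max_tokens=512):
    print(chunk.text, end="", flush=True)
print()
```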