# Qwen2.5-Coder-14B - MLX mxfp4 Quantized
- Repository: johnlockejrr/qwen2.5-coder-14b-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-Coder-14B
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
## Model Summary
This repository contains an MLX-quantized version of Qwen2.5-Coder-14B, optimized for Apple Silicon (M1/M2/M3/M4) devices. The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 30 GB (FP16) to approximately 10.6 GB while maintaining strong coding performance.
This quantized model is suitable for:
- local coding assistants
- offline development workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 16 GB or 24 GB Apple Silicon machines
## Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:

```bash
python3 quantize.py \
  --model_name Qwen/Qwen2.5-Coder-14B \
  --save_model_path models/qwen2.5-coder-14b-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64
```
Resulting model size: approximately 10.6 GB
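As a rough sanity check on the size, group quantization stores a shared scale per group of weights on top of the 4-bit values. The sketch below is a back-of-envelope lower bound, not taken from the repo: it assumes roughly 14.7B parameters, one 8-bit shared scale per group of 64, and every weight quantized. Real checkpoints keep some tensors (embeddings, norms) in higher precision, which is why the actual ~10.6 GB exceeds this naive estimate.

```python
# Naive size estimate for a group-quantized model.
# Assumptions (illustrative, not from the repo): one 8-bit shared
# scale per group; all weights quantized.

def quantized_size_gb(n_params, q_bits=4, group_size=64, scale_bits=8):
    bits_per_weight = q_bits + scale_bits / group_size  # 4.125 bits here
    return n_params * bits_per_weight / 8 / 1e9

print(f"{quantized_size_gb(14.7e9):.1f} GB")  # → 7.6 GB (lower bound)
```

The gap between this lower bound and the shipped 10.6 GB is expected: unquantized tensors and checkpoint metadata add several gigabytes.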
## Running the Model (MLX)
### CLI (mlx-lm)
```bash
mlx_lm.generate \
  --model johnlockejrr/qwen2.5-coder-14b-mxfp4 \
  --prompt "Write a Python function to compute Fibonacci numbers."
```
### Python API
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-coder-14b-mxfp4")

prompt = "Write a Python function to compute Fibonacci numbers."
output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
```
### Chat Mode
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-coder-14b-mxfp4")

messages = [
    {"role": "user", "content": "Explain what a binary search tree is."}
]

# Render the messages with the model's chat template, then generate.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt, max_tokens=200)
print(response)
```
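For reference, Qwen2.5 models use a ChatML-style chat template (shipped in `chat_template.jinja`). The sketch below is illustrative only: it hand-builds the same shape of prompt the template produces, omitting the default system message the real template also injects.

```python
# Illustrative only: a minimal ChatML-style prompt of the kind
# Qwen2.5's chat template renders. The shipped chat_template.jinja
# additionally prepends a default system message.

def to_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    # A trailing assistant header cues the model to start its reply.
    return "".join(parts) + "<|im_start|>assistant\n"

messages = [{"role": "user", "content": "Explain what a binary search tree is."}]
print(to_chatml(messages))
```

In practice you should always use the tokenizer's own chat template rather than hand-formatting prompts, since the template is the source of truth for special tokens.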
## Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 9-12 tokens/sec |
| Peak memory usage | approximately 10.7 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
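To turn the throughput numbers above into expected wall-clock time, a quick sketch (assuming decode-bound generation; prompt processing adds extra time on top):

```python
# Wall-clock estimate from measured decode throughput.
def gen_seconds(n_tokens, tok_per_sec):
    return n_tokens / tok_per_sec

# At 9-12 tokens/sec, a 200-token completion takes roughly:
print(f"{gen_seconds(200, 12):.0f}-{gen_seconds(200, 9):.0f} s")  # → 17-22 s
```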
## Repository Contents

```
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
```
## License
This model inherits the license of the original Qwen2.5-Coder-14B model: https://huggingface.co/Qwen/Qwen2.5-Coder-14B#license
Please review the license before using this model in commercial applications.
## Limitations and Bias
- The model may generate incorrect or insecure code.
- It may hallucinate APIs or functions.
- It may produce biased or harmful statements if prompted.
- It should not be used for production-critical code without human review.
## Acknowledgements
- Qwen Team for the original Qwen2.5-Coder-14B model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model