Qwen2.5-Coder-14B - MLX mxfp4 Quantized


Model Summary

This repository contains an MLX-quantized version of Qwen2.5-Coder-14B, optimized for Apple Silicon (M1/M2/M3/M4) devices. The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 30 GB (FP16) to approximately 10.6 GB while maintaining strong coding performance.

This quantized model is suitable for:

  • local coding assistants
  • offline development workflows
  • VS Code integration
  • fast inference on Apple GPUs
  • running large models on 16 GB or 24 GB Apple Silicon machines

Quantization Details

Setting            Value
Quantization mode  mxfp4
Bits per weight    4
Group size         64
Activation dtype   bfloat16
Framework          MLX
Quantization tool  EricFillion/quantize

Command used:

python3 quantize.py  \
  --model_name Qwen/Qwen2.5-Coder-14B \
  --save_model_path models/qwen2.5-coder-14b-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64

Resulting model size: approximately 10.6 GB
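As a rough sanity check on that figure, a back-of-envelope estimate lands in the same range. The assumptions below are not stated in this repo: roughly 14.8B total parameters, one 16-bit scale shared per 64-weight group, and the embeddings plus LM head kept in bf16 rather than quantized.

```python
# Back-of-envelope size estimate for a 4-bit group-quantized 14B model.
# All constants are assumptions, not values taken from this repository.
TOTAL_PARAMS = 14.8e9                    # approximate Qwen2.5-14B parameter count
VOCAB, HIDDEN = 152_064, 5_120           # Qwen2.5 vocab size and hidden dim

emb_params = 2 * VOCAB * HIDDEN          # input embeddings + output head, kept in bf16
quant_params = TOTAL_PARAMS - emb_params # everything else quantized to 4 bits

bits_per_weight = 4 + 16 / 64            # 4-bit values + one 16-bit scale per group of 64
quant_bytes = quant_params * bits_per_weight / 8
bf16_bytes = emb_params * 2              # 2 bytes per bf16 weight

total_gb = (quant_bytes + bf16_bytes) / 1e9
print(f"estimated size: {total_gb:.1f} GB")
```

The estimate comes out near 10 GB, consistent with the ~10.6 GB observed on disk.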


Running the Model (MLX)

CLI (mlx-lm)

mlx_lm.generate \
  --model johnlockejrr/qwen2.5-coder-14b-mxfp4 \
  --prompt "Write a Python function to compute Fibonacci numbers."

Python API

from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-coder-14b-mxfp4")

prompt = "Write a Python function to compute Fibonacci numbers."

output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)

Chat Mode

from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-coder-14b-mxfp4")

messages = [
    {"role": "user", "content": "Explain what a binary search tree is."}
]

# Render the conversation with the model's chat template before generating.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt, max_tokens=300)
print(response)

Performance (Mac Mini M4, 16 GB)

Metric             Value
Generation speed   approximately 9-12 tokens/sec
Peak memory usage  approximately 10.7 GB
GPU                Apple M4 GPU
Framework          MLX
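Those numbers line up with a simple memory-bandwidth argument: single-stream decoding reads essentially all model weights once per token, so throughput is capped near bandwidth divided by model size. The ~120 GB/s figure below is an assumed unified-memory bandwidth for the base M4, not a measurement from this card.

```python
# Bandwidth-bound throughput ceiling for single-stream decoding.
# Assumption (not from the card): base Apple M4 unified memory ~120 GB/s.
MODEL_GB = 10.6          # quantized model size from this card
BANDWIDTH_GBPS = 120.0   # assumed M4 memory bandwidth

# Each decoded token streams the full weight set from memory once.
max_tok_s = BANDWIDTH_GBPS / MODEL_GB
print(f"bandwidth-bound ceiling: ~{max_tok_s:.0f} tokens/sec")
```

The ceiling works out to roughly 11 tokens/sec, which brackets the measured 9-12 tokens/sec nicely.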

Repository Contents

model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md

License

This model inherits the license of the original model:

Qwen2.5-Coder-14B License: https://huggingface.co/Qwen/Qwen2.5-Coder-14B#license

Please review the license before using this model in commercial applications.


Limitations and Bias

  • The model may generate incorrect or insecure code.
  • It may hallucinate APIs or functions.
  • It may produce biased or harmful statements if prompted.
  • It should not be used for production-critical code without human review.

Acknowledgements

  • Qwen Team for the original Qwen2.5-Coder-14B model
  • Apple MLX Team for the MLX framework
  • Eric Fillion for the MLX quantization tool
  • Hugging Face for hosting the model