# Qwen2.5-Coder-14B - MLX mxfp4 Quantized
- Repository: johnlockejrr/qwen2.5-coder-14b-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-Coder-14B
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
## Model Summary
This repository contains an MLX-quantized version of Qwen2.5-Coder-14B, optimized for Apple Silicon (M1/M2/M3/M4) devices. The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 30 GB (FP16) to approximately 10.6 GB while maintaining strong coding performance.
This quantized model is suitable for:
- local coding assistants
- offline development workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 16 GB or 24 GB Apple Silicon machines
## Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:

```bash
python3 quantize.py \
  --model_name Qwen/Qwen2.5-Coder-14B \
  --save_model_path models/qwen2.5-coder-14b-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64
```
Resulting model size: approximately 10.6 GB
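As a rough sanity check on the size, group quantization stores a shared scale per group of weights on top of the 4-bit values. The sketch below is a back-of-envelope lower bound, not taken from the repo: it assumes roughly 14.7B parameters, one 8-bit shared scale per group of 64, and every weight quantized. Real checkpoints keep some tensors (embeddings, norms) in higher precision, which is why the actual ~10.6 GB exceeds this naive estimate.

```python
# Naive size estimate for a group-quantized model.
# Assumptions (illustrative, not from the repo): one 8-bit shared
# scale per group; all weights quantized.

def quantized_size_gb(n_params, q_bits=4, group_size=64, scale_bits=8):
    bits_per_weight = q_bits + scale_bits / group_size  # 4.125 bits here
    return n_params * bits_per_weight / 8 / 1e9

print(f"{quantized_size_gb(14.7e9):.1f} GB")  # → 7.6 GB (lower bound)
```

The gap between this lower bound and the shipped 10.6 GB is expected: unquantized tensors and checkpoint metadata add several gigabytes.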
## Running the Model (MLX)
### CLI (mlx-lm)
```bash
mlx_lm.generate \
  --model johnlockejrr/qwen2.5-coder-14b-mxfp4 \
  --prompt "Write a Python function to compute Fibonacci numbers."
```
### Python API
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-coder-14b-mxfp4")

prompt = "Write a Python function to compute Fibonacci numbers."
output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
```
### Chat Mode
```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/qwen2.5-coder-14b-mxfp4")

messages = [
    {"role": "user", "content": "Explain what a binary search tree is."}
]

# Render the messages with the model's chat template, then generate.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt, max_tokens=200)
print(response)
```
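For reference, Qwen2.5 models use a ChatML-style chat template (shipped in `chat_template.jinja`). The sketch below is illustrative only: it hand-builds the same shape of prompt the template produces, omitting the default system message the real template also injects.

```python
# Illustrative only: a minimal ChatML-style prompt of the kind
# Qwen2.5's chat template renders. The shipped chat_template.jinja
# additionally prepends a default system message.

def to_chatml(messages):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    # A trailing assistant header cues the model to start its reply.
    return "".join(parts) + "<|im_start|>assistant\n"

messages = [{"role": "user", "content": "Explain what a binary search tree is."}]
print(to_chatml(messages))
```

In practice you should always use the tokenizer's own chat template rather than hand-formatting prompts, since the template is the source of truth for special tokens.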
## Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 9-12 tokens/sec |
| Peak memory usage | approximately 10.7 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
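To turn the throughput numbers above into expected wall-clock time, a quick sketch (assuming decode-bound generation; prompt processing adds extra time on top):

```python
# Wall-clock estimate from measured decode throughput.
def gen_seconds(n_tokens, tok_per_sec):
    return n_tokens / tok_per_sec

# At 9-12 tokens/sec, a 200-token completion takes roughly:
print(f"{gen_seconds(200, 12):.0f}-{gen_seconds(200, 9):.0f} s")  # → 17-22 s
```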
## Repository Contents

```
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
```
## License
This model inherits the license of the original Qwen2.5-Coder-14B model: https://huggingface.co/Qwen/Qwen2.5-Coder-14B#license
Please review the license before using this model in commercial applications.
## Limitations and Bias
- The model may generate incorrect or insecure code.
- It may hallucinate APIs or functions.
- It may produce biased or harmful statements if prompted.
- It should not be used for production-critical code without human review.
## Acknowledgements
- Qwen Team for the original Qwen2.5-Coder-14B model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model