Phi-2 MLX 4-bit

This repository provides a 4-bit MLX-quantized version of Microsoft Phi-2, optimized for fast, low-memory local inference on Apple Silicon.

This variant prioritizes speed and minimal RAM usage, making it ideal for laptops and on-device experimentation.


Model Details

  • Base model: microsoft/phi-2
  • Architecture: Decoder-only Transformer
  • License: MIT
  • Quantization: MLX 4-bit group quantization (≈4.5 effective bits per weight, including per-group scales and biases)
  • Target hardware: Apple Silicon (M1 / M2 / M3)
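The ≈4.5 effective bits/weight figure can be understood as the 4-bit weights plus per-group quantization metadata amortized over each group. A minimal sketch of that arithmetic, assuming a group size of 64 with an fp16 scale and bias per group (these specific values are assumptions, not stated in this card):

```python
# Effective bits per weight for group quantization: the 4-bit codes plus
# per-group scale and bias metadata, amortized across the group.
def effective_bits(weight_bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Bits stored per weight once per-group metadata is spread out."""
    return weight_bits + (scale_bits + bias_bits) / group_size

print(effective_bits())  # 4.5
```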

Performance Characteristics

Metric               Value
Disk size            ~1.5–1.7 GB
Peak RAM usage       ~1.6–1.8 GB
Inference speed      Faster than the fp16 base model
Instruction quality  Good; slightly below the 5-bit variant
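The quoted disk size can be cross-checked from first principles: Phi-2 has roughly 2.7B parameters, and the card quotes ≈4.5 effective bits per weight. A quick back-of-the-envelope sketch:

```python
# Cross-check the ~1.5–1.7 GB disk size from parameter count and
# effective bits per weight (2.7e9 params is Phi-2's published size).
params = 2.7e9
bits_per_weight = 4.5
size_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(f"{size_gb:.2f} GB")
```

This lands at roughly 1.5 GB, consistent with the table above (the extra headroom covers tokenizer files, config, and non-quantized tensors).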

Usage

mlx_lm.generate \
  --model /path/to/Phi-2-MLX-4bit \
  --prompt "Explain the FFT in simple terms." \
  --max-tokens 120
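The same generation can be driven from Python via the mlx-lm library (install with `pip install mlx-lm`; requires Apple Silicon). A minimal sketch, with the model path as a placeholder:

```python
# Sketch of loading and generating with the mlx-lm Python API.
# Runs only on Apple Silicon with mlx-lm installed; the path is a placeholder.
from mlx_lm import load, generate

model, tokenizer = load("/path/to/Phi-2-MLX-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain the FFT in simple terms.",
    max_tokens=120,
)
print(text)
```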

Notes

  • This is a quantized conversion, not a fine-tune.
  • The 4-bit version is best for:
    • faster inference
    • lower memory usage
    • interactive local testing
  • For higher-quality reasoning and instruction-following, see the 5-bit variant.

License

This repository redistributes a quantized MLX conversion of Microsoft Phi-2.

  • Original model license: MIT
  • MLX conversion: MIT

See LICENSE for details.
