# Phi-2 MLX 4-bit
This repository provides a 4-bit MLX-quantized version of Microsoft Phi-2 for local inference on Apple Silicon. The 4-bit variant prioritizes speed and minimal RAM usage, making it well suited to laptops and on-device experimentation.
## Model Details
- Base model: microsoft/phi-2
- Architecture: Decoder-only Transformer
- License: MIT
- Quantization: MLX static quantization, ≈4.5 effective bits per weight (4-bit weights plus per-group scale/bias overhead)
- Target hardware: Apple Silicon (M1 / M2 / M3)
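The ≈4.5 bits-per-weight figure reflects quantization overhead: MLX group quantization stores a scale and bias for each group of weights alongside the 4-bit values. A minimal sketch of the arithmetic, assuming fp16 scales and biases and MLX's default group size of 64 (assumptions for illustration, not read from this repository's config):

```python
def effective_bits(weight_bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight under group quantization:
    the quantized weight bits plus the per-group scale/bias
    amortized across the group."""
    return weight_bits + (scale_bits + bias_bits) / group_size

# 4 + 32/64 = 4.5 effective bits per weight
print(effective_bits())

# Rough disk-size estimate for Phi-2's ~2.7B parameters:
# 4.5 bits * 2.7e9 params / 8 bits-per-byte ≈ 1.52 GB,
# consistent with the ~1.5-1.7 GB figure below.
print(effective_bits() * 2.7e9 / 8 / 1e9)
```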
## Performance Characteristics
| Metric | Value |
|---|---|
| Disk size | ~1.5–1.7 GB |
| Peak RAM usage | ~1.6–1.8 GB |
| Inference speed | Fast (noticeably faster than the fp16 original) |
| Instruction quality | Good; slightly below the 5-bit variant |
## Usage
```bash
mlx_lm.generate \
  --model /path/to/Phi-2-MLX-4bit \
  --prompt "Explain the FFT in simple terms." \
  --max-tokens 120
```
## Notes
- This is a quantized conversion, not a fine-tune.
- The 4-bit version is best for:
  - faster inference
  - lower memory usage
  - interactive local testing
- For higher-quality reasoning and instruction-following, see the 5-bit variant.
## License
This repository redistributes a quantized MLX conversion of Microsoft Phi-2.
- Original model license: MIT
- MLX conversion: MIT
See LICENSE for details.