Otilde/SmolLM-135M-Dynamic-Q5-MLX

Original Model: HuggingFaceTB/SmolLM-135M

This is an attempt to test the limits of dynamic quantization on consumer hardware using the default settings of mlx_lm.dynamic_quant. This model is about the largest my M1 Max 32 GB laptop can handle before performance throttling sets in. During the run, the laptop had to turn off the screen because there were not enough resources left for the WindowServer process, but it eventually finished and produced the sensitivities.json file required for the quantization. It might be possible to push the limits further with a 1B-parameter model, but essentially no other apps can run at the same time.
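Once the run completes, the sensitivities.json file can be skimmed with a few lines of Python. A minimal sketch follows; the key/value layout written here is invented purely for illustration (the real file's structure comes from your own run), but the load-and-sort logic works for any flat JSON mapping:

```python
import json

# Stand-in payload: the real sensitivities.json is produced by the
# quantization run; this structure is hypothetical, for illustration only.
example = {"model.layers.0.mlp": 0.12, "model.layers.1.mlp": 0.34}
with open("sensitivities.json", "w") as f:
    json.dump(example, f)

# Load the file and list entries from most to least sensitive.
with open("sensitivities.json") as f:
    sensitivities = json.load(f)

for name, score in sorted(sensitivities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score}")
```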

For users in a similar situation who want to create dynamic quantizations on their hardware without crashes or reboots, it is recommended to throttle the system using these environment variables:

`export OMP_NUM_THREADS=2` — an OpenMP (Open Multi-Processing) environment variable that sets the default number of threads used for parallel regions. You will likely need to install libomp for this variable to take effect; with Homebrew installed, run `brew install libomp`.

`export MKL_NUM_THREADS=2` — sets the Intel MKL thread count, so it applies if you are on an Intel machine. It may work on Apple silicon, but it is often ignored due to compatibility issues.
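Putting the two variables together, a typical session sets them in the shell before launching the quantization in that same shell. The commented invocation below is a sketch based on the module name mentioned in this card; check your mlx-lm install for the exact CLI:

```shell
# Cap CPU-side parallelism so the run leaves headroom for the OS.
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2

# Then launch the quantization from the same shell, e.g.
# (hypothetical invocation -- verify the entry point on your install):
#   python -m mlx_lm.dynamic_quant ...

echo "threads capped: OMP=$OMP_NUM_THREADS MKL=$MKL_NUM_THREADS"
```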
