Otilde committed on
Commit 38ee81c · verified · 1 Parent(s): 18f97eb

Update README.md

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -10,10 +10,14 @@ tags:
  - mlx
  base_model: HuggingFaceTB/SmolLM-135M
  ---
+ # Otilde/SmolLM-135M-Dynamic-Q5-MLX
+
+ Original Model: [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
+
  This is an attempt to test the limits of dynamic quantization on consumer hardware using the default settings of `mlx_lm.dynamic_quant`. The model size is about the maximum my M1 Max 32 GB laptop can handle before performance throttling. During the run, the laptop turned off the screen because there were insufficient resources to allocate to the WindowServer process, but after a while it finished and produced the `sensitivities.json` file required for the quantization. It might be possible to push the limits further with a 1B-parameter model, but essentially no other apps could run at the same time.

  For users in a similar situation who want to create dynamic quantizations on their own hardware without crashes or reboots, it is recommended to throttle the system using these environment variables:

  `export OMP_NUM_THREADS=2` This is an OpenMP (Open Multi-Processing) environment variable that sets the default number of threads used for parallel regions. You will likely need to install libomp for this variable to work. If you have Homebrew installed, you can use `brew install libomp`.

- `export MKL_NUM_THREADS=2` This sets the Intel MKL thread count. If you have an Intel device, you can use this. It may work on Apple silicon, but it is often ignored due to compatibility issues.
+ `export MKL_NUM_THREADS=2` This sets the Intel MKL thread count. If you have an Intel device, you can use this. It may work on Apple silicon, but it is often ignored due to compatibility issues.
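
Note on the README text above: `export` applies the two thread caps to everything started from that shell afterwards. A minimal sketch of an alternative, assuming you would rather scope the caps to the quantization run only, is a small wrapper function. The name `run_throttled` is hypothetical and not part of `mlx_lm`. Whether the quantization process actually honors these variables depends on the libraries it loads, as the README itself cautions.

```shell
# Hypothetical helper: run a single command with the thread caps from the
# README applied only to that command's environment. The calling shell's
# own environment is left untouched.
run_throttled() {
  env OMP_NUM_THREADS=2 MKL_NUM_THREADS=2 "$@"
}

# Example (arguments intentionally left out; pass whatever you normally use):
# run_throttled mlx_lm.dynamic_quant <your usual arguments>
```

The wrapper returns the exit status of the wrapped command, so it can be dropped into existing scripts without changing their error handling.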