Update README.md

README.md

### Measurements for creating optimized quants

[measurement.json - 2.0bpw_H6 vs 3.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-2.0-3.0.json)
[measurement.json - 3.0bpw_H6 vs 4.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-3.0-4.0.json)
[measurement.json - 4.0bpw_H6 vs 5.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-4.0-5.0.json)

### How to use these quants

The documentation for [exllamav3](https://github.com/turboderp-org/exllamav3/) is your best bet here, as well as that of [TabbyAPI](https://github.com/theroyallab/tabbyAPI) or [Text Generation Web UI (oobabooga)](https://github.com/oobabooga/text-generation-webui). In short:

* You need to have sufficient VRAM to fit the model and your context cache. I give some pointers above that may be helpful.
* At this point, your GPUs need to be NVIDIA. AMD/ROCm, Intel, and offloading to system RAM are not currently supported.
* You will need a software package capable of loading exllamav3 models. I'm still somewhat partial to oobabooga, but TabbyAPI is another popular option. Follow the documentation for your choice in order to get yourself set up.
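As a rough rule of thumb for the first bullet (my own back-of-the-envelope estimate, not from the exllamav3 docs): the weights of an n-parameter model at b bits per weight occupy about n × b / 8 bytes, before the context cache and framework overhead. The parameter count and bitrate below are hypothetical:

```python
def quant_size_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB: params * bits-per-weight / 8 bytes."""
    return n_params * bpw / 8 / 2**30

# Hypothetical example: a 70-billion-parameter model at 4.0 bpw
weights = quant_size_gib(70e9, 4.0)
print(f"{weights:.1f} GiB")  # ~32.6 GiB, before context cache and overhead
```

Remember to leave headroom on top of this figure for the KV cache at your chosen context length.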

### How to create a quant

The documentation for [exllamav3](https://github.com/turboderp-org/exllamav3/) is again the authoritative source. But for a short primer, click below to continue.

<details>
<summary>Expand for more details</summary>

Quantization happens one layer at a time, so you don't need nearly as much VRAM to quantize a model as you do to load the whole thing.

Not all architectures are supported by exllamav3. Check the documentation to ensure the model you want to quantize is supported.

To create a quant, you'll need to:

* Download your source model
* `git clone` exllamav3
* Set up a Python environment with all requirements from requirements.txt
* Run convert.py:

```bash
python convert.py -w [path/to/work_area] -i [path/to/source_model] -o [path/to/output_model] -b [bitrate] -hb [head bitrate]
```

Where:

* `path/to/work_area` is a folder where the script can save intermediate checkpoints as it works. If the process crashes, you can pass the `--resume` flag to pick up from where it left off.
* `path/to/source_model` is the folder containing the source model you downloaded.
* `path/to/output_model` is the destination folder for your completed quant (it will be created if it does not exist).
* `bitrate` is the average number of bits to use for each weight. It needs to be a float (pass `4.0` even if you just want 4).
* `head bitrate` is the number of bits to use for the output head. 6 is usually most useful here. 8 is generally considered overkill, but may be useful in some situations.
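Filling in the placeholders, a run might look like the following. The paths and model name are hypothetical, and I haven't verified whether `--resume` requires the original flags to be repeated, so check the exllamav3 documentation before relying on this:

```shell
# Hypothetical 3.0 bpw quant with a 6-bit head; all paths are placeholders
python convert.py -w ./work -i ./My-Model -o ./My-Model-3.0bpw_H6 -b 3.0 -hb 6

# If the run crashed, resume from the last checkpoint saved in the work area
python convert.py -w ./work -i ./My-Model -o ./My-Model-3.0bpw_H6 -b 3.0 -hb 6 --resume
```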
</details>

### How to create optimized quants

It's possible to produce quants that are better for a given size than the ones you get by quantizing directly to a given target bitrate. The process involves comparing two quants, measuring which modules are most affected by the quantization process, and selecting those modules first when targeting some in-between bitrate.
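To make the selection idea concrete, here is a toy sketch of the logic. This is my illustration, not the actual exllamav3 tooling: the module names and damage scores are made up, and it assumes all modules are the same size.

```python
# Per-module "damage": hypothetical quality loss when a module drops from the
# higher-bpw quant to the lower-bpw one (e.g. its contribution to KL divergence).
damage = {
    "layer0.mlp": 0.90,
    "layer0.attn": 0.20,
    "layer1.mlp": 0.75,
    "layer1.attn": 0.10,
}
low_bpw, high_bpw, target_bpw = 2.0, 3.0, 2.5

# How many modules can take the higher bitrate while the average stays on target
# (assuming equally sized modules).
n_upgrade = round(len(damage) * (target_bpw - low_bpw) / (high_bpw - low_bpw))

# Upgrade the most-damaged modules first.
upgraded = sorted(damage, key=damage.get, reverse=True)[:n_upgrade]
print(sorted(upgraded))  # ['layer0.mlp', 'layer1.mlp']
```

The real process works with measured per-module statistics rather than made-up scores, but the greedy "worst-hit modules get the extra bits" selection is the core of the idea described above.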

You can use a measurement file from one pair of quants with another pair of quants of the same model. When I tried to use 2.0bpw and 4.0bpw quants to create a 2.25bpw quant, the resulting model came out larger than requested, at 2.48 bpw, because of the substitutions, but it was still an improvement over a straight 2.48bpw quant. An explicitly requested 2.48bpw quant drawing from the 2.0bpw and 3.0bpw quants proved even better (in terms of KL divergence). Finally, I tried creating a 3.25bpw quant from 3.0bpw and 4.0bpw quants, still using my 2.0-vs-3.0 measurement file. This was not as successful as the optimized 2.25bpw quant, and might have benefited from a 'correct' measurement file matching the two actual sources.
</details>

### How to measure Perplexity and KL Divergence
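Before the details below, a quick refresher on the metric itself. This toy snippet is mine, not the repo's measurement script; the probabilities are made up:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.20, 0.10]  # toy next-token probabilities, full-precision model
q = [0.60, 0.25, 0.15]  # same positions, quantized model
print(kl_divergence(p, q))  # ~0.0227 nats
```

In practice this is averaged over every token position of a test corpus, comparing the quantized model's next-token distribution against the full-precision model's; lower is better, and 0 means the quant is indistinguishable from the original.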

<details>