Quantized NVFP4 weights of the [Muse-12B](https://huggingface.co/LatitudeGames/Muse-12B) model, for use with NVIDIA Blackwell GPUs.

## Quantization details

Uses [4-over-6](https://arxiv.org/abs/2512.02010) adaptive block scaling with MSE selection for the weights, implemented by running the `memoryless_mse` observer for llm-compressor with `maxshrink` and `grid` set to negative numbers.

### A Brief Overview of 4-over-6

One of the main downsides of using FP4 is the extreme sparsity of large values. At a base level, NVFP4 works by dividing the model into sixteen-element blocks, then assigning FP8 scale factors to each block (as well as a single FP32 scale factor for the tensor as a whole) such that the largest absolute value in the block maps to ±6. For example, if a block has the values {10,-20,40,-60}, the scale factor would be set to 10 and the FP4 values would be {1,-2,4,-6}. The problem is that the FP4 format only allows for a very limited set of values. In particular, it can't represent any number between 4 and 6, so anything in the block that maps to 5 will be severely affected by rounding error. Changing the scale factors so that the maximum value for the block maps to ±4 reduces the maximum possible error introduced by this type of rounding, but it also increases the rounding error in very small values, so it's not a good idea to just quantize the entire model with the maximum value set to 4.
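To make the tradeoff concrete, here is a toy NumPy sketch (not the llm-compressor implementation; the FP4 value grid is written out by hand, and `quantize_block`/`four_over_six` are illustrative names). It fake-quantizes a block at both scale targets and keeps whichever gives the lower mean squared error, which is the essence of the per-block selection described above:

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1); note the gap between 4 and 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, max_target):
    """Fake-quantize one block so its largest |value| maps to max_target
    (6.0 for standard NVFP4, 4.0 for the shrunken alternative)."""
    scale = np.abs(block).max() / max_target
    scaled = block / scale
    # Snap each magnitude to the nearest FP4 grid point, keeping the sign.
    nearest = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID), axis=1)
    return np.sign(scaled) * FP4_GRID[nearest] * scale

def four_over_six(block):
    """Return (target, dequantized) for whichever scale target (6 or 4)
    gives the lower MSE on this block."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    return min(candidates.items(),
               key=lambda kv: np.mean((block - kv[1]) ** 2))

# The example block from above is represented exactly at max -> 6:
# scale 10 gives FP4 values {1, -2, 4, -6}, all on the grid.
print(quantize_block(np.array([10.0, -20.0, 40.0, -60.0]), 6.0))

# A block containing a value that maps to 5 is not, so for it the
# selection switches to max -> 4, which rounds less harshly overall.
target, deq = four_over_six(np.array([10.0, -20.0, 50.0, -60.0]))
print(target, deq)
```

In a real quantizer this choice is made independently for every sixteen-element block, so well-behaved blocks keep the standard ±6 scaling while blocks with awkwardly placed values shrink to ±4.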

### My Implementation