Improve model card metadata and add paper/code links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +26 -25
README.md CHANGED
@@ -1,12 +1,20 @@
1
  ---
2
- license: apache-2.0
3
  base_model:
4
  - mistralai/Mistral-Nemo-Instruct-2407
5
  tags:
6
  - nvfp4
7
  ---
8
 
9
  # Mistral-Nemo-Instruct-2407-NVFP4-FP8
10
  A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The aim was to improve accuracy over regular NVFP4 with minimal impact on speed and VRAM usage, while staying small enough to support a 32k-token context window in Aphrodite Engine on an RTX 5060 Ti 16GB.
11
 
12
  ## Quantization Format
@@ -21,7 +29,7 @@ One of the main downsides of using FP4 is the extreme sparsity of large values.
21
  ![image/png](four-over-six.png)
22
 
23
  However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lower quantization MSE for that block. The `memoryless_mse` observer in llm-compressor works on a similar principle: it calculates scale factors as though the weights were multiplied by different values of \\(p\\) and chooses the scale that minimizes quantization error for each block. While this is primarily intended for using \\(p\le1\\) to allow extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to \\(±6/p\\). Since \\(6/1.5=4\\), Four Over Six can be implemented by setting \\(p\in\{1,1.5\}\\). The key to doing this is the following code from `mse.py`:
24
- ```
25
  for i in range(int(maxshrink * grid)):
26
      p = 1 - i / grid
27
  ```
@@ -107,31 +115,24 @@ For this test, I split sample texts into \\(n\\)-token chunks and computed perpl
107
  While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
108
  ![image/png](perplexity-plot.png)
109
 
110
- ### Further Perplexity Comparison
111
-
112
- Out of curiosity, I also tried quantizing the model with a different mixed-precision recipe that quantized all `down_proj` tensors to `FP8_DYNAMIC` and the rest to NVFP4, testing versions [with](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-Down-4over6) and [without](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-Down-RTN) Four Over Six. Interestingly, while these performed better than any other at shorter context lengths, their graphs remained parallel to that of pure NVFP4 and both were overtaken by the versions with FP8 attention at longer contexts. Between this and the fact that the versions with FP8 `down_proj` were larger and thus required more VRAM, I feel confident in my assessment that FP8 attention is the better option overall.
113
-
114
- <details>
115
- <summary>Results</summary>
116
-
117
- |Tokens|FP8 `down_proj`|FP8 `down_proj` (4/6)|
118
- |-:|-:|-:|
119
- |4096|3.5965|3.4747|
120
- |8192|3.4717|3.3517|
121
- |12288|3.7064|3.5865|
122
- |16384|4.0343|3.9131|
123
- |20480|4.2567|4.1288|
124
- |24576|4.4232|4.2880|
125
- |28672|4.6076|4.4737|
126
- |32768|4.7801|4.6277|
127
-
128
- ![image/png](perplexity-all.png)
129
- </details>
130
-
131
  ## Inference
132
- This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later, and if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest github commit, you may need to open the file `aphrodite/platforms/interface.py` in your library or venv (if you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), it will be under `~/venv/aphrodite/lib/python3.12/site-packages`) and comment out or delete lines 487-491.
133
 
134
  ## Credits
135
  Mistral-Nemo-Instruct-2407 was made by [Mistral AI](https://huggingface.co/mistralai) and [NVIDIA](https://huggingface.co/nvidia).
136
 
137
- Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
1
  ---
2
  base_model:
3
  - mistralai/Mistral-Nemo-Instruct-2407
4
+ license: apache-2.0
5
+ library_name: transformers
6
+ pipeline_tag: text-generation
7
  tags:
8
  - nvfp4
9
  ---
10
 
11
  # Mistral-Nemo-Instruct-2407-NVFP4-FP8
12
+
13
+ This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010).
14
+
15
+ - **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
16
+ - **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
17
+
18
  A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The aim was to improve accuracy over regular NVFP4 with minimal impact on speed and VRAM usage, while staying small enough to support a 32k-token context window in Aphrodite Engine on an RTX 5060 Ti 16GB.
19
 
20
  ## Quantization Format
 
29
  ![image/png](four-over-six.png)
30
 
31
  However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lower quantization MSE for that block. The `memoryless_mse` observer in llm-compressor works on a similar principle: it calculates scale factors as though the weights were multiplied by different values of \\(p\\) and chooses the scale that minimizes quantization error for each block. While this is primarily intended for using \\(p\le1\\) to allow extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to \\(±6/p\\). Since \\(6/1.5=4\\), Four Over Six can be implemented by setting \\(p\in\{1,1.5\}\\). The key to doing this is the following code from `mse.py`:
32
+ ```python
33
  for i in range(int(maxshrink * grid)):
34
      p = 1 - i / grid
35
  ```
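
As a rough illustration of the try-both-scales idea described above, here is a minimal pure-Python sketch. It is not llm-compressor's implementation: the helper names (`quantize_block`, `mse`, `four_over_six`) are hypothetical, and the block scale is kept in full precision rather than being quantized itself, which is a simplification. Only the FP4 (E2M1) value grid is taken as given.

```python
# Magnitudes representable in FP4 (E2M1); each also exists with a negative sign.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, target):
    """Scale the block so its largest magnitude maps to `target`, round every
    value to the nearest FP4 magnitude, and return the dequantized values."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / target  # simplification: scale stays in full precision
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Quantize with the extreme value mapped to 6 and to 4, then keep
    whichever target gives the lower MSE (ties go to 6)."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    best = min(candidates, key=lambda t: mse(block, candidates[t]))
    return best, candidates[best]
```

In this sketch, the block `[6.0, 5.0]` is better served by scaling to ±4, because 5.0 then lands closer to a representable value, while `[6.0, 1.0]` is reproduced exactly when scaled to ±6.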
 
115
  While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
116
  ![image/png](perplexity-plot.png)
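
For readers who want to reproduce a measurement of this kind, chunked perplexity can be sketched roughly as follows. This is an outline under stated assumptions, not the script used for the numbers above: `chunk_tokens` and `perplexity` are hypothetical helpers, dropping a short trailing chunk is a choice made here for simplicity, and the per-token natural-log probabilities are assumed to come from whichever inference engine is being evaluated.

```python
import math


def chunk_tokens(tokens, n):
    """Split a token list into consecutive n-token chunks.
    Assumption: a trailing chunk shorter than n is dropped."""
    return [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]


def perplexity(logprobs):
    """Perplexity is exp of the negative mean natural-log probability
    assigned to the tokens that actually occurred."""
    return math.exp(-sum(logprobs) / len(logprobs))
```

A model that assigns probability 0.5 to every token has a perplexity of 2, which makes for a quick sanity check of the helper.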
117
 
118
  ## Inference
119
+ This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine).
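
One way to confirm the version requirement before loading the model is a small check like the one below. It is only a sketch: `parse_version` and `compressed_tensors_ok` are hypothetical helpers, the distribution name `compressed-tensors` is assumed to match what is installed in your environment, and the parser deliberately ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_COMPRESSED_TENSORS = (0, 14, 0)


def parse_version(text):
    """Turn '0.14.0' or '0.14.1.dev3' into a tuple of at least three integers,
    stopping at the first component that does not start with a digit."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def compressed_tensors_ok():
    """Return True if compressed-tensors is installed and new enough."""
    try:
        installed = version("compressed-tensors")
    except PackageNotFoundError:
        return False
    return parse_version(installed) >= MIN_COMPRESSED_TENSORS
```

If `compressed_tensors_ok()` returns `False`, upgrading the package in the same environment as the inference engine is the likely fix.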
120
 
121
  ## Credits
122
  Mistral-Nemo-Instruct-2407 was made by [Mistral AI](https://huggingface.co/mistralai) and [NVIDIA](https://huggingface.co/nvidia).
123
 
124
+ Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
125
+
126
+ ## Citation
127
+
128
+ ```bibtex
129
+ @misc{cook2025sixaccuratenvfp4quantization,
130
+ title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
131
+ author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
132
+ year={2025},
133
+ eprint={2512.02010},
134
+ archivePrefix={arXiv},
135
+ primaryClass={cs.CL},
136
+ url={https://arxiv.org/abs/2512.02010},
137
+ }
138
+ ```