DataSnake committed · Commit 9c6a437 · verified · 1 Parent(s): 84657dd

Update README.md

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -10,17 +10,16 @@ tags:
 
  # Mistral-Nemo-Instruct-2407-NVFP4-FP8
 
- This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010).
-
- - **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
- - **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
-
  A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.
 
+ This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
+
  ## Quantization Format
  The quant uses a mixed-precision approach, with different types of layers undergoing different amounts of quantization.
  - Self-attention layers: `FP8_DYNAMIC`, a format that uses FP8 weights and activations, with per-channel BF16 scales for the weights and dynamic per-token scales for the activations. Self-attention tensors comprise slightly less than 20 percent of the weights in the linear layers, meaning that upgrading them from NVFP4 to FP8 has a very limited effect on model size and VRAM use, and the nature of attention means that as context length increases, the benefits of doing attention calculations in w8a8 rather than w4a4 become more pronounced.
- - MLP layers: NVFP4 with [Four Over Six](https://arxiv.org/abs/2512.02010) adaptive block scaling for the weights. As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
+ - MLP layers: NVFP4 using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010). As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
+ - **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
+ - **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
  - `lm_head`, `embed_tokens`, layernorms: left in BF16. These are also left alone by regular NVFP4, since they're relatively small and very sensitive to quantization error.
 
  ### More about Four Over Six
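Note on the MLP bullet in the hunk above: the claim that 4/6 changes only how weights are selected, not the storage format, can be illustrated with a small sketch. This is a simplified illustration, not code from the `fouroversix` repository. It assumes the method compares two candidate scalings per block (largest magnitude mapped to 6 or to 4) and keeps whichever gives the lower squared error. The function names are hypothetical, and the sketch leaves the block scale unquantized, whereas real NVFP4 uses 16-value blocks with FP8 block scales.

```python
# Simplified illustration only; not the fouroversix library's implementation.
# Magnitudes representable by an FP4 (E2M1) value.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Round x to the nearest signed FP4 value (ties go to the smaller magnitude)."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def quantize_block(block, target):
    """Scale so the largest magnitude maps to `target`, round to FP4, dequantize.

    Returns (dequantized values, summed squared error).
    Assumption: the scale is kept in full precision here.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block), 0.0
    scale = amax / target
    deq = [round_to_fp4(v / scale) * scale for v in block]
    err = sum((a - b) ** 2 for a, b in zip(block, deq))
    return deq, err


def four_over_six(block):
    """Try both candidate scalings and keep the one with lower error.

    Returns (chosen target, dequantized values, error).
    """
    deq6, err6 = quantize_block(block, 6.0)
    deq4, err4 = quantize_block(block, 4.0)
    return (4.0, deq4, err4) if err4 < err6 else (6.0, deq6, err6)


if __name__ == "__main__":
    # 5.0 falls in the wide gap between 4 and 6 when the max maps to 6.
    print(four_over_six([6.0, 5.0]))
    # Every value lands exactly on the grid when the max maps to 6.
    print(four_over_six([6.0, 3.0, 1.0]))
```

Either way the stored result is an ordinary set of FP4 values plus a scale, which is consistent with the statement that no runtime changes are required.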