Improve model card metadata and add paper/code links (#1)
by nielsr (HF Staff) - opened
README.md CHANGED
|
@@ -1,12 +1,20 @@
|
|
| 1 |
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
base_model:
|
| 4 |
- mistralai/Mistral-Nemo-Instruct-2407
|
| 5 |
tags:
|
| 6 |
- nvfp4
|
| 7 |
---
|
| 8 |
|
| 9 |
# Mistral-Nemo-Instruct-2407-NVFP4-FP8
|
| 10 |
A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.
|
| 11 |
|
| 12 |
## Quantization Format
|
|
@@ -21,7 +29,7 @@ One of the main downsides of using FP4 is the extreme sparsity of large values.
|
|
| 21 |

|
| 22 |
|
| 23 |
However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so simply scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lowest quantization MSE for that block. The `memoryless_mse` observer in llm-compressor is designed to work on a similar principle, calculating scale factors as though the weights were multiplied by different values of \\(p\\) and choosing the scale that minimizes quantization error for each block. While this is primarily intended for using \\(p\le1\\) to allow extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to \\(±6/p\\). Obviously, this can be used to implement Four Over Six by setting \\(p\in\{1,1.5\}\\). The key to doing this is the following code from `mse.py`:
|
| 24 |
-
```
|
| 25 |
for i in range(int(maxshrink * grid)):
|
| 26 |
p = 1 - i / grid
|
| 27 |
```
|
|
@@ -107,31 +115,24 @@ For this test, I split sample texts into \\(n\\)-token chunks and computed perpl
|
|
| 107 |
While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
|
| 108 |

|
| 109 |
|
| 110 |
-
### Further Perplexity Comparison
|
| 111 |
-
|
| 112 |
-
Out of curiosity, I also tried quantizing the model with a different mixed-precision recipe that quantized all `down_proj` tensors to `FP8_DYNAMIC` and the rest to NVFP4, testing versions [with](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-Down-4over6) and [without](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-Down-RTN) Four Over Six. Interestingly, while these performed better than any other at shorter context lengths, their graphs remained parallel to that of pure NVFP4 and both were overtaken by the versions with FP8 attention at longer contexts. Between this and the fact that the versions with FP8 `down_proj` were larger and thus required more VRAM, I feel confident in my assessment that FP8 attention is the better option overall.
|
| 113 |
-
|
| 114 |
-
<details>
|
| 115 |
-
<summary>Results</summary>
|
| 116 |
-
|
| 117 |
-
|Tokens|FP8 `down_proj`|FP8 `down_proj` (4/6)|
|
| 118 |
-
|-:|-:|-:|
|
| 119 |
-
|4096|3.5965|3.4747|
|
| 120 |
-
|8192|3.4717|3.3517|
|
| 121 |
-
|12288|3.7064|3.5865|
|
| 122 |
-
|16384|4.0343|3.9131|
|
| 123 |
-
|20480|4.2567|4.1288|
|
| 124 |
-
|24576|4.4232|4.2880|
|
| 125 |
-
|28672|4.6076|4.4737|
|
| 126 |
-
|32768|4.7801|4.6277|
|
| 127 |
-
|
| 128 |
-

|
| 129 |
-
</details>
|
| 130 |
-
|
| 131 |
## Inference
|
| 132 |
-
This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine).
|
| 133 |
|
| 134 |
## Credits
|
| 135 |
Mistral-Nemo-Instruct-2407 was made by [Mistral AI](https://huggingface.co/mistralai) and [nVidia](https://huggingface.co/nvidia)
|
| 136 |
|
| 137 |
-
Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
|
| 1 |
---
|
|
|
|
| 2 |
base_model:
|
| 3 |
- mistralai/Mistral-Nemo-Instruct-2407
|
| 4 |
+
license: apache-2.0
|
| 5 |
+
library_name: transformers
|
| 6 |
+
pipeline_tag: text-generation
|
| 7 |
tags:
|
| 8 |
- nvfp4
|
| 9 |
---
|
| 10 |
|
| 11 |
# Mistral-Nemo-Instruct-2407-NVFP4-FP8
|
| 12 |
+
|
| 13 |
+
This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010).
|
| 14 |
+
|
| 15 |
+
- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
|
| 16 |
+
- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
|
| 17 |
+
|
| 18 |
A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.
|
| 19 |
|
| 20 |
## Quantization Format
|
|
|
|
| 29 |

|
| 30 |
|
| 31 |
However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so simply scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lowest quantization MSE for that block. The `memoryless_mse` observer in llm-compressor is designed to work on a similar principle, calculating scale factors as though the weights were multiplied by different values of \\(p\\) and choosing the scale that minimizes quantization error for each block. While this is primarily intended for using \\(p\le1\\) to allow extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to \\(±6/p\\). Obviously, this can be used to implement Four Over Six by setting \\(p\in\{1,1.5\}\\). The key to doing this is the following code from `mse.py`:
|
| 32 |
+
```python
|
| 33 |
for i in range(int(maxshrink * grid)):
|
| 34 |
p = 1 - i / grid
|
| 35 |
```
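The "try both scalings, keep the lower MSE" idea described above can be sketched in plain Python. This is an illustrative sketch only, not the code from llm-compressor or the Four Over Six repository: the function names (`quantize_block`, `four_over_six`) are hypothetical, and it assumes an exact floating-point block scale, whereas real NVFP4 stores each block scale in FP8 alongside a per-tensor scale. The two targets 6 and 4 correspond to \\(6/p\\) for \\(p\in\{1,1.5\}\\).

```python
# Magnitudes representable in FP4 (E2M1); the sign is stored separately.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, round each
    value to the nearest FP4 magnitude, and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max  # assumption: scale kept in full precision
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Quantize with the block maximum mapped to 6 and to 4, then keep
    whichever target gives the lower quantization MSE for this block."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    best = min(candidates, key=lambda t: mse(block, candidates[t]))
    return best, candidates[best]
```

For a block such as `[6.0, 4.9]`, mapping the maximum to 4 wins because 4.9 lands closer to a representable value after rescaling, while a block such as `[6.0, 1.0, 0.5]` is already exactly representable with the maximum mapped to 6, so that scaling is kept.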
|
|
|
|
| 115 |
While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
|
| 116 |

|
| 117 |
|
| 118 |
## Inference
|
| 119 |
+
This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine).
|
| 120 |
|
| 121 |
## Credits
|
| 122 |
Mistral-Nemo-Instruct-2407 was made by [Mistral AI](https://huggingface.co/mistralai) and [nVidia](https://huggingface.co/nvidia)
|
| 123 |
|
| 124 |
+
Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
|
| 125 |
+
|
| 126 |
+
## Citation
|
| 127 |
+
|
| 128 |
+
```bibtex
|
| 129 |
+
@misc{cook2025sixaccuratenvfp4quantization,
|
| 130 |
+
title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
|
| 131 |
+
author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
|
| 132 |
+
year={2025},
|
| 133 |
+
eprint={2512.02010},
|
| 134 |
+
archivePrefix={arXiv},
|
| 135 |
+
primaryClass={cs.CL},
|
| 136 |
+
url={https://arxiv.org/abs/2512.02010},
|
| 137 |
+
}
|
| 138 |
+
```
|