Instructions to use DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8 with libraries and local apps.

Libraries

How to use DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8")
model = AutoModelForCausalLM.from_pretrained("DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8 with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
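
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`. `build_chat_request` and `send` are hypothetical helper names, and the response layout is assumed to follow the OpenAI chat completions format that the server advertises.

```python
import json
import urllib.request

MODEL_ID = "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the text of the first choice.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `send(build_chat_request("What is the capital of France?"))` should return the reply text. For the SGLang server, pass `base_url="http://localhost:30000"` instead.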

Use Docker

```bash
docker model run hf.co/DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8
```

SGLang

How to use DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8 with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8 with Docker Model Runner:
```
docker model run hf.co/DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8
```

Improve model card metadata and add paper/code links

by nielsr HF Staff - opened 18 days ago

base: refs/heads/main ← from: refs/pr/1

README.md +26 -25

@@ -1,12 +1,20 @@
 ---
-license: apache-2.0
 base_model:
 - mistralai/Mistral-Nemo-Instruct-2407
 tags:
 - nvfp4
 ---
 # Mistral-Nemo-Instruct-2407-NVFP4-FP8
 A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.
 ## Quantization Format
@@ -21,7 +29,7 @@ One of the main downsides of using FP4 is the extreme sparsity of large values.
 ![image/png](four-over-six.png)
 However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so simply scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lowest quantization MSE for that block. The `memoryless_mse` observer in llm-compressor is designed to work on a similar principle, calculating scale factors as though the weights were multiplied by different values of \\(p\\) and choosing the scale that minimizes quantization error for each block. While this is primarily intended for using \\(p\le1\\) to allow extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to \\(±6/p\\). Obviously, this can be used to implement Four Over Six by setting \\(p\in\{1,1.5\}\\). The key to doing this is the following code from `mse.py`:
-```
 for i in range(int(maxshrink * grid)):
     p = 1 - i / grid
 ```
@@ -107,31 +115,24 @@ For this test, I split sample texts into \\(n\\)-token chunks and computed perpl
 While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
 ![image/png](perplexity-plot.png)
-### Further Perplexity Comparison
-Out of curiosity, I also tried quantizing the model with a different mixed-precision recipe that quantized all `down_proj` tensors to `FP8_DYNAMIC` and the rest to NVFP4, testing versions [with](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-Down-4over6) and [without](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-Down-RTN) Four Over Six. Interestingly, while these performed better than any other at shorter context lengths, their graphs remained parallel to that of pure NVFP4 and both were overtaken by the versions with FP8 attention at longer contexts. Between this and the fact that the versions with FP8 `down_proj` were larger and thus required more VRAM, I feel confident in my assessment that FP8 attention is the better option overall.
-<details>
-<summary>Results</summary>
-|Tokens|FP8 `down_proj`|FP8 `down_proj` (4/6)|
-|-:|-:|-:|
-|4096|3.5965|3.4747|
-|8192|3.4717|3.3517|
-|12288|3.7064|3.5865|
-|16384|4.0343|3.9131|
-|20480|4.2567|4.1288|
-|24576|4.4232|4.2880|
-|28672|4.6076|4.4737|
-|32768|4.7801|4.6277|
-![image/png](perplexity-all.png)
-</details>
 ## Inference
-This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later, and if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest github commit, you may need to open the file `aphrodite/platforms/interface.py` in your library or venv (if you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), it will be under `~/venv/aphrodite/lib/python3.12/site-packages`) and comment out or delete lines 487-491.
 ## Credits
 Mistral-Nemo-Instruct-2407 was made by [Mistral AI](https://huggingface.co/mistralai) and [nVidia](https://huggingface.co/nvidia)
-Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han

 ---
 base_model:
 - mistralai/Mistral-Nemo-Instruct-2407
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
 tags:
 - nvfp4
 ---
 # Mistral-Nemo-Instruct-2407-NVFP4-FP8
+This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010).
+- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
+- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
 A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.
 ## Quantization Format
 ![image/png](four-over-six.png)
 However, while scaling to ±4 reduces worst-case rounding error for large values, it increases rounding error for smaller values, so simply scaling every block to ±4 would be a bad idea. The solution is to try scaling each block both ways, then keep whichever gives the lowest quantization MSE for that block. The `memoryless_mse` observer in llm-compressor is designed to work on a similar principle, calculating scale factors as though the weights were multiplied by different values of \\(p\\) and choosing the scale that minimizes quantization error for each block. While this is primarily intended for using \\(p\le1\\) to allow extra precision for small values at the cost of clipping large values, when used with NVFP4 it's mathematically equivalent to mapping the most extreme values in each block to \\(±6/p\\). Obviously, this can be used to implement Four Over Six by setting \\(p\in\{1,1.5\}\\). The key to doing this is the following code from `mse.py`:
+```python
 for i in range(int(maxshrink * grid)):
     p = 1 - i / grid
 ```
 While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
 ![image/png](perplexity-plot.png)
 ## Inference
+This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine).
 ## Credits
 Mistral-Nemo-Instruct-2407 was made by [Mistral AI](https://huggingface.co/mistralai) and [nVidia](https://huggingface.co/nvidia)
+Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+## Citation
+```bibtex
+@misc{cook2025sixaccuratenvfp4quantization,
+      title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
+      author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
+      year={2025},
+      eprint={2512.02010},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2512.02010},
+}
+```
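
The per-block selection that the README describes (scale each block so its largest magnitude maps to ±6 and to ±4, then keep whichever gives the lower quantization MSE) can be sketched in plain Python. This is a minimal illustration, not the code from llm-compressor or the fouroversix repository: the function names are hypothetical, the block scale is kept in full precision rather than stored as an FP8 value, and rounding ties fall to the smaller magnitude instead of following hardware rounding rules. In the README's notation, the two targets correspond to ±6/p with p = 1 and p = 1.5.

```python
# Magnitudes representable in FP4 (E2M1), the element format used by NVFP4.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Round x to the nearest FP4 magnitude, keeping its sign (ties go to the smaller one)."""
    magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return magnitude if x >= 0 else -magnitude


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, then round-trip it.

    Returns the dequantized values and the mean squared error against the input.
    Assumption: the scale is an ordinary float, not an FP8 block scale.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block), 0.0
    scale = amax / target_max
    dequantized = [round_to_fp4(v / scale) * scale for v in block]
    mse = sum((a - b) ** 2 for a, b in zip(block, dequantized)) / len(block)
    return dequantized, mse


def four_over_six(block):
    """Try both scalings and keep the one with the lower MSE (ties keep 6)."""
    candidates = {t: quantize_block(block, t) for t in (6.0, 4.0)}
    best = min(candidates, key=lambda t: candidates[t][1])
    dequantized, mse = candidates[best]
    return best, dequantized, mse
```

For the block `[6.0, 5.0]`, scaling to ±6 forces 5.0 onto 4 or 6, while scaling to ±4 places it at 4.5, so the ±4 scaling is selected. For `[6.0, 0.5]`, both values are exactly representable at ±6, so that scaling is kept.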