DataSnake committed 11 days ago

Commit 9c6a437 · verified · 1 Parent(s): 84657dd

Update README.md

Files changed (1)

README.md +5 -6

@@ -10,17 +10,16 @@ tags:
 # Mistral-Nemo-Instruct-2407-NVFP4-FP8
-This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010).
-- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
-- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
 A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.
+This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
 ## Quantization Format
 The quant uses a mixed-precision approach, with different types of layers undergoing different amounts of quantization.
 - Self-attention layers: `FP8_DYNAMIC`, a format that uses FP8 weights and activations, with per-channel BF16 scales for the weights and dynamic per-token scales for the activations. Self-attention tensors comprise slightly less than 20 percent of the weights in the linear layers, meaning that upgrading them from NVFP4 to FP8 has a very limited effect on model size and VRAM use, and the nature of attention means that as context length increases, the benefits of doing attention calculations in w8a8 rather than w4a4 become more pronounced.
-- MLP layers: NVFP4 with [Four Over Six](https://arxiv.org/abs/2512.02010) adaptive block scaling for the weights. As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
+- MLP layers: NVFP4 using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010). As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
+  - **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
+  - **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
 - `lm_head`, `embed_tokens`, layernorms: left in BF16. These are also left alone by regular NVFP4, since they're relatively small and very sensitive to quantization error.
 ### More about Four Over Six
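
The README describes Four Over Six as "purely a matter of how weights are selected rather than a change in format". The toy sketch below illustrates that idea for a single 16-element block. It is not the code from the `fouroversix` repository or from llm-compressor. It assumes the per-block choice is between mapping the block maximum to 6 or to 4 and keeping whichever gives the lower mean squared error, and it keeps the block scale in full precision, whereas real NVFP4 stores it in FP8 alongside a per-tensor scale. The names `fake_quantize`, `mse`, and `four_over_six` are hypothetical.

```python
import math

# Magnitudes representable by one FP4 (E2M1) element, as used by NVFP4.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 elements


def fake_quantize(block, target_max):
    """Scale the block so its largest magnitude maps to target_max,
    round each element to the nearest FP4 magnitude, then scale back."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / target_max  # simplification: scale kept in full precision
    out = []
    for x in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def four_over_six(block):
    """Try mapping the block maximum to 6 and to 4; keep the lower-error result."""
    best_target, best_values = 6.0, fake_quantize(block, 6.0)
    candidate = fake_quantize(block, 4.0)
    if mse(block, candidate) < mse(block, best_values):
        best_target, best_values = 4.0, candidate
    return best_target, best_values


# Hypothetical block: the second value is 5/6 of the maximum, so with the
# maximum mapped to 6 it lands halfway between the FP4 values 4 and 6.
block = [1.0, 5.0 / 6.0] + [0.0] * (BLOCK_SIZE - 2)
target, values = four_over_six(block)
print("chosen target max:", target)
print("error with 6 only:", mse(block, fake_quantize(block, 6.0)))
print("error with 4/6:   ", mse(block, values))
```

Either way the stored result is still FP4 values plus a block scale, which is consistent with the README's point that nothing changes at runtime. Only the scale chosen at quantization time differs.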