Update README.md
README.md
CHANGED
@@ -10,17 +10,16 @@ tags:

 # Mistral-Nemo-Instruct-2407-NVFP4-FP8

-This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010).
-
-- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
-- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
-
 A version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407) created with llm-compressor 0.10.0 and compressed-tensors 0.14.0, mainly to test out a hybrid quantization format. The goal was to improve accuracy compared to regular NVFP4 with minimal impact on speed and VRAM usage, with the specific goal of remaining small enough to support a 32k-token context window in Aphrodite Engine on a RTX 5060 Ti 16GB.

+This repository contains a quantized version of [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
+
 ## Quantization Format
 The quant uses a mixed-precision approach, with different types of layers undergoing different amounts of quantization.
 - Self-attention layers: `FP8_DYNAMIC`, a format that uses FP8 weights and activations, with per-channel BF16 scales for the weights and dynamic per-token scales for the activations. Self-attention tensors comprise slightly less than 20 percent of the weights in the linear layers, meaning that upgrading them from NVFP4 to FP8 has a very limited effect on model size and VRAM use, and the nature of attention means that as context length increases, the benefits of doing attention calculations in w8a8 rather than w4a4 become more pronounced.
-- MLP layers: NVFP4
+- MLP layers: NVFP4 using the **Four Over Six (4/6)** quantization method described in the paper [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://huggingface.co/papers/2512.02010). As this is purely a matter of how weights are selected rather than a change in format, it increases accuracy compared to regular NVFP4 without impacting performance or requiring any changes at runtime.
+- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
+- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
 - `lm_head`, `embed_tokens`, layernorms: left in BF16. These are also left alone by regular NVFP4, since they're relatively small and very sensitive to quantization error.

 ### More about Four Over Six
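The README describes Four Over Six as a change in how weights are selected, not a change in format. The sketch below illustrates that idea under simplifying assumptions and is not the `fouroversix` implementation: each block is quantized twice, once with its largest magnitude mapped to 6 (the FP4 E2M1 maximum) and once mapped to 4, and the candidate with the lower squared error is kept. The function names are hypothetical, the block scale is kept in full precision rather than rounded to FP8, and rounding is plain round-to-nearest.

```python
# Simplified illustration of adaptive block scaling (NOT the fouroversix code).
# Assumptions: full-precision block scales, round-to-nearest, squared-error selection.

# Magnitudes representable in FP4 E2M1.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, then round to FP4."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, list(block)  # all-zero block: nothing to quantize
    scale = amax / target_max
    dequantized = []
    for x in block:
        nearest = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        dequantized.append((nearest if x >= 0 else -nearest) * scale)
    return scale, dequantized


def squared_error(block, dequantized):
    """Sum of squared differences between original and reconstructed values."""
    return sum((a - b) ** 2 for a, b in zip(block, dequantized))


def four_over_six(block):
    """Quantize with both target maxima and keep the lower-error candidate."""
    candidates = []
    for target_max in (6.0, 4.0):
        scale, dequantized = quantize_block(block, target_max)
        error = squared_error(block, dequantized)
        candidates.append((error, target_max, scale, dequantized))
    error, target_max, scale, dequantized = min(candidates, key=lambda c: c[0])
    return target_max, scale, dequantized, error


# A block with a near-maximal value (5.0 next to 6.0) is the case where mapping to 4 helps.
near_max_block = [6.0, 5.0] + [0.0] * 14
print(four_over_six(near_max_block)[0])
```

In either case the result is still a block of FP4 values plus one scale, which is consistent with the README's point that nothing changes at runtime.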
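The Quantization Format notes say that self-attention tensors make up slightly less than 20 percent of the linear-layer weights, so moving them from NVFP4 to FP8 costs little. A rough way to check that arithmetic is sketched below. The layer dimensions are assumptions taken from the base model's published configuration and not from this README, and the per-weight storage costs are approximations (4 bits plus one 8-bit scale per 16-weight block for NVFP4, 8 bits for FP8 with per-channel scales ignored).

```python
# Back-of-the-envelope weight counts for the hybrid layout.
# Assumption: dimensions are those of the base Mistral-Nemo-Instruct-2407 config.
HIDDEN = 5120
LAYERS = 40
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
INTERMEDIATE = 14336

# Assumed storage cost per weight, in bits.
NVFP4_BITS = 4 + 8 / 16  # 4-bit value plus one 8-bit scale per 16-weight block
FP8_BITS = 8             # per-channel scales are ignored as negligible


def linear_weight_counts():
    """Return (attention, mlp) weight counts summed over all decoder layers."""
    q = HIDDEN * HEADS * HEAD_DIM
    kv = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    o = HEADS * HEAD_DIM * HIDDEN
    attention = LAYERS * (q + kv + o)
    mlp = LAYERS * 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    return attention, mlp


def gib(weights, bits_per_weight):
    """Convert a weight count at a given bit width to GiB."""
    return weights * bits_per_weight / 8 / 2**30


attention, mlp = linear_weight_counts()
attention_share = attention / (attention + mlp)
all_nvfp4 = gib(attention + mlp, NVFP4_BITS)
hybrid = gib(attention, FP8_BITS) + gib(mlp, NVFP4_BITS)

print(f"attention share of linear weights: {attention_share:.1%}")
print(f"linear layers, all NVFP4: {all_nvfp4:.2f} GiB")
print(f"linear layers, FP8 attention + NVFP4 MLP: {hybrid:.2f} GiB")
```

Under these assumptions the attention share comes out just above 19 percent, which agrees with the README's figure, and the hybrid layout adds under 1 GiB to the linear layers. The BF16 tensors (`lm_head`, `embed_tokens`, layernorms) are left out because they are the same size in both layouts.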