Update README.md
README.md CHANGED
@@ -84,6 +84,15 @@ As shown in the following graph, the difference in speed shrinks as context leng
## Long-context Perplexity
For this test, I split sample texts into \\(n\\)-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.

|Tokens|NVFP4|Four Over Six|FP8 Attention|Hybrid|
|-:|-:|-:|-:|-:|
|4096|4.2980|4.1049|3.7679|3.6271|
@@ -98,11 +107,6 @@ For this test, I split sample texts into \\(n\\)-token chunks and computed perpl
While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.


-### Sample texts used
-- [Pride and Prejudice](https://www.gutenberg.org/cache/epub/42671/pg42671.txt)
-- [Frankenstein](https://www.gutenberg.org/cache/epub/84/pg84.txt)
-- [Wuthering Heights](https://www.gutenberg.org/cache/epub/768/pg768.txt)
-- [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt)

## Inference
This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If you are using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later. If you're using Aphrodite Engine 0.10.0 rather than installing it from the latest GitHub commit, you may also need to open the file `aphrodite/platforms/interface.py` in your library or venv and comment out or delete lines 487-491. If you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), the file will be under `~/venv/aphrodite/lib/python3.12/site-packages`.
## Long-context Perplexity
For this test, I split sample texts into \\(n\\)-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.

+<details>
+<summary>Sample texts used</summary>
+
+- [Pride and Prejudice](https://www.gutenberg.org/cache/epub/42671/pg42671.txt)
+- [Frankenstein](https://www.gutenberg.org/cache/epub/84/pg84.txt)
+- [Wuthering Heights](https://www.gutenberg.org/cache/epub/768/pg768.txt)
+- [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt)
+</details>
+
|Tokens|NVFP4|Four Over Six|FP8 Attention|Hybrid|
|-:|-:|-:|-:|-:|
|4096|4.2980|4.1049|3.7679|3.6271|
While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.

## Inference
This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If you are using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later. If you're using Aphrodite Engine 0.10.0 rather than installing it from the latest GitHub commit, you may also need to open the file `aphrodite/platforms/interface.py` in your library or venv and comment out or delete lines 487-491. If you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), the file will be under `~/venv/aphrodite/lib/python3.12/site-packages`.
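
As a quick way to confirm the version requirement before loading the model, a check along these lines may help. It is a minimal sketch using only the standard library: the helper names (`parse_release`, `meets_minimum`, `check_compressed_tensors`) are made up for illustration, and it assumes the package is installed under the distribution name `compressed-tensors`. Pre-release and dev suffixes are ignored, so a build such as `0.14.0.dev1` is treated as `0.14.0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Minimum version stated in this README.
MIN_VERSION = (0, 14, 0)


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release segment, e.g. '0.14.0.dev1' -> (0, 14, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version: str, minimum: Tuple[int, ...] = MIN_VERSION) -> bool:
    """Return True if the numeric release of `version` is at least `minimum`."""
    release = parse_release(version)
    width = max(len(release), len(minimum))
    # Pad with zeros so that '1.0' compares correctly against (0, 14, 0).
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


def check_compressed_tensors() -> Optional[bool]:
    """Return True/False for the installed version, or None if it is not installed."""
    try:
        installed = metadata.version("compressed-tensors")
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed)


if __name__ == "__main__":
    result = check_compressed_tensors()
    if result is None:
        print("compressed-tensors is not installed in this environment")
    elif result:
        print("compressed-tensors is new enough")
    else:
        print("compressed-tensors is older than 0.14.0 and needs updating")
```

If the check reports an older version, upgrading inside the same environment with something like `pip install --upgrade "compressed-tensors>=0.14.0"` should be enough, assuming pip manages that environment.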