DataSnake committed on Apr 8

Commit b17b8fb (verified) · 1 Parent(s): fa2b71b

Update README.md

Files changed (1)

README.md +9 -5

@@ -84,6 +84,15 @@ As shown in the following graph, the difference in speed shrinks as context leng
 ## Long-context Perplexity
 For this test, I split sample texts into \\(n\\)-token chunks and computed perplexity scores for each chunk in all four quantized models: the NVFP4 baseline, one with Four Over Six weight selection, one with the attention tensors in FP8, and the hybrid format. I then recorded the average perplexity score for each model at each chunk size. Sample texts for this step were UTF-8 text files taken from Project Gutenberg, listed below.
+<details>
+<summary>Sample texts used</summary>
+- [Pride and Prejudice](https://www.gutenberg.org/cache/epub/42671/pg42671.txt)
+- [Frankenstein](https://www.gutenberg.org/cache/epub/84/pg84.txt)
+- [Wuthering Heights](https://www.gutenberg.org/cache/epub/768/pg768.txt)
+- [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt)
+</details>
 |Tokens|NVFP4|Four Over Six|FP8 Attention|Hybrid|
 |-:|-:|-:|-:|-:|
 |4096|4.2980|4.1049|3.7679|3.6271|
@@ -98,11 +107,6 @@ For this test, I split sample texts into \\(n\\)-token chunks and computed perpl
 While perplexity for all quants increases with context length past 8192, the chart is very different from the one for performance and rather informative. Changing between Four Over Six and the default NVFP4 weight selection was a linear change in both the pure NVFP4 model and the one with FP8 attention. The two models with FP8 attention diverge from the two without as context length changes, however, indicating that as the number of tokens attending to each other increases, the benefits of doing attention calculations in higher precision become more pronounced.
 ![image/png](perplexity-plot.png)
-### Sample texts used
-- [Pride and Prejudice](https://www.gutenberg.org/cache/epub/42671/pg42671.txt)
-- [Frankenstein](https://www.gutenberg.org/cache/epub/84/pg84.txt)
-- [Wuthering Heights](https://www.gutenberg.org/cache/epub/768/pg768.txt)
-- [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt)
 ## Inference
 This model requires compressed-tensors 0.14.0 or later and has been tested on both [vLLM](https://github.com/vllm-project/vllm) and [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine). If using Aphrodite Engine or an older version of vLLM, you'll need to manually update compressed-tensors to 0.14.0 or later, and if you're using Aphrodite Engine 0.10.0 rather than installing it from the latest github commit, you may need to open the file `aphrodite/platforms/interface.py` in your library or venv (if you've followed the [official installation instructions](https://aphrodite.pygmalion.chat/installation/installation/), it will be under `~/venv/aphrodite/lib/python3.12/site-packages`) and comment out or delete lines 487-491.
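
The Long-context Perplexity section in the diff above describes splitting each sample text into n-token chunks, scoring every chunk, and averaging the scores per chunk size. A minimal sketch of that bookkeeping follows. It is not the author's script: the names `split_into_chunks`, `chunk_perplexity`, `average_perplexity`, and the `score_chunk` callback are hypothetical. It also assumes natural-log token probabilities, that a trailing partial chunk is dropped, and that the average is taken over all chunks; the README states none of these. In practice `score_chunk` would wrap a forward pass of the quantized model, but a toy scorer is used here so the block runs on its own.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical callback type: takes one chunk of token ids and returns the
# natural-log probability the model assigned to each token in that chunk.
ScoreFn = Callable[[Sequence[int]], Sequence[float]]


def split_into_chunks(token_ids: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive full chunks of chunk_size tokens.

    Assumption: a trailing partial chunk is dropped so that every chunk has
    the same length.
    """
    full_chunks = len(token_ids) // chunk_size
    return [
        list(token_ids[i * chunk_size:(i + 1) * chunk_size])
        for i in range(full_chunks)
    ]


def chunk_perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one chunk: exp of the negative mean log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_perplexity(
    texts_token_ids: Sequence[Sequence[int]],
    chunk_size: int,
    score_chunk: ScoreFn,
) -> float:
    """Average the per-chunk perplexity over every chunk of every text."""
    scores = [
        chunk_perplexity(score_chunk(chunk))
        for token_ids in texts_token_ids
        for chunk in split_into_chunks(token_ids, chunk_size)
    ]
    if not scores:
        raise ValueError("no text is long enough to fill a single chunk")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy scorer standing in for a real model: every token gets probability
    # 1/4, so each chunk's perplexity is about 4.0.
    def uniform_scorer(chunk: Sequence[int]) -> List[float]:
        return [-math.log(4.0)] * len(chunk)

    toy_texts = [list(range(10)), list(range(9))]
    print(average_perplexity(toy_texts, chunk_size=4, score_chunk=uniform_scorer))
```

Repeating the call for each chunk size (4096, 8192, and so on) and for each quantized model would yield one table row per chunk size, in the layout the README uses.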