Instructions to use DataSnake/Wayfarer-2-12B-NVFP4-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use DataSnake/Wayfarer-2-12B-NVFP4-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="DataSnake/Wayfarer-2-12B-NVFP4-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result; a bare expression shows nothing when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DataSnake/Wayfarer-2-12B-NVFP4-FP8")
model = AutoModelForCausalLM.from_pretrained("DataSnake/Wayfarer-2-12B-NVFP4-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use DataSnake/Wayfarer-2-12B-NVFP4-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DataSnake/Wayfarer-2-12B-NVFP4-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DataSnake/Wayfarer-2-12B-NVFP4-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
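
The same chat-completions call can be made from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `send_chat_request` and the `max_tokens` default are illustrative choices, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=128):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,  # assumed default, adjust as needed
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("http://localhost:8000", "DataSnake/Wayfarer-2-12B-NVFP4-FP8", "What is the capital of France?"))` should return the reply as a string. Passing a different base URL, such as `http://localhost:30000`, targets any other OpenAI-compatible server.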

Use Docker

docker model run hf.co/DataSnake/Wayfarer-2-12B-NVFP4-FP8

SGLang

How to use DataSnake/Wayfarer-2-12B-NVFP4-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "DataSnake/Wayfarer-2-12B-NVFP4-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DataSnake/Wayfarer-2-12B-NVFP4-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "DataSnake/Wayfarer-2-12B-NVFP4-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DataSnake/Wayfarer-2-12B-NVFP4-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use DataSnake/Wayfarer-2-12B-NVFP4-FP8 with Docker Model Runner:
```
docker model run hf.co/DataSnake/Wayfarer-2-12B-NVFP4-FP8
```

nielsr (HF Staff) committed 28 days ago

Commit 2bb284b (verified) · 1 parent: 0b21ab2

Add library name, code link and citation

This PR improves the model card by:
- Adding `library_name: transformers` to the metadata to enable the "Use in Transformers" button.
- Adding a link to the official GitHub repository for the [Four Over Six](https://github.com/mit-han-lab/fouroversix) quantization method.
- Adding the BibTeX citation for the research paper.
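
One way to sanity-check the metadata part of this change is to compare the top-level front-matter keys before and after. The sketch below uses the two front-matter blocks from the README.md diff in this commit; `front_matter_keys` is a hypothetical helper that only recognizes simple top-level `key:` lines and is not a full YAML parser.

```python
def front_matter_keys(readme_text):
    """Return the set of top-level keys in a README's YAML front matter."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter of the front matter
            break
        # Top-level keys start in column 0 and are not list items.
        if line and not line[0].isspace() and not line.startswith("-") and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return keys


OLD = """---
license: apache-2.0
language:
- en
base_model:
- LatitudeGames/Wayfarer-2-12B
tags:
- text adventure
- roleplay
- nvfp4
model_size: 12B
datasets:
- zerofata/Roleplay-Anime-Characters
pipeline_tag: text-generation
---
"""

NEW = """---
base_model:
- LatitudeGames/Wayfarer-2-12B
datasets:
- zerofata/Roleplay-Anime-Characters
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text adventure
- roleplay
- nvfp4
model_size: 12B
---
"""

added = front_matter_keys(NEW) - front_matter_keys(OLD)
removed = front_matter_keys(OLD) - front_matter_keys(NEW)
print(sorted(added), sorted(removed))  # ['library_name'] []
```

The reordering of the existing keys is cosmetic; the only key the commit introduces is `library_name`.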

Files changed (1)

README.md +26 -7

@@ -1,24 +1,29 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - LatitudeGames/Wayfarer-2-12B
+datasets:
+- zerofata/Roleplay-Anime-Characters
+language:
+- en
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
 tags:
 - text adventure
 - roleplay
 - nvfp4
 model_size: 12B
-datasets:
-- zerofata/Roleplay-Anime-Characters
-pipeline_tag: text-generation
 ---
 ![image/jpeg](Wayfarer-2-12B.jpg)
 # Wayfarer-2-12B-NVFP4-FP8
 Quantized weights of the [Wayfarer-2-12B](https://huggingface.co/LatitudeGames/Wayfarer-2-12B) model for use with nVidia Blackwell GPUs, in a hybrid format using NVFP4 with [Four Over Six](https://arxiv.org/abs/2512.02010) adaptive block scaling for the MLP layers and `FP8_DYNAMIC` for the self-attention layers. More information about the hybrid format [here](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8), but the short version is that FP8 attention has minimal impact on speed and VRAM usage while making a marked difference in output quality, especially at longer context lengths.
+- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
+- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
 ## Inference
 Tested on a RTX 5060 Ti 16GB with [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine) and [vLLM](https://github.com/vllm-project/vllm). It requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine 0.10.0 or an older version of vLLM. On my system, Aphrodite Engine 0.10.0 was able to run the checkpoint with a 32k context window with the `--single-user-mode` flag, while vLLM 0.20.0 and Aphrodite Engine 0.20.0, which don't have that flag, were able to do the same with `--max-num-seqs 1 --cudagraph-capture-sizes 2` flags, though with the caveat that each crashed with OOM errors the first time they ran the model but ran fine from the second time onwards.
 <details>
@@ -76,4 +81,18 @@ As such, I would recommend using that format for inference.
 Wayfarer-2-12B was made by [Latitude Games](https://huggingface.co/LatitudeGames) with help from [Gryphe Padar](https://huggingface.co/Gryphe)
-Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+## Citation
+```bibtex
+@misc{cook2025sixaccuratenvfp4quantization,
+      title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
+      author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
+      year={2025},
+      eprint={2512.02010},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2512.02010},
+}
+```
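
The Inference section of the README gives two practical requirements: compressed-tensors 0.14.0 or later, and the `--max-num-seqs 1 --cudagraph-capture-sizes 2` flags for fitting a 32k context on a 16GB card with vLLM. The sketch below checks the first and assembles a launch command for the second. The helper names are hypothetical, and `--max-model-len 32768` is an assumed way of requesting the 32k window, since the README does not say how it was set.

```python
from importlib import metadata

MIN_COMPRESSED_TENSORS = (0, 14, 0)  # minimum stated in the README


def parse_version(text):
    """Turn '0.14.0' or '0.14.0.post1' into a 3-tuple of leading integers.

    Pre-release and post-release suffixes are ignored in this sketch.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts = parts[:3]
    return tuple(parts + [0] * (3 - len(parts)))  # pad '0.14' to (0, 14, 0)


def meets_minimum(installed, minimum=MIN_COMPRESSED_TENSORS):
    """True if the installed version string is at least the minimum."""
    return parse_version(installed) >= minimum


def installed_compressed_tensors_version():
    """Return the installed compressed-tensors version, or None if absent."""
    try:
        return metadata.version("compressed-tensors")
    except metadata.PackageNotFoundError:
        return None


def vllm_serve_args(model, max_model_len=32768):
    """Build a `vllm serve` argument list with the flags quoted in the README."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),  # assumption: 32k context
        "--max-num-seqs", "1",
        "--cudagraph-capture-sizes", "2",
    ]
```

The argument list can be handed to `subprocess.run` once `installed_compressed_tensors_version()` returns a version that passes `meets_minimum`.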