Add library name, code link and citation
This PR improves the model card by:
- Adding `library_name: transformers` to the metadata to enable the "Use in Transformers" button.
- Adding a link to the official GitHub repository for the [Four Over Six](https://github.com/mit-han-lab/fouroversix) quantization method.
- Adding the BibTeX citation for the research paper.
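
As a rough illustration of the first bullet, the sketch below reads the top-level keys of a README's YAML front matter and checks for `library_name`. It is not how the Hub parses model cards: `front_matter_scalars` is a hypothetical helper written for this example, it only handles flat `key: value` lines, and the sample README is a shortened stand-in for the real file.

```python
from typing import Dict


def front_matter_scalars(readme_text: str) -> Dict[str, str]:
    """Collect top-level `key: value` pairs from a leading `---` block.

    Hypothetical helper, not a real YAML parser: list items, indented
    lines and comments are skipped, and list-valued keys map to "".
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter of the front matter
        if line.startswith(("-", " ", "#")) or ":" not in line:
            continue  # list item, nested line, comment, or blank line
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Shortened stand-in for the README after this PR (assumption, not the full file).
SAMPLE_README = """---
base_model:
- LatitudeGames/Wayfarer-2-12B
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# Wayfarer-2-12B-NVFP4-FP8
"""

fields = front_matter_scalars(SAMPLE_README)
print(fields.get("library_name"))  # prints "transformers"
```

A real check would more likely load the front matter with a YAML library; the hand-rolled scan here only keeps the example free of third-party dependencies.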
README.md
CHANGED
@@ -1,24 +1,29 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - LatitudeGames/Wayfarer-2-12B
+datasets:
+- zerofata/Roleplay-Anime-Characters
+language:
+- en
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
 tags:
 - text adventure
 - roleplay
 - nvfp4
 model_size: 12B
-datasets:
-- zerofata/Roleplay-Anime-Characters
-pipeline_tag: text-generation
 ---
+
 
 
 # Wayfarer-2-12B-NVFP4-FP8
 
 Quantized weights of the [Wayfarer-2-12B](https://huggingface.co/LatitudeGames/Wayfarer-2-12B) model for use with nVidia Blackwell GPUs, in a hybrid format using NVFP4 with [Four Over Six](https://arxiv.org/abs/2512.02010) adaptive block scaling for the MLP layers and `FP8_DYNAMIC` for the self-attention layers. More information about the hybrid format [here](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8), but the short version is that FP8 attention has minimal impact on speed and VRAM usage while making a marked difference in output quality, especially at longer context lengths.
 
+- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
+- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
+
 ## Inference
 Tested on a RTX 5060 Ti 16GB with [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine) and [vLLM](https://github.com/vllm-project/vllm). It requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine 0.10.0 or an older version of vLLM. On my system, Aphrodite Engine 0.10.0 was able to run the checkpoint with a 32k context window with the `--single-user-mode` flag, while vLLM 0.20.0 and Aphrodite Engine 0.20.0, which don't have that flag, were able to do the same with `--max-num-seqs 1 --cudagraph-capture-sizes 2` flags, though with the caveat that each crashed with OOM errors the first time they ran the model but ran fine from the second time onwards.
 <details>
@@ -76,4 +81,18 @@ As such, I would recommend using that format for inference.
 
 Wayfarer-2-12B was made by [Latitude Games](https://huggingface.co/LatitudeGames) with help from [Gryphe Padar](https://huggingface.co/Gryphe)
 
-Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+
+## Citation
+
+```bibtex
+@misc{cook2025sixaccuratenvfp4quantization,
+title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
+author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
+year={2025},
+eprint={2512.02010},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2512.02010},
+}
+```
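
The Inference paragraph in the README asks for compressed-tensors 0.14.0 or later and reports running vLLM with `--max-num-seqs 1 --cudagraph-capture-sizes 2` at a 32k context window. A minimal preflight sketch of those two points might look like the following. The helper names are hypothetical, the version comparison is deliberately simplistic (it looks only at the numeric release components), and `--max-model-len 32768` is an assumption about how the 32k window would be requested; the README does not say which flag was used for that.

```python
import re
import shlex
from importlib import metadata
from typing import List, Optional, Tuple

MIN_COMPRESSED_TENSORS = "0.14.0"  # minimum stated in the README


def release_tuple(version: str) -> Tuple[int, ...]:
    """Return the numeric release part, padded to three fields.

    Simplified on purpose: '0.14.0.post1' -> (0, 14, 0); pre-release
    suffixes such as 'a1' are ignored rather than ordered properly.
    """
    parts: List[int] = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_COMPRESSED_TENSORS) -> bool:
    """True when the installed release is at least the stated minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_compressed_tensors() -> Optional[str]:
    """Version of compressed-tensors in the current environment, if any."""
    try:
        return metadata.version("compressed-tensors")
    except metadata.PackageNotFoundError:
        return None


def build_vllm_command(model: str, max_model_len: int = 32768) -> List[str]:
    """Assemble a `vllm serve` command using the flags quoted in the README.

    `--max-model-len` is an assumption added for the 32k context window.
    """
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", "1",
        "--cudagraph-capture-sizes", "2",
    ]


if __name__ == "__main__":
    found = installed_compressed_tensors()
    if found is None or not meets_minimum(found):
        print(f"compressed-tensors {found}; README asks for >= {MIN_COMPRESSED_TENSORS}")
    print(shlex.join(build_vllm_command("DataSnake/Wayfarer-2-12B-NVFP4-FP8")))
```

The script only prints the command rather than launching it, since the README notes that the first launch may fail with an OOM error and succeed on a retry.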