nielsr HF Staff committed on
Commit 2bb284b · verified · 1 Parent(s): 0b21ab2

Add library name, code link and citation

This PR improves the model card by:
- Adding `library_name: transformers` to the metadata to enable the "Use in Transformers" button.
- Adding a link to the official GitHub repository for the [Four Over Six](https://github.com/mit-han-lab/fouroversix) quantization method.
- Adding the BibTeX citation for the research paper.
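
As a rough way to confirm the metadata change described above, the sketch below reads the top-level keys from a README's YAML front matter and checks that `library_name` is present. The names `front_matter_keys` and `README_AFTER` are hypothetical, and the line-based parsing is a simplification for illustration only (it is not a YAML parser and is not how the Hub reads model cards). The sample text mirrors the front matter on the new side of this diff.

```python
import re


def front_matter_keys(readme_text: str) -> dict[str, str]:
    """Return top-level `key: value` pairs from a README's YAML front matter.

    Simplified on purpose: list items such as "- en" are skipped, and keys
    that introduce a list map to an empty string.
    """
    match = re.match(r"\A---\n(.*?)\n---\n", readme_text, flags=re.DOTALL)
    if match is None:
        return {}
    pairs: dict[str, str] = {}
    for line in match.group(1).splitlines():
        key_value = re.match(r"^([A-Za-z_]+):\s*(.*)$", line)
        if key_value:
            pairs[key_value.group(1)] = key_value.group(2)
    return pairs


# Front matter as it appears after this commit (copied from the diff).
README_AFTER = """---
base_model:
- LatitudeGames/Wayfarer-2-12B
datasets:
- zerofata/Roleplay-Anime-Characters
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text adventure
- roleplay
- nvfp4
model_size: 12B
---

# Wayfarer-2-12B-NVFP4-FP8
"""

meta = front_matter_keys(README_AFTER)
print(meta.get("library_name"))
```

For anything beyond a quick check, a real YAML parser would be the safer choice, since this sketch ignores nesting, quoting, and multi-line values.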

Files changed (1)
  1. README.md +26 -7
README.md CHANGED
@@ -1,24 +1,29 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - LatitudeGames/Wayfarer-2-12B
+datasets:
+- zerofata/Roleplay-Anime-Characters
+language:
+- en
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
 tags:
 - text adventure
 - roleplay
 - nvfp4
 model_size: 12B
-datasets:
-- zerofata/Roleplay-Anime-Characters
-pipeline_tag: text-generation
 ---
+
 ![image/jpeg](Wayfarer-2-12B.jpg)
 
 # Wayfarer-2-12B-NVFP4-FP8
 
 Quantized weights of the [Wayfarer-2-12B](https://huggingface.co/LatitudeGames/Wayfarer-2-12B) model for use with nVidia Blackwell GPUs, in a hybrid format using NVFP4 with [Four Over Six](https://arxiv.org/abs/2512.02010) adaptive block scaling for the MLP layers and `FP8_DYNAMIC` for the self-attention layers. More information about the hybrid format [here](https://huggingface.co/DataSnake/Mistral-Nemo-Instruct-2407-NVFP4-FP8), but the short version is that FP8 attention has minimal impact on speed and VRAM usage while making a marked difference in output quality, especially at longer context lengths.
 
+- **Paper:** [Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling](https://arxiv.org/abs/2512.02010)
+- **Code:** [https://github.com/mit-han-lab/fouroversix](https://github.com/mit-han-lab/fouroversix)
+
 ## Inference
 Tested on a RTX 5060 Ti 16GB with [Aphrodite Engine](https://github.com/aphrodite-engine/aphrodite-engine) and [vLLM](https://github.com/vllm-project/vllm). It requires compressed-tensors 0.14.0 or later, so you'll have to update the version in your venv if you use Aphrodite Engine 0.10.0 or an older version of vLLM. On my system, Aphrodite Engine 0.10.0 was able to run the checkpoint with a 32k context window with the `--single-user-mode` flag, while vLLM 0.20.0 and Aphrodite Engine 0.20.0, which don't have that flag, were able to do the same with `--max-num-seqs 1 --cudagraph-capture-sizes 2` flags, though with the caveat that each crashed with OOM errors the first time they ran the model but ran fine from the second time onwards.
 <details>
@@ -76,4 +81,18 @@ As such, I would recommend using that format for inference.
 
 Wayfarer-2-12B was made by [Latitude Games](https://huggingface.co/LatitudeGames) with help from [Gryphe Padar](https://huggingface.co/Gryphe)
 
-Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han
+
+## Citation
+
+```bibtex
+@misc{cook2025sixaccuratenvfp4quantization,
+title={Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling},
+author={Jack Cook and Junxian Guo and Guangxuan Xiao and Yujun Lin and Song Han},
+year={2025},
+eprint={2512.02010},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2512.02010},
+}
+```