# Wayfarer-2-12B-NVFP4-4over6
Quantized NVFP4 weights of the Wayfarer-2-12B model, for use with NVIDIA Blackwell GPUs.
## Quantization details
Quantized with llm-compressor 0.9.0.2, using Four Over Six adaptive block scaling with MSE selection for the weights. This was done with the `memoryless_mse` observer with `maxshrink` and `grid` set to negative values. Calibrated with the Roleplay-Anime-Characters dataset, formatted as ChatML.
## A Brief Overview of Four Over Six
One of the main downsides of FP4 is the extreme sparsity of large values. At a base level, NVFP4 works by dividing the model into sixteen-element blocks, then assigning an FP8 scale factor to each block (as well as a single FP32 scale factor for the tensor as a whole) such that the largest absolute value in the block maps to ±6. For example, if a block has the values {10, -20, 40, -60}, the scale factor would be set to 10 and the FP4 values would be {1, -2, 4, -6}.

The problem is that the FP4 format can only represent a very limited set of values. In particular, it can't represent any number between 4 and 6, so anything in the block that maps to ±5 will be severely affected by rounding error. Changing the scale factor so that the maximum value for the block maps to ±4 reduces the maximum possible error introduced by this type of rounding, but it also increases the rounding error in smaller values, so quantizing the entire model with the maximum mapped to ±4 isn't a good idea either. In the following two graphs, the x-axis represents the value a given weight would have with the corresponding scaling, and the y-axis represents how much proportional error would be introduced by rounding it to the nearest valid FP4 value:

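To make the trade-off concrete, here's a toy sketch of the scaling choice described above. It's deliberately simplified (one plain float scale per block rather than an FP8 block scale plus a global FP32 scale), and the block contents are a hypothetical example: the block from the overview, extended with a -50 that lands on the problematic ±5 point.

```python
# Simplified NVFP4-style block quantizer: one float scale per block.
# (Real NVFP4 uses an FP8 block scale plus a global FP32 tensor scale.)
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 grid

def quantize_block(block, target=6.0):
    """Scale so the largest |value| maps to ±target, then round each
    element to the nearest representable FP4 magnitude."""
    scale = max(abs(v) for v in block) / target
    out = []
    for v in block:
        mag = min(FP4_MAGNITUDES, key=lambda f: abs(abs(v) / scale - f))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out

# The example block from above, plus -50, which maps to the awkward ±5.
block = [10.0, -20.0, 40.0, -50.0, -60.0]
print(quantize_block(block, target=6.0))  # -50 -> -40: error of 10
print(quantize_block(block, target=4.0))  # -50 -> -45, but 10 -> 7.5
```

Mapping the maximum to ±6 reproduces every value except -50 exactly, but that one value takes an error of 10; mapping it to ±4 shrinks the worst-case error at the cost of small errors on every other element.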
## My Implementation
I discovered that the `memoryless_mse` observer used by llm-compressor can be configured to emulate Four Over Six with the following settings (full recipe in `recipe.yaml`):
```yaml
maxshrink: -1.0
grid: -2.0
norm: 2.0
```
The `memoryless_mse` observer works by trying different values of p to multiply the minimum and maximum weights in each block by, calculating scale factors from those, comparing the MSE between the original and quantized weights for each value of p, and selecting the value that performs best. It's mainly used with p ≤ 1, which decreases the scale factors to represent smaller values more precisely at the cost of clipping larger ones, since they end up mapping above 6 or below -6. More specifically, the most extreme value in each block maps to ±6/p. Here's a graph of proportional rounding error in NVFP4 with p = 0.8, which would cause the largest value in the block to map to ±7.5:
This is basically the same tradeoff as Four Over Six, but in the opposite direction: increasing the worst-case error to decrease rounding error for smaller values. As such, it should come as no surprise that the same observer can be repurposed to implement Four Over Six. The key was these lines in mse.py:
```python
for i in range(int(maxshrink * grid)):
    p = 1 - i / grid
```
With maxshrink set to -1 and grid to -2, this for loop runs twice: first with p=1.0, then with p=1.5. Since the observer tests scales that map the most extreme value in each block to ±6/p, trying with p=1.5 results in scaling them to ±4.
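The search can be sketched as follows. This is not llm-compressor's actual implementation (the real observer operates on tensors and has more machinery); it's a minimal scalar version showing how the loop above, with these settings, ends up picking the ±4 scaling whenever it wins on MSE. The block values are the same hypothetical example as in the overview.

```python
import math

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 grid

def fake_quantize(block, scale):
    """Round each element to the nearest FP4 magnitude at the given scale."""
    return [math.copysign(min(FP4_MAGNITUDES, key=lambda f: abs(abs(v) / scale - f)) * scale, v)
            for v in block]

def select_scale(block, maxshrink=-1.0, grid=-2.0, norm=2.0):
    """Toy version of the MSE scale search with the Four Over Six settings."""
    absmax = max(abs(v) for v in block)
    best_scale, best_err = None, float("inf")
    for i in range(int(maxshrink * grid)):  # runs twice: i = 0, 1
        p = 1 - i / grid                    # p = 1.0, then p = 1.5
        scale = absmax * p / 6.0            # most extreme value maps to ±6/p
        err = sum(abs(v - q) ** norm
                  for v, q in zip(block, fake_quantize(block, scale)))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

# For the block [10, -20, 40, -50, -60], the p = 1.5 (max -> ±4) scale
# wins because it roughly halves the total squared error.
print(select_scale([10.0, -20.0, 40.0, -50.0, -60.0]))  # -> 15.0
```

For blocks whose values already sit exactly on the FP4 grid at the default scale, p = 1.0 has zero error and is kept, so the selection only switches to ±4 scaling where it actually helps.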
I only did this for the weights, calibrating the activations with the default `static_minmax` observer. Since the only activation-related values stored in the model are the per-tensor FP32 scale factors, stretching those to accommodate Four Over Six would be counterproductive, as the adjustment wouldn't be used at runtime anyway.
## Inference
Tested on an RTX 5060 Ti 16GB with Aphrodite Engine and vLLM. Since Four Over Six is just a different way of selecting scale factors rather than a different format, this model should run fine with any backend that supports the compressed-tensors NVFP4 format. It requires compressed-tensors 0.13.0 or later, so you'll have to update that package in your venv if you use Aphrodite Engine or an older version of vLLM. On my system, Aphrodite Engine was able to run the checkpoint with a 32k context window using the --single-user-mode flag, while vLLM didn't have quite enough VRAM to do the same; vLLM works fine at shorter context lengths or with the KV cache quantized, however.
Recommended generation settings (a mix of what it says on the Wayfarer-2-12B model card and the AI Dungeon Model Guide entry for Wayfarer 2):
- Temperature: 1.1
- Top K: 300
- Top P: 0.85
- Min P: 0.025
- Repetition Penalty: 1.05
- Presence Penalty: 0.5
- Frequency Penalty: 0.2
If you're using a program that supports DRY and XTC (at the time of writing, Aphrodite Engine supports both and vLLM supports neither), you can also try using them to cut down on repetition without setting temperature, presence penalty, and frequency penalty quite so high.
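If you're serving the model through a vLLM- or Aphrodite-style OpenAI-compatible API, the settings above can be packed into a request body. Note that `top_k`, `min_p`, and `repetition_penalty` are backend extensions to the standard OpenAI fields, so check your backend's documentation for the exact names it accepts; the payload below is an assumed example, not a tested configuration.

```python
# Recommended sampler settings as an OpenAI-compatible request body.
# top_k, min_p, and repetition_penalty are vLLM/Aphrodite extensions;
# exact field names may vary by backend version.
payload = {
    "model": "DataSnake/Wayfarer-2-12B-NVFP4-4over6",
    "messages": [{"role": "user", "content": "> You peer into the darkness."}],
    "temperature": 1.1,
    "top_k": 300,
    "top_p": 0.85,
    "min_p": 0.025,
    "repetition_penalty": 1.05,
    "presence_penalty": 0.5,
    "frequency_penalty": 0.2,
}
```

Send it to your server's `/v1/chat/completions` endpoint with any HTTP client.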
## Prompt Format
As mentioned above, the calibration data was provided with the same ChatML tags as had been used to finetune Latitude's 12B models:
```
<|im_start|>system
You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
<|im_start|>user
> You peer into the darkness.<|im_end|>
<|im_start|>assistant
You have been eaten by a grue.<|im_end|>
```
As such, I would recommend using that format for inference.
## Credits
- Wayfarer-2-12B was made by Latitude Games with help from Gryphe Padar.
- Four Over Six was discovered by Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han.
Base model: mistralai/Mistral-Nemo-Base-2407