---
license: apache-2.0
language:
- en
base_model:
- LatitudeGames/Muse-12B
tags:
- text adventure
- roleplay
- nvfp4
- tensorrt-llm
model_size: 12B
datasets:
- agentlans/distilled-roleplay
pipeline_tag: text-generation
---

![image/jpeg](muse.jpg)

# Muse-12B-NVFP4

Quantized NVFP4 weights of the [Muse-12B](https://huggingface.co/LatitudeGames/Muse-12B) model, for use with NVIDIA Blackwell GPUs.

## Quantization details

Quantized with TensorRT-Model-Optimizer 0.37.0.

Calibration used the [distilled-roleplay](https://huggingface.co/datasets/agentlans/distilled-roleplay) dataset, formatted with the same ChatML tags originally used to train the Wayfarer and Muse models. This was accomplished by adding the following code to the start of `hf_ptq.py`:

```python
from modelopt.torch.utils import dataset_utils

# Register the calibration dataset so hf_ptq.py can select it by name.
dataset_utils.SUPPORTED_DATASET_CONFIG["distilled-roleplay"] = {
    "config": {
        "path": "agentlans/distilled-roleplay",
        "split": ["train"],
    },
    # Render each ShareGPT-style conversation as ChatML, mapping the dataset's
    # role names ("system"/"human"/"gpt") onto the tags Muse was trained with.
    "preprocess": lambda sample: "".join(
        f"<|im_start|>{ {'system': 'system', 'human': 'user', 'gpt': 'assistant'}[turn['from']] }\n"
        f"{turn['value'].strip()}<|im_end|>\n"
        for turn in sample["conversations"]
    ),
}
```
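
For illustration, here is what that preprocessing produces for a single sample (the sample below is invented to show the shape of the data, which follows the dataset's ShareGPT-style schema):

```python
# Made-up sample in the dataset's schema, run through the same logic as
# the preprocess lambda above.
sample = {
    "conversations": [
        {"from": "system", "value": "You're a masterful storyteller."},
        {"from": "human", "value": "> You peer into the darkness."},
        {"from": "gpt", "value": "You have been eaten by a grue."},
    ]
}

roles = {"system": "system", "human": "user", "gpt": "assistant"}
text = "".join(
    f"<|im_start|>{roles[turn['from']]}\n{turn['value'].strip()}<|im_end|>\n"
    for turn in sample["conversations"]
)
print(text)
```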

## Inference

Tested on an RTX 5060 Ti 16GB with TensorRT-LLM, vLLM, SGLang, and Aphrodite Engine.

Recommended generation settings (a blend of the Muse-12B model card's recommendations and the [AI Dungeon Model Guide](https://help.aidungeon.com/ai-models-and-their-differences)); see the example after this list:
- Temperature: 1.0
- Top K: 250
- Top P: 1
- Min P: 0.025
- Repetition Penalty: 1.05
- Presence Penalty: 0.25
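
As a minimal sketch, here is how those settings map onto vLLM's `SamplingParams` (the model path, prompt, and token limit are placeholders):

```python
# Sketch: the recommended sampling settings expressed for vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="./Muse-12B-NVFP4")  # placeholder: local path or HF repo id

params = SamplingParams(
    temperature=1.0,
    top_k=250,
    top_p=1.0,
    min_p=0.025,
    repetition_penalty=1.05,
    presence_penalty=0.25,
    max_tokens=512,  # placeholder limit
)

prompt = "<|im_start|>system\n..."  # see Prompt Format below
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```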

## Prompt Format

As mentioned above, the calibration data was formatted with the same ChatML tags that were used to finetune Latitude's 12B models:
```
<|im_start|>system
You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
<|im_start|>user
> You peer into the darkness.<|im_end|>
<|im_start|>assistant
You have been eaten by a grue.<|im_end|>
```
As such, I would recommend using that format for inference.
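
A small helper along these lines can assemble that format from a message list (the function and message shape here are illustrative, not part of the model):

```python
# Minimal sketch: build a ChatML prompt in the format shown above.
def to_chatml(messages: list[dict[str, str]]) -> str:
    prompt = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return prompt + "<|im_start|>assistant\n"  # leave the assistant turn open

prompt = to_chatml([
    {"role": "system", "content": "You're a masterful storyteller and gamemaster."},
    {"role": "user", "content": "> You peer into the darkness."},
])
```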

## Credits

Muse-12B was made by [Latitude Games](https://huggingface.co/LatitudeGames) with help from [Gryphe Padar](https://huggingface.co/Gryphe).