DataSnake/Muse-12B-NVFP4

@@ -10,30 +10,64 @@ tags:
 - nvfp4
 - tensorrt-llm
 model_size: 12B
 ---
 ![image/jpeg](muse.jpg)
 # Muse-12B
-Quantized NVFP4 weights of the [Muse-12B](https://huggingface.co/LatitudeGames/Muse-12B) model.
 Quantized with TensorRT-Model-Optimizer 0.37.0
-Calibrated using the [distilled-roleplay](https://huggingface.co/datasets/agentlans/distilled-roleplay) dataset, tagged in the same ChatML format used to train the Wayfarer and Muse models in the first place. This was accomplished by adding the following code to `SUPPORTED_DATASET_CONFIG` inside dataset_utils.py:
 ```
-    "distilled-roleplay": {
-        "config": {
-            "path": "agentlans/distilled-roleplay",
-            "split": ["train"],
-        },
-        "preprocess": lambda sample: "".join(
-            f"<|im_start|>{ {'system':'system','human':'user','gpt':'assistant'}[turn['from']] }\n"
-            f"{turn['value'].strip()}<|im_end|>\n"
-            for turn in sample["conversations"]
-        ),
     },
 ```
-Tested on TensorRT-LLM on a RTX 5060 Ti.

 - nvfp4
 - tensorrt-llm
 model_size: 12B
+datasets:
+- agentlans/distilled-roleplay
+pipeline_tag: text-generation
 ---
 ![image/jpeg](muse.jpg)
 # Muse-12B
+Quantized NVFP4 weights of the [Muse-12B](https://huggingface.co/LatitudeGames/Muse-12B) model, for use with NVIDIA Blackwell GPUs.
+## Quantization details
 Quantized with TensorRT-Model-Optimizer 0.37.0
+Calibrated using the [distilled-roleplay](https://huggingface.co/datasets/agentlans/distilled-roleplay) dataset, tagged in the same ChatML format used to train the Wayfarer and Muse models in the first place. This was accomplished by adding the following code to the start of `hf_ptq.py`:
 ```
+from modelopt.torch.utils import dataset_utils
+dataset_utils.SUPPORTED_DATASET_CONFIG["distilled-roleplay"] = {
+    "config": {
+        "path": "agentlans/distilled-roleplay",
+        "split": ["train"],
     },
+    "preprocess": lambda sample: "".join(
+        f"<|im_start|>{ {'system':'system','human':'user','gpt':'assistant'}[turn['from']] }\n"
+        f"{turn['value'].strip()}<|im_end|>\n"
+        for turn in sample["conversations"]
+    ),
+}
+```
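To make the effect of that `preprocess` lambda concrete, here is a minimal standalone sketch of the same logic applied to one made-up sample. The function name `to_chatml` and the sample text are hypothetical and only for illustration. The `conversations` / `from` / `value` field names are the ones the lambda above already relies on.

```python
# Role mapping copied from the lambda above: dataset "from" values -> ChatML roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_chatml(sample):
    """Same logic as the `preprocess` lambda, written out as a named function."""
    return "".join(
        f"<|im_start|>{ROLE_MAP[turn['from']]}\n"
        f"{turn['value'].strip()}<|im_end|>\n"
        for turn in sample["conversations"]
    )


# Hypothetical sample shaped like one dataset row (text invented for illustration).
sample = {
    "conversations": [
        {"from": "system", "value": "You are a storyteller. "},
        {"from": "human", "value": "> You open the door."},
        {"from": "gpt", "value": "The hinges creak."},
    ]
}

print(to_chatml(sample))
```

Each turn becomes one `<|im_start|>role ... <|im_end|>` block, with surrounding whitespace stripped from the turn text, so the calibration text looks like the prompts the model sees at inference time.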
+## Inference
+Tested on an RTX 5060 Ti 16GB with TensorRT-LLM, vLLM, and SGLang. Of the three, vLLM worked best for me: TensorRT-LLM couldn't handle as large a context window as the other two, and SGLang had fewer sampling options available.
+Recommended generation settings (a mix of the recommendations on the Muse-12B model card and the [AI Dungeon Model Guide](https://help.aidungeon.com/ai-models-and-their-differences)):
+- Temperature: 1.0
+- Top K: 250
+- Top P: 1
+- Min P: 0.025
+- Repetition Penalty: 1.05
+- Presence Penalty: 0.25
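As a rough sketch of how the settings above could be passed along, the snippet below builds a request body for vLLM's OpenAI-compatible chat endpoint. It only constructs and prints the JSON and does not send anything. The helper name `build_request`, the `max_tokens` value, and the example message are my own assumptions. The `model` field has to match whatever name your server is actually serving the model under.

```python
import json


def build_request(messages, max_tokens=256):
    """Build a chat-completions request body using the recommended settings."""
    return {
        # Assumption: the server exposes the model under its repo name.
        "model": "DataSnake/Muse-12B-NVFP4",
        "messages": messages,
        "max_tokens": max_tokens,  # arbitrary illustrative value
        # Standard OpenAI-style sampling fields.
        "temperature": 1.0,
        "top_p": 1.0,
        "presence_penalty": 0.25,
        # Extra sampling fields that vLLM accepts beyond the OpenAI API.
        "top_k": 250,
        "min_p": 0.025,
        "repetition_penalty": 1.05,
    }


payload = build_request([{"role": "user", "content": "> You peer into the darkness."}])
print(json.dumps(payload, indent=2))
```

The resulting JSON can be POSTed to the server's `/v1/chat/completions` route with any HTTP client. Other backends may name or support these fields differently, so check their documentation before reusing it.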
+## Prompt Format
+As mentioned above, the calibration data was tagged with the same ChatML format used to fine-tune Latitude's 12B models:
 ```
+<|im_start|>system
+You're a masterful storyteller and gamemaster. Write in second person present tense (You are), crafting vivid, engaging narratives with authority and confidence.<|im_end|>
+<|im_start|>user
+> You peer into the darkness.<|im_end|>
+<|im_start|>assistant
+You have been eaten by a grue.<|im_end|>
+```
+As such, I would recommend using that format for inference.
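If you are using a raw text-completion endpoint instead of a chat endpoint, the prompt has to be assembled by hand. The sketch below shows one way to do that. The helper name `build_prompt` is hypothetical, and ending the prompt with an open `<|im_start|>assistant` header is the usual ChatML convention, which I am assuming applies here.

```python
def build_prompt(messages):
    """Render role/content messages as ChatML and leave the assistant turn open."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # Open assistant header so the model writes the next turn (assumed convention).
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_prompt([
    {"role": "system", "content": "You're a masterful storyteller and gamemaster."},
    {"role": "user", "content": "> You peer into the darkness."},
])
print(prompt)
```

With a prompt like this, setting `<|im_end|>` as a stop sequence should keep the model from running on into the next turn.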
+## Credits
+Muse-12B was made by [Latitude Games](https://huggingface.co/LatitudeGames) with help from [Gryphe Padar](https://huggingface.co/Gryphe).