---
license: other
license_name: mrl
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model: mistralai/Mistral-Large-Instruct-2407
language:
- en
- fr
- de
- es
- it
- pt
- ru
- zh
- ja
pipeline_tag: text-generation
tags:
- chat
---

# Mistral-Large-Instruct-2407 FP8

This repository contains the quantized weights for [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

The weights, activations, and KV cache have all been quantized to FP8. You can load this model with either [vLLM](https://github.com/vllm-project/vllm) or [Aphrodite Engine](https://github.com/PygmalionAI/aphrodite-engine).
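
For example, here is a minimal offline-inference sketch with vLLM (the model ID below is a placeholder for this repository, and `tensor_parallel_size` depends on your hardware):

```py
from vllm import LLM, SamplingParams

llm = LLM(
    model="<this-repo-id>",   # placeholder: substitute this repository's ID or a local path
    kv_cache_dtype="fp8",     # use the FP8 KV cache the weights were calibrated for
    tensor_parallel_size=4,   # adjust to your GPU count
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```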

## Quantization Method

The library used is [llm-compressor](https://github.com/vllm-project/llm-compressor). Install it with:

```console
pip install llmcompressor
```

Then run this script:

```py
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "mistralai/Mistral-Large-Instruct-2407"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select the calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"  # Or use your own dataset
DATASET_SPLIT = "train_sft"

# You can increase the number of samples to improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per-tensor scales
# * quantize the activations to fp8 with per-tensor scales
# * quantize the kv cache to fp8 with per-tensor scales
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
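
# Sanity check (not in the original script): confirm the quantized model
# still generates coherent text before saving.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0]))
print("===========================================")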

# Save the compressed model and tokenizer to disk.
SAVE_DIR = "./Mistral-Large-Instruct-2407-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
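
Once saved, the checkpoint can be served directly. Here is a sketch using vLLM's OpenAI-compatible server (flags and defaults vary across vLLM versions, so treat this as illustrative):

```console
vllm serve ./Mistral-Large-Instruct-2407-FP8 --kv-cache-dtype fp8 --tensor-parallel-size 4
```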