RedHatAI/Llama-Guard-4-12B-FP8-dynamic

Safetensors

llama4

compressed-tensors


ekurtic committed on Feb 16

Commit

1ac8d87

verified ·

1 Parent(s): ffd7cae

Update README.md

README.md +77 -1

@@ -44,4 +44,80 @@ If you are running `vllm > 0.15.0`, you will likely have the bug fixes already a
 | Toxic Chat                | 0.433        | 0.425        | 98.15         | 0.519        | 0.519        | 100               |
 | ToxiGen                   | 0.46         | 0.47         | 102.17        | 0.315        | 0.325        | 103.17            |
 | XSTest                    | 0.834        | 0.833        | 99.88         | 0.78         | 0.775        | 99.36             |
-| Average Score             | 0.6711282051 | 0.6729230769 | 100.5220513   | 0.5706410256 | 0.5725641026 | 100.8784615       |

+| Average Score             | 0.6711282051 | 0.6729230769 | 100.5220513   | 0.5706410256 | 0.5725641026 | 100.8784615       |
+## Model creation
+This model was created with `compressed-tensors==0.13.0` and `llmcompressor==0.9.0.1`, using the following command and LLM-Compressor quantization script (`quantize.py`):
+```bash
+CUDA_VISIBLE_DEVICES=0 python quantize.py --model_path meta-llama/Llama-Guard-4-12B --quant_path RedHatAI/Llama-Guard-4-12B-FP8-dynamic --pipeline datafree
+```
+```python
+import argparse
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM, Llama4ForConditionalGeneration
+from llmcompressor.modifiers.quantization import QuantizationModifier
+from llmcompressor import oneshot
+from compressed_tensors.quantization import (
+    QuantizationScheme,
+    QuantizationArgs,
+    QuantizationType,
+    QuantizationStrategy,
+)
+def main():
+    parser = argparse.ArgumentParser(description="Quantize a causal language model")
+    parser.add_argument(
+        "--model_path",
+        type=str,
+        required=True,
+        help="Path to the pre-trained model",
+    )
+    parser.add_argument(
+        "--quant_path",
+        type=str,
+        required=True,
+        help="Output path for the quantized model",
+    )
+    parser.add_argument(
+        "--pipeline",  # one of: 'basic', 'datafree', 'sequential', 'independent'
+        type=str,
+        required=True,
+    )
+    # Parse the CLI arguments; without this, `args` is undefined below.
+    args = parser.parse_args()
+    print(f"Loading model from {args.model_path}...")
+    model = Llama4ForConditionalGeneration.from_pretrained(
+        args.model_path,
+        torch_dtype="auto",
+        trust_remote_code=True,
+    )
+    recipe = QuantizationModifier(
+        targets="Linear",
+        scheme="FP8_dynamic",
+        ignore=[
+            're:.*lm_head',
+            're:.*multi_modal_projector',
+            're:.*vision_model',
+        ]
+    )
+    print("Applying quantization...")
+    oneshot(
+        model=model,
+        recipe=recipe,
+        trust_remote_code_model=True,
+        pipeline=args.pipeline,
+    )
+    model.save_pretrained(args.quant_path, save_compressed=True, skip_compression_stats=True, disable_sparse_compression=True)
+    print(f"Quantized model saved to {args.quant_path}")
+if __name__ == "__main__":
+    main()
+```
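
The `ignore` list in the recipe above uses `re:`-prefixed patterns to keep the output head, the multimodal projector, and the vision tower out of quantization. The following is a minimal sketch of how such patterns could select modules. It assumes that a `re:` pattern is applied as a regular expression anchored at the start of the module name; `is_ignored` is a hypothetical helper and the module names are illustrative, not read from the actual checkpoint.

```python
import re

# Ignore patterns copied from the recipe above; the "re:" prefix marks a regex.
IGNORE = ["re:.*lm_head", "re:.*multi_modal_projector", "re:.*vision_model"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Hypothetical helper: True if any ignore pattern matches the module name."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumption: regex patterns are matched from the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            # Patterns without the prefix are treated as exact names here.
            return True
    return False


# Illustrative module names (assumed, not taken from the real model).
names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.lm_head",
    "multi_modal_projector.linear_1",
    "vision_model.model.layers.0.self_attn.q_proj",
]

# Modules that would remain eligible for FP8 quantization under this sketch.
quantized = [n for n in names if not is_ignored(n)]
print(quantized)
```

Under these assumptions only the language-model projection layer stays eligible, which is consistent with the recipe's intent of quantizing the text decoder's `Linear` layers while leaving the other components in their original precision.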