## Evaluation

Evaluations were produced with [GuardBench](https://github.com/neuralmagic/GuardBench), using vLLM as the inference engine (vllm==0.15.0, with bug fixes from this PR).
| Dataset | meta-llama/Llama-Guard-4-12B F1 | RedHatAI/Llama-Guard-4-12B-quantized.w4a16 (this model) F1 | F1 Recovery % | meta-llama/Llama-Guard-4-12B Recall | RedHatAI/Llama-Guard-4-12B-quantized.w4a16 Recall | Recall Recovery % |
|---|---|---|---|---|---|---|
| AART | 0.874 | 0.865 | 98.97 | 0.776 | 0.761 | 98.07 |
| AdvBench Behaviors | 0.964 | 0.968 | 100.41 | 0.931 | 0.938 | 100.75 |
| AdvBench Strings | 0.83 | 0.823 | 99.16 | 0.709 | 0.699 | 98.59 |
| BeaverTails 330k | 0.732 | 0.727 | 99.32 | 0.591 | 0.584 | 98.82 |
| Bot-Adversarial Dialogue | 0.513 | 0.499 | 97.27 | 0.376 | 0.361 | 96.01 |
| CatQA | 0.932 | 0.927 | 99.46 | 0.873 | 0.864 | 98.97 |
| ConvAbuse | 0.241 | 0.248 | 102.9 | 0.148 | 0.156 | 105.41 |
| DecodingTrust Stereotypes | 0.591 | 0.54 | 91.37 | 0.419 | 0.37 | 88.31 |
| DICES 350 | 0.118 | 0.118 | 100 | 0.063 | 0.063 | 100 |
| DICES 990 | 0.219 | 0.226 | 103.2 | 0.135 | 0.135 | 100 |
| Do Anything Now Questions | 0.746 | 0.74 | 99.2 | 0.595 | 0.587 | 98.66 |
| DoNotAnswer | 0.546 | 0.539 | 98.72 | 0.376 | 0.368 | 97.87 |
| DynaHate | 0.603 | 0.587 | 97.35 | 0.481 | 0.459 | 95.43 |
| HarmEval | 0.56 | 0.571 | 101.96 | 0.389 | 0.4 | 102.83 |
| HarmBench Behaviors | 0.959 | 0.954 | 99.48 | 0.922 | 0.912 | 98.92 |
| HarmfulQ | 0.86 | 0.857 | 99.65 | 0.755 | 0.75 | 99.34 |
| HarmfulQA Questions | 0.588 | 0.583 | 99.15 | 0.416 | 0.411 | 98.8 |
| HarmfulQA | 0.374 | 0.347 | 92.78 | 0.231 | 0.21 | 90.91 |
| HateCheck | 0.782 | 0.77 | 98.47 | 0.667 | 0.649 | 97.3 |
| Hatemoji Check | 0.625 | 0.609 | 97.44 | 0.474 | 0.457 | 96.41 |
| HEx-PHI | 0.966 | 0.955 | 98.86 | 0.933 | 0.913 | 97.86 |
| I-CoNa | 0.837 | 0.813 | 97.13 | 0.719 | 0.685 | 95.27 |
| I-Controversial | 0.596 | 0.621 | 104.19 | 0.425 | 0.45 | 105.88 |
| I-MaliciousInstructions | 0.824 | 0.817 | 99.15 | 0.7 | 0.69 | 98.57 |
| I-Physical-Safety | 0.493 | 0.482 | 97.77 | 0.34 | 0.33 | 97.06 |
| JBB Behaviors | 0.86 | 0.86 | 100 | 0.86 | 0.86 | 100 |
| MaliciousInstruct | 0.953 | 0.958 | 100.52 | 0.91 | 0.92 | 101.1 |
| MITRE | 0.663 | 0.649 | 97.89 | 0.495 | 0.48 | 96.97 |
| NicheHazardQA | 0.46 | 0.469 | 101.96 | 0.299 | 0.307 | 102.68 |
| OpenAI Moderation Dataset | 0.739 | 0.741 | 100.27 | 0.787 | 0.78 | 99.11 |
| ProsocialDialog | 0.427 | 0.414 | 96.96 | 0.276 | 0.265 | 96.01 |
| SafeText | 0.372 | 0.368 | 98.92 | 0.254 | 0.246 | 96.85 |
| SimpleSafetyTests | 0.985 | 0.99 | 100.51 | 0.97 | 0.98 | 101.03 |
| StrongREJECT Instructions | 0.91 | 0.902 | 99.12 | 0.836 | 0.822 | 98.33 |
| TDCRedTeaming | 0.947 | 0.958 | 101.16 | 0.9 | 0.92 | 102.22 |
| TechHazardQA | 0.758 | 0.75 | 98.94 | 0.61 | 0.6 | 98.36 |
| Toxic Chat | 0.433 | 0.433 | 100 | 0.519 | 0.508 | 97.88 |
| ToxiGen | 0.46 | 0.444 | 96.52 | 0.315 | 0.3 | 95.24 |
| XSTest | 0.834 | 0.832 | 99.76 | 0.78 | 0.765 | 98.08 |
| Average Score | 0.671 | 0.665 | 99.13 | 0.571 | 0.563 | 98.46 |
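The recovery columns report the quantized metric as a percentage of the baseline metric. For example, taking the AART F1 values from the table:

```python
def recovery_pct(baseline: float, quantized: float) -> float:
    # Recovery % = quantized metric / baseline metric * 100, rounded as in the table
    return round(quantized / baseline * 100, 2)

# AART F1: baseline 0.874, quantized (this model) 0.865
print(recovery_pct(0.874, 0.865))  # 98.97
```

Values above 100% simply mean the quantized model scored slightly higher than the baseline on that dataset.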
## Model creation
This model was created with compressed-tensors==0.13.0 and llmcompressor==0.9.0.1, using the following LLM-Compressor quantization script:

```shell
CUDA_VISIBLE_DEVICES=0 python quantize.py \
  --model_path meta-llama/Llama-Guard-4-12B \
  --quant_path RedHatAI/Llama-Guard-4-12B-quantized.w4a16 \
  --group_size 128 \
  --calib_size 1024 \
  --dampening_frac 0.01 \
  --observer minmax \
  --sym True \
  --actorder False \
  --pipeline independent
```
```python
import argparse

from datasets import load_dataset
from transformers import AutoProcessor, Llama4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)


def parse_actorder(value):
    # Interpret the input value for --actorder
    if value.lower() == "false":
        return False
    elif value.lower() == "group":
        return "group"
    elif value.lower() == "weight":
        return "weight"
    else:
        raise argparse.ArgumentTypeError("Invalid value for --actorder. Use 'group', 'weight', or 'False'.")


def parse_sym(value):
    # Interpret the input value for --sym
    if value.lower() == "false":
        return False
    elif value.lower() == "true":
        return True
    else:
        raise argparse.ArgumentTypeError(f"Invalid value for --sym. Use 'true' or 'false', but got {value}")


parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, required=True)
parser.add_argument('--quant_path', type=str, required=True)
parser.add_argument('--group_size', type=int, required=True)
parser.add_argument('--calib_size', type=int, required=True)
parser.add_argument('--dampening_frac', type=float, required=True)
parser.add_argument('--observer', type=str, required=True)             # 'mse' or 'minmax'
parser.add_argument('--sym', type=parse_sym, required=True)            # 'true' or 'false'
parser.add_argument('--actorder', type=parse_actorder, required=True)  # 'group', 'weight', or 'false'
parser.add_argument('--pipeline', type=str, default="basic")           # 'basic', 'datafree', 'sequential', or 'independent'
args = parser.parse_args()

model = Llama4ForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(args.model_path, trust_remote_code=True)


def preprocess_fn(example):
    # Wrap each message's content in the structure expected by the multimodal processor
    for msg in example["messages"]:
        msg["content"] = [{'type': 'text', 'text': msg['content']}]
    return {"text": processor.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}


ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

print("=" * 80)
print(f"[For debugging] Calibration data sample is:\n{repr(ds[0]['text'])}")
print("=" * 80)

quant_scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        symmetric=args.sym,
        group_size=args.group_size,
        strategy=QuantizationStrategy.GROUP,
        observer=args.observer,
        actorder=args.actorder,
    ),
    input_activations=None,
    output_activations=None,
)

recipe = [
    GPTQModifier(
        targets=["Linear"],
        ignore=[
            "re:.*lm_head",
            "re:.*multi_modal_projector",
            "re:.*vision_model",
        ],
        dampening_frac=args.dampening_frac,
        config_groups={"group_0": quant_scheme},
    )
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=args.calib_size,
    max_seq_length=4096,
    pipeline=args.pipeline,
)

SAVE_DIR = args.quant_path
model.save_pretrained(SAVE_DIR)
print(f"Model saved to {SAVE_DIR}. Please manually copy other files like the tokenizer and preprocessor configs.")
```
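As the final print statement notes, `save_pretrained` here only writes the model weights and config; auxiliary files such as the tokenizer and preprocessor configs must be copied over separately. A minimal sketch of that copy step (the file-name patterns are assumptions; adjust them to the actual checkpoint contents):

```python
import shutil
from pathlib import Path


def copy_aux_files(src_dir: str, dst_dir: str) -> list:
    # Copy tokenizer/preprocessor files from the original model directory
    # into the quantized save directory; weight files are left untouched.
    patterns = ["tokenizer*", "*processor_config.json", "chat_template*"]  # assumed file names
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for f in Path(src_dir).glob(pattern):
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return sorted(copied)
```

Alternatively, `AutoProcessor.from_pretrained(args.model_path).save_pretrained(SAVE_DIR)` achieves the same result through the transformers API.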