## Evaluation

Evaluations were produced with [GuardBench](https://github.com/neuralmagic/GuardBench), using vLLM as the inference engine (vllm==0.15.0, with bug fixes from this PR).
| Dataset | meta-llama/Llama-Guard-4-12B F1 | RedHatAI/Llama-Guard-4-12B-quantized.w4a16 (this model) F1 | F1 Recovery % | meta-llama/Llama-Guard-4-12B Recall | RedHatAI/Llama-Guard-4-12B-quantized.w4a16 Recall | Recall Recovery % |
|---|---|---|---|---|---|---|
| AART | 0.874 | 0.865 | 98.97 | 0.776 | 0.761 | 98.07 |
| AdvBench Behaviors | 0.964 | 0.968 | 100.41 | 0.931 | 0.938 | 100.75 |
| AdvBench Strings | 0.83 | 0.823 | 99.16 | 0.709 | 0.699 | 98.59 |
| BeaverTails 330k | 0.732 | 0.727 | 99.32 | 0.591 | 0.584 | 98.82 |
| Bot-Adversarial Dialogue | 0.513 | 0.499 | 97.27 | 0.376 | 0.361 | 96.01 |
| CatQA | 0.932 | 0.927 | 99.46 | 0.873 | 0.864 | 98.97 |
| ConvAbuse | 0.241 | 0.248 | 102.9 | 0.148 | 0.156 | 105.41 |
| DecodingTrust Stereotypes | 0.591 | 0.54 | 91.37 | 0.419 | 0.37 | 88.31 |
| DICES 350 | 0.118 | 0.118 | 100 | 0.063 | 0.063 | 100 |
| DICES 990 | 0.219 | 0.226 | 103.2 | 0.135 | 0.135 | 100 |
| Do Anything Now Questions | 0.746 | 0.74 | 99.2 | 0.595 | 0.587 | 98.66 |
| DoNotAnswer | 0.546 | 0.539 | 98.72 | 0.376 | 0.368 | 97.87 |
| DynaHate | 0.603 | 0.587 | 97.35 | 0.481 | 0.459 | 95.43 |
| HarmEval | 0.56 | 0.571 | 101.96 | 0.389 | 0.4 | 102.83 |
| HarmBench Behaviors | 0.959 | 0.954 | 99.48 | 0.922 | 0.912 | 98.92 |
| HarmfulQ | 0.86 | 0.857 | 99.65 | 0.755 | 0.75 | 99.34 |
| HarmfulQA Questions | 0.588 | 0.583 | 99.15 | 0.416 | 0.411 | 98.8 |
| HarmfulQA | 0.374 | 0.347 | 92.78 | 0.231 | 0.21 | 90.91 |
| HateCheck | 0.782 | 0.77 | 98.47 | 0.667 | 0.649 | 97.3 |
| Hatemoji Check | 0.625 | 0.609 | 97.44 | 0.474 | 0.457 | 96.41 |
| HEx-PHI | 0.966 | 0.955 | 98.86 | 0.933 | 0.913 | 97.86 |
| I-CoNa | 0.837 | 0.813 | 97.13 | 0.719 | 0.685 | 95.27 |
| I-Controversial | 0.596 | 0.621 | 104.19 | 0.425 | 0.45 | 105.88 |
| I-MaliciousInstructions | 0.824 | 0.817 | 99.15 | 0.7 | 0.69 | 98.57 |
| I-Physical-Safety | 0.493 | 0.482 | 97.77 | 0.34 | 0.33 | 97.06 |
| JBB Behaviors | 0.86 | 0.86 | 100 | 0.86 | 0.86 | 100 |
| MaliciousInstruct | 0.953 | 0.958 | 100.52 | 0.91 | 0.92 | 101.1 |
| MITRE | 0.663 | 0.649 | 97.89 | 0.495 | 0.48 | 96.97 |
| NicheHazardQA | 0.46 | 0.469 | 101.96 | 0.299 | 0.307 | 102.68 |
| OpenAI Moderation Dataset | 0.739 | 0.741 | 100.27 | 0.787 | 0.78 | 99.11 |
| ProsocialDialog | 0.427 | 0.414 | 96.96 | 0.276 | 0.265 | 96.01 |
| SafeText | 0.372 | 0.368 | 98.92 | 0.254 | 0.246 | 96.85 |
| SimpleSafetyTests | 0.985 | 0.99 | 100.51 | 0.97 | 0.98 | 101.03 |
| StrongREJECT Instructions | 0.91 | 0.902 | 99.12 | 0.836 | 0.822 | 98.33 |
| TDCRedTeaming | 0.947 | 0.958 | 101.16 | 0.9 | 0.92 | 102.22 |
| TechHazardQA | 0.758 | 0.75 | 98.94 | 0.61 | 0.6 | 98.36 |
| Toxic Chat | 0.433 | 0.433 | 100 | 0.519 | 0.508 | 97.88 |
| ToxiGen | 0.46 | 0.444 | 96.52 | 0.315 | 0.3 | 95.24 |
| XSTest | 0.834 | 0.832 | 99.76 | 0.78 | 0.765 | 98.08 |
| Average Score | 0.671 | 0.665 | 99.13 | 0.571 | 0.563 | 98.46 |
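The recovery columns report the quantized metric as a percentage of the baseline metric. For example, taking the AART F1 values from the table:

```python
def recovery_pct(baseline: float, quantized: float) -> float:
    # Recovery % = quantized metric / baseline metric * 100, rounded as in the table
    return round(quantized / baseline * 100, 2)

# AART F1: baseline 0.874, quantized (this model) 0.865
print(recovery_pct(0.874, 0.865))  # 98.97
```

Values above 100% simply mean the quantized model scored slightly higher than the baseline on that dataset.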
## Model creation
This model was created with compressed-tensors==0.13.0 and llmcompressor==0.9.0.1, using the following LLM-Compressor quantization script:

```shell
CUDA_VISIBLE_DEVICES=0 python quantize.py \
  --model_path meta-llama/Llama-Guard-4-12B \
  --quant_path RedHatAI/Llama-Guard-4-12B-quantized.w4a16 \
  --group_size 128 \
  --calib_size 1024 \
  --dampening_frac 0.01 \
  --observer minmax \
  --sym True \
  --actorder False \
  --pipeline independent
```
```python
import argparse

from datasets import load_dataset
from transformers import AutoProcessor, Llama4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)


def parse_actorder(value):
    # Interpret the input value for --actorder
    if value.lower() == "false":
        return False
    elif value.lower() == "group":
        return "group"
    elif value.lower() == "weight":
        return "weight"
    else:
        raise argparse.ArgumentTypeError("Invalid value for --actorder. Use 'group', 'weight', or 'False'.")


def parse_sym(value):
    # Interpret the input value for --sym
    if value.lower() == "false":
        return False
    elif value.lower() == "true":
        return True
    else:
        raise argparse.ArgumentTypeError(f"Invalid value for --sym. Use 'true' or 'false', but got {value}")


parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, required=True)
parser.add_argument('--quant_path', type=str, required=True)
parser.add_argument('--group_size', type=int, required=True)
parser.add_argument('--calib_size', type=int, required=True)
parser.add_argument('--dampening_frac', type=float, required=True)
parser.add_argument('--observer', type=str, required=True)             # 'mse' or 'minmax'
parser.add_argument('--sym', type=parse_sym, required=True)            # 'true' or 'false'
parser.add_argument('--actorder', type=parse_actorder, required=True)  # 'group', 'weight', or 'false'
parser.add_argument('--pipeline', type=str, default="basic")           # 'basic', 'datafree', 'sequential', or 'independent'
args = parser.parse_args()

model = Llama4ForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(args.model_path, trust_remote_code=True)


def preprocess_fn(example):
    # Wrap each message's content in the structure expected by the multimodal processor
    for msg in example["messages"]:
        msg["content"] = [{'type': 'text', 'text': msg['content']}]
    return {"text": processor.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}


ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

print("=" * 80)
print(f"[For debugging] Calibration data sample is:\n{repr(ds[0]['text'])}")
print("=" * 80)

quant_scheme = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        symmetric=args.sym,
        group_size=args.group_size,
        strategy=QuantizationStrategy.GROUP,
        observer=args.observer,
        actorder=args.actorder,
    ),
    input_activations=None,
    output_activations=None,
)

recipe = [
    GPTQModifier(
        targets=["Linear"],
        ignore=[
            "re:.*lm_head",
            "re:.*multi_modal_projector",
            "re:.*vision_model",
        ],
        dampening_frac=args.dampening_frac,
        config_groups={"group_0": quant_scheme},
    )
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=args.calib_size,
    max_seq_length=4096,
    pipeline=args.pipeline,
)

SAVE_DIR = args.quant_path
model.save_pretrained(SAVE_DIR)
print(f"Model saved to {SAVE_DIR}. Please manually copy other files like the tokenizer and preprocessor configs.")
```
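As the final print statement notes, `save_pretrained` here only writes the model weights and config; auxiliary files such as the tokenizer and preprocessor configs must be copied over separately. A minimal sketch of that copy step (the file-name patterns are assumptions; adjust them to the actual checkpoint contents):

```python
import shutil
from pathlib import Path


def copy_aux_files(src_dir: str, dst_dir: str) -> list:
    # Copy tokenizer/preprocessor files from the original model directory
    # into the quantized save directory; weight files are left untouched.
    patterns = ["tokenizer*", "*processor_config.json", "chat_template*"]  # assumed file names
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for f in Path(src_dir).glob(pattern):
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return sorted(copied)
```

Alternatively, `AutoProcessor.from_pretrained(args.model_path).save_pretrained(SAVE_DIR)` achieves the same result through the transformers API.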