gemma-4-31B-it-NVFP4
Model Overview
- Model Architecture: google/gemma-4-31B-it
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Release Date: 2026-04-04
- Version: 1.0
- Model Developers: RedHatAI
This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
Model Optimizations
This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP4 data type, ready for inference with vLLM. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
Only the weights and activations of the linear operators within transformer blocks are quantized, using LLM Compressor. The vision tower, embedding, and output head layers are kept in their original precision.
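As a rough sanity check of the ~75% figure, here is a back-of-the-envelope sketch; it ignores quantization scale overhead and the layers left in original precision, so the real saving is slightly smaller:

```python
# Approximate weight memory at 16-bit vs. 4-bit precision for ~31B parameters.
# Ignores quantization scale overhead and the unquantized vision/embedding/
# lm_head layers.
params = 31e9
bf16_gb = params * 16 / 8 / 1e9   # 16 bits per parameter
nvfp4_gb = params * 4 / 8 / 1e9   # 4 bits per parameter
print(f"BF16:   {bf16_gb:.0f} GB")              # ~62 GB
print(f"NVFP4:  {nvfp4_gb:.1f} GB")             # ~15.5 GB
print(f"Saving: {1 - nvfp4_gb / bf16_gb:.0%}")  # 75%
```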
Deployment
Use with vLLM
This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.
- Start the vLLM server:
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 --max-model-len 32768
To enable thinking/reasoning and tool calling:
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--max-model-len 32768 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice
Tip: For text-only workloads, pass `--limit-mm-per-prompt image=0` to skip vision encoder memory allocation. Set `--gpu-memory-utilization 0.90` to maximize KV cache capacity.
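For example, a text-only deployment combining these flags might look like the following (a sketch; adjust values to your hardware):

```bash
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --max-model-len 32768 \
  --limit-mm-per-prompt image=0 \
  --gpu-memory-utilization 0.90
```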
- Send requests to the server:
from openai import OpenAI

# vLLM's OpenAI-compatible server requires no API key unless it was
# started with --api-key, so a placeholder value is sufficient.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-NVFP4"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
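Since the model also accepts image input, the same client can send a multimodal request. A minimal sketch (the image URL is a placeholder):

```python
# Multimodal request: content is a list of typed parts instead of a string.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
outputs = client.chat.completions.create(model=model, messages=messages)
print(outputs.choices[0].message.content)
```

Note that this requires the server to have been started without `--limit-mm-per-prompt image=0`.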
Creation
This model was created by applying LLM Compressor with calibration samples from The Pile, as presented in the code snippet below.
from datasets import load_dataset
from transformers import AutoProcessor, Gemma4ForConditionalGeneration, ProcessorMixin
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
MODEL_ID = "google/gemma-4-31B-it"
model = Gemma4ForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
NUM_CALIBRATION_SAMPLES = 32  # number of calibration sequences
MAX_SEQUENCE_LENGTH = 2048    # tokens per calibration sequence

# Calibration data: a backup copy of The Pile's validation split.
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
# Quantize every Linear layer to NVFP4; keep the vision tower, any audio
# modules, embeddings, and the output head in original precision.
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=["re:.*vision.*", "re:.*audio.*", "lm_head", "re:.*embed.*"],
    ),
]
def get_calib_dataset(processor: ProcessorMixin):
    # Load 10x the required samples so enough survive the length filter below.
    ds = load_dataset(
        DATASET_ID,
        split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES * 10}]",
    )

    def preprocess(example):
        # Tokenize each raw text sample and truncate to the calibration length.
        return {
            "input_ids": processor.tokenizer.encode(example["text"].strip())[
                :MAX_SEQUENCE_LENGTH
            ]
        }

    # Keep only sequences that fill the full context window, then take the
    # first NUM_CALIBRATION_SAMPLES of them.
    ds = (
        ds.shuffle(seed=42)
        .map(preprocess, remove_columns=ds.column_names)
        .filter(lambda example: len(example["input_ids"]) >= MAX_SEQUENCE_LENGTH)
        .select(range(NUM_CALIBRATION_SAMPLES))
    )
    return ds
# Apply the quantization recipe in a single one-shot calibration pass.
oneshot(
    model=model,
    processor=processor,
    dataset=get_calib_dataset(processor),
    recipe=recipe,
    batch_size=1,
    shuffle_calibration_samples=False,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed checkpoint, e.g. to "gemma-4-31B-it-NVFP4".
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
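After saving, a quick smoke test of the compressed checkpoint can be run locally. A sketch, assuming a vLLM installation with NVFP4 support and a compatible GPU:

```python
# Load the freshly saved checkpoint and generate a short completion.
from vllm import LLM, SamplingParams

llm = LLM(model=SAVE_DIR, max_model_len=4096)
sampling = SamplingParams(max_tokens=64)
result = llm.generate(["Briefly explain weight quantization."], sampling)
print(result[0].outputs[0].text)
```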
Evaluation
This model was evaluated on GSM8k-Platinum, MMLU-CoT, MMLU-Pro, and IFEval using lm-evaluation-harness, with the model served by vLLM through its OpenAI-compatible API. All evaluations were performed with thinking turned off.
Accuracy
| Category | Benchmark | google/gemma-4-31B-it | RedHatAI/gemma-4-31B-it-NVFP4 | Recovery |
|---|---|---|---|---|
| Instruction Following | GSM8k-Platinum (5-shot, strict-match) | 97.60 | 97.71 | 100.1% |
| | MMLU-CoT (5-shot, strict_match) | 90.53 | 90.06 | 99.5% |
| | MMLU-Pro (5-shot, custom-extract) | 85.03 | 84.07 | 98.9% |
| | IFEval (0-shot, prompt-level strict) | 91.07 | 90.45 | 99.3% |
| | IFEval (0-shot, inst-level strict) | 93.76 | 93.45 | 99.7% |
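Recovery is the quantized score expressed as a percentage of the unquantized baseline, for example:

```python
# Recovery for MMLU-Pro, using the scores from the table above.
baseline, quantized = 85.03, 84.07
print(f"{100 * quantized / baseline:.1f}%")  # 98.9%
```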
Reproduction
The results were obtained using the following commands. Each benchmark was run 3 times with different random seeds (42, 1234, 4158), and the scores were averaged.
vLLM server:
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 --max-model-len 96000
GSM8k-Platinum (lm-eval, 5-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks gsm8k_platinum_cot_llama \
--model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
--num_fewshot 5 \
--apply_chat_template \
--fewshot_as_multiturn \
--output_path results_gsm8k_platinum.json \
--seed 1234 \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
MMLU-CoT (lm-eval, 5-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks mmlu_cot_llama \
--model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
--num_fewshot 5 \
--apply_chat_template \
--fewshot_as_multiturn \
--output_path results_mmlu_cot.json \
--seed 1234 \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
MMLU-Pro (lm-eval, 5-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks mmlu_pro_chat \
--model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
--num_fewshot 5 \
--apply_chat_template \
--fewshot_as_multiturn \
--output_path results_mmlu_pro.json \
--seed 1234 \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
IFEval (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
--tasks ifeval \
--model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
--apply_chat_template \
--fewshot_as_multiturn \
--output_path results_ifeval.json \
--seed 1234 \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
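To reproduce the three-seed average, each command can be wrapped in a loop over the seeds while the vLLM server is running. A sketch for GSM8k-Platinum (the other tasks follow the same pattern):

```bash
# Run the benchmark once per seed; scores are then averaged across runs.
for seed in 42 1234 4158; do
  lm_eval --model local-chat-completions \
    --tasks gsm8k_platinum_cot_llama \
    --model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --output_path "results_gsm8k_platinum_seed${seed}.json" \
    --seed "$seed" \
    --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=${seed}"
done
```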