Gemma 4 12B IT NVFP4
This is an NVIDIA ModelOpt NVFP4 quantization of google/gemma-4-12B-it.
The checkpoint is intended for ModelOpt-aware runtimes such as vLLM and
TensorRT-LLM. It is not a plain Transformers checkpoint: the weights are packed
NVFP4 tensors with ModelOpt scale tensors.
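To make "packed NVFP4 tensors with scale tensors" more concrete, here is a minimal sketch of block-scaled 4-bit quantization in pure Python. It is an illustration under simplifying assumptions, not the ModelOpt implementation. It keeps the block scale as a Python float and does not pack codes into bytes. The real format, as commonly described, stores each block scale in FP8 (E4M3), adds a per-tensor FP32 scale, and packs two 4-bit codes per byte. The function names are hypothetical.

```python
# Simplified NVFP4-style block quantization (illustrative only).
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one shared scale per 16 consecutive values


def quantize_block(block):
    """Return (codes, scale): each code is a signed E2M1 grid value."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        magnitude = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        codes.append(-nearest if v < 0 else nearest)
    return codes, scale


def dequantize_block(codes, scale):
    """Reconstruct approximate values from grid codes and the block scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.01 * (i - 8) for i in range(BLOCK_SIZE)]
    codes, scale = quantize_block(weights)
    restored = dequantize_block(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.6f} worst_abs_error={worst:.6f}")
```

Because every block carries its own scale, a loader that expects ordinary floating-point weight tensors cannot interpret these files without understanding the packing and the scales.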
Quantization
- Quantizer: NVIDIA ModelOpt 0.44.0
- Model Optimizer examples tag: 0.44.0
- Quantization format: NVFP4
- Quantization hardware: 2x NVIDIA GeForce RTX 5090 GPUs (Blackwell, compute capability 12.0)
- KV-cache quantization: none baked into the checkpoint
- Calibration data: cnn_dailymail, 512 text samples, sequence length 512
- Export format: unified Hugging Face checkpoint
This checkpoint was quantized, calibrated, and exported on Blackwell RTX 5090 GPUs. The packed weights target NVIDIA's native Blackwell NVFP4 execution path in runtimes such as vLLM.
Multimodal files from the source checkpoint were preserved, including
processor_config.json, tokenizer.json, tokenizer_config.json,
chat_template.jinja, and generation_config.json. The ModelOpt export kept
the multimodal projection modules and lm_head unquantized:
- model.embed_vision*
- model.embed_audio*
- lm_head
The exported processor was smoke-tested with image input using the Gemma image
token <|image|>.
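The unquantized module patterns named above read like shell-style wildcards. As a rough sketch, assuming they are matched glob-style against fully qualified module names (an assumption, not confirmed ModelOpt or vLLM behavior), the check below classifies a module name. The example module names are hypothetical and only illustrate the matching.

```python
from fnmatch import fnmatchcase

# Patterns copied from the model card text above.
UNQUANTIZED_PATTERNS = ("model.embed_vision*", "model.embed_audio*", "lm_head")


def is_left_unquantized(module_name, patterns=UNQUANTIZED_PATTERNS):
    """Return True if the module name matches any unquantized pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical module names, used only for illustration.
    for name in (
        "model.embed_vision.embedding_projection",
        "model.embed_audio.embedding_projection",
        "lm_head",
        "model.language_model.layers.0.mlp.down_proj",
    ):
        kind = "unquantized" if is_left_unquantized(name) else "NVFP4"
        print(f"{name}: {kind}")
```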
vLLM Usage
Use a vLLM build with ModelOpt NVFP4 support and run on Blackwell-class GPUs for
native NVFP4 execution. Pass quantization="modelopt_fp4" explicitly when
loading this checkpoint.
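Before loading, it can help to confirm that the visible GPU is Blackwell-class. The sketch below assumes that a compute capability with major version 10 or higher counts as Blackwell-class; the RTX 5090 used for this checkpoint reports 12.0. That threshold and the helper name are assumptions for illustration, not a rule taken from vLLM.

```python
def is_blackwell_class(capability):
    """capability is a (major, minor) tuple, e.g. (12, 0) for an RTX 5090.

    Assumption: major version >= 10 is treated as Blackwell-class.
    """
    major, _minor = capability
    return major >= 10


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        verdict = "Blackwell-class" if is_blackwell_class(capability) else "not Blackwell-class"
        print(f"GPU 0 compute capability {capability[0]}.{capability[1]}: {verdict}")
    else:
        print("No CUDA device visible; skipping the check.")
```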
Python
uv pip install -U vllm
from vllm import LLM, SamplingParams
llm = LLM(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Explain why the sky is blue."],
    SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)
OpenAI-Compatible Server
vllm serve berkerdooo/gemma-4-12B-it-NVFP4 \
    --quantization modelopt_fp4 \
    --trust-remote-code
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": "Explain NVFP4 quantization in one paragraph.",
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
Multimodal Request
The processor, tokenizer, chat template, and image token from the original Gemma 4 checkpoint are included in this repo. With the vLLM server above, image inputs can be sent through the OpenAI-compatible chat API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
                    },
                },
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
If you build prompts manually instead of using the server API, use the standard
Gemma 4 multimodal chat template and include the <|image|> token for image
inputs.
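The server request above fetches the image from a public URL. For a local file, one common approach is to embed the bytes as a base64 data URL in the same image_url field. The sketch below builds such a message; it assumes the server accepts data URLs in image_url, and the helper names are hypothetical.

```python
import base64


def image_bytes_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt, image_bytes, mime_type="image/png"):
    """Build a user message with one text part and one inline image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": image_bytes_to_data_url(image_bytes, mime_type)},
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice read a real file in binary mode.
    message = build_image_message("Describe this image.", b"\x89PNG placeholder")
    print(message["content"][1]["image_url"]["url"][:40])
```

The returned dictionary can be passed as an element of the messages list in client.chat.completions.create.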
Reproduction Command
CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python hf_ptq.py \
  --pyt_ckpt_path google/gemma-4-12B-it \
  --export_path ./gemma-4-12B-it-nvfp4 \
  --qformat nvfp4 \
  --kv_cache_qformat none \
  --dataset cnn_dailymail \
  --calib_size 512 \
  --calib_seq 512 \
  --batch_size 1 \
  --use_seq_device_map \
  --gpu_max_mem_percentage 0.90 \
  --attn_implementation sdpa \
  --skip_generate
Verification
- Export completed successfully with peak GPU memory of 23.53 GB on GPU 0 and 0.98 GB on GPU 1 using 2x NVIDIA GeForce RTX 5090 GPUs.
- config.json loads as Gemma4UnifiedForConditionalGeneration.
- AutoProcessor loads as Gemma4UnifiedProcessor.
- A dummy image-text processor call produced input_ids, attention_mask, mm_token_type_ids, pixel_values, and image_position_ids.
Plain transformers.AutoModelForCausalLM.from_pretrained is not the target
loader for this checkpoint unless the runtime understands ModelOpt-packed NVFP4
weights.
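The architecture check from the verification notes can be repeated on a local copy of the checkpoint without loading any weights. The sketch below reads the standard architectures field from config.json. The default directory is the export path from the reproduction command, and the helper names are hypothetical.

```python
import json
import sys
from pathlib import Path

# Architecture name taken from the verification notes above.
EXPECTED_ARCHITECTURE = "Gemma4UnifiedForConditionalGeneration"


def declared_architectures(checkpoint_dir):
    """Return the list of architectures declared in config.json."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return list(config.get("architectures", []))


def has_expected_architecture(checkpoint_dir, expected=EXPECTED_ARCHITECTURE):
    """Return True if the expected architecture is declared."""
    return expected in declared_architectures(checkpoint_dir)


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "./gemma-4-12B-it-nvfp4"
    if (Path(path) / "config.json").is_file():
        print(f"{path}: expected architecture found = {has_expected_architecture(path)}")
    else:
        print(f"No config.json found under {path}")
```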