Gemma 4 12B IT NVFP4

This is an NVIDIA ModelOpt NVFP4 quantization of google/gemma-4-12B-it. The checkpoint is intended for ModelOpt-aware runtimes such as vLLM and TensorRT-LLM. It is not a plain Transformers checkpoint: the weights are packed NVFP4 tensors with ModelOpt scale tensors.

Quantization

  • Quantizer: NVIDIA ModelOpt 0.44.0
  • Model Optimizer examples tag: 0.44.0
  • Quantization format: NVFP4
  • Quantization hardware: 2x NVIDIA GeForce RTX 5090 GPUs (Blackwell, compute capability 12.0)
  • KV-cache quantization: none baked into the checkpoint
  • Calibration data: cnn_dailymail, 512 text samples, sequence length 512
  • Export format: unified Hugging Face checkpoint

This checkpoint was quantized, calibrated, and exported on Blackwell RTX 5090 GPUs. The packed weights target NVIDIA's native Blackwell NVFP4 execution path in runtimes such as vLLM.
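To make the packed format more concrete, here is a minimal pure-Python sketch of NVFP4-style block quantization. It is an illustration, not ModelOpt's implementation: the function names are hypothetical, the block scale is kept as an ordinary float rather than rounded to FP8 (E4M3), the per-tensor scale is omitted, and nothing is packed into bytes.

```python
# Simplified NVFP4-style block quantization (illustrative only).
# Assumptions: 16-element blocks, E2M1 magnitudes, and an unrounded float
# scale per block. Real NVFP4 stores the block scale as FP8 (E4M3) next to
# a per-tensor FP32 scale and packs two 4-bit codes into each byte.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes); each code is a signed value on the E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a block scale and its grid codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list, one 16-element block at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Under these assumptions each weight costs 4 bits plus one 8-bit scale shared by 16 weights, roughly 4.5 bits per weight before the per-tensor scale is counted.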

Supporting files from the source checkpoint were preserved, including processor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, and generation_config.json. The ModelOpt export kept the multimodal projection modules and the language model head unquantized:

  • model.embed_vision*
  • model.embed_audio*
  • lm_head

The exported processor was smoke-tested with image input using the Gemma image token <|image|>.

vLLM Usage

Use a vLLM build with ModelOpt NVFP4 support and run on Blackwell-class GPUs for native NVFP4 execution. Pass quantization="modelopt_fp4" explicitly when loading this checkpoint.
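Before loading, it can help to confirm that the GPU is Blackwell-class. The sketch below is one way to do that, not an official check: the helper name is hypothetical, and treating a compute capability major version of 10 or higher as "Blackwell or newer" is an assumption based on the RTX 5090 reporting 12.0.

```python
def looks_blackwell_class(capability):
    """capability is a (major, minor) tuple such as (12, 0) for an RTX 5090."""
    major, _minor = capability
    # Assumption: major version 10 or higher indicates Blackwell or newer.
    return major >= 10


if __name__ == "__main__":
    # Optional live check; skipped when PyTorch or a CUDA device is missing.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(cap, looks_blackwell_class(cap))
```

On older GPUs vLLM may refuse to load the checkpoint or fall back to a non-native path, depending on the build.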

Python

uv pip install -U vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Explain why the sky is blue."],
    SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)

OpenAI-Compatible Server

vllm serve berkerdooo/gemma-4-12B-it-NVFP4 \
  --quantization modelopt_fp4 \
  --trust-remote-code

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": "Explain NVFP4 quantization in one paragraph.",
        }
    ],
    temperature=0,
    max_tokens=128,
)

print(response.choices[0].message.content)

Multimodal Request

The processor, tokenizer, chat template, and image token from the original Gemma 4 checkpoint are included in this repo. With the vLLM server above, image inputs can be sent through the OpenAI-compatible chat API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
                    },
                },
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)

print(response.choices[0].message.content)

If you build prompts manually instead of using the server API, use the standard Gemma 4 multimodal chat template and include the <|image|> token for image inputs.

Reproduction Command

CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python hf_ptq.py \
  --pyt_ckpt_path google/gemma-4-12B-it \
  --export_path ./gemma-4-12B-it-nvfp4 \
  --qformat nvfp4 \
  --kv_cache_qformat none \
  --dataset cnn_dailymail \
  --calib_size 512 \
  --calib_seq 512 \
  --batch_size 1 \
  --use_seq_device_map \
  --gpu_max_mem_percentage 0.90 \
  --attn_implementation sdpa \
  --skip_generate

Verification

  • Export completed successfully with peak GPU memory of 23.53 GB on GPU 0 and 0.98 GB on GPU 1 using 2x NVIDIA GeForce RTX 5090 GPUs.
  • config.json loads as Gemma4UnifiedForConditionalGeneration.
  • AutoProcessor loads as Gemma4UnifiedProcessor.
  • A dummy image-text processor call produced input_ids, attention_mask, mm_token_type_ids, pixel_values, and image_position_ids.
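The config.json check above can be repeated on a local copy of the checkpoint without importing Transformers. This is a small sketch that assumes the class name appears in the standard architectures field of config.json; the helper names and the directory argument are hypothetical.

```python
import json
from pathlib import Path

EXPECTED_ARCHITECTURE = "Gemma4UnifiedForConditionalGeneration"


def declared_architectures(checkpoint_dir):
    """Read the architectures list from config.json in a local checkpoint."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("architectures", [])


def has_expected_architecture(checkpoint_dir, expected=EXPECTED_ARCHITECTURE):
    """True when the expected class name is declared in config.json."""
    return expected in declared_architectures(checkpoint_dir)
```

For example, has_expected_architecture("./gemma-4-12B-it-nvfp4") would inspect the export directory used in the reproduction command.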

Plain transformers.AutoModelForCausalLM.from_pretrained is not the intended loader for this checkpoint. Load it with a runtime that understands ModelOpt-packed NVFP4 weights, such as vLLM or TensorRT-LLM.
