This is a preliminary version (subject to change) of the FP8-quantized google/gemma-4-31B-it model, distributed by BC Card.
This work follows the Red Hat AI approach and validation methodology.

The model has both weights and activations quantized to FP8 using vllm-project/llm-compressor; as the "-Dynamic" suffix indicates, activation scales are computed on the fly at inference time rather than from a calibration set.
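To make "dynamic" concrete, here is a minimal pure-Python sketch of per-tensor dynamic FP8 (E4M3) quantization: the scale comes from the runtime max-abs of the tensor, so no calibration pass is needed. This is illustrative only; llm-compressor and vLLM do this in fused kernels with true E4M3 rounding, not Python lists.

```python
def fp8_e4m3_dynamic_quantize(values):
    """Dynamic FP8-style quantization sketch: scale is derived from the
    runtime max-abs of the input, so activations need no calibration.
    Illustrative only -- real kernels also round to the E4M3 grid."""
    FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # Map into the representable FP8 range.
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

acts = [0.5, -3.2, 896.0, -0.01]
q, s = fp8_e4m3_dynamic_quantize(acts)
print(dequantize(q, s))  # recovers the original values (up to rounding)
```

Weights are quantized once offline the same way; activations repeat this per forward pass.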

This model requires a nightly vLLM wheel. For the reference installation and execution flow, see the Red Hat AI / vLLM-based guidance: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#installing-vllm
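Before running the evaluation, the model must be served on the endpoint the `lm_eval` command points at. A minimal sketch, assuming the nightly wheel index from the vLLM docs and plain `vllm serve` defaults (defer to the linked recipe where the flags differ):

```shell
# Install a nightly vLLM wheel (index URL per the vLLM docs; may change).
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

# Serve the FP8 checkpoint on the host/port the lm_eval command below targets.
vllm serve BCcard/gemma-4-31B-it-FP8-Dynamic \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 96000
```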

On a single B200:

```shell
lm_eval \
  --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=BCcard/gemma-4-31B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```
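After the run, `--output_path` leaves a results JSON on disk. A small helper for pulling the scores out, assuming the lm-evaluation-harness layout of `results[task]["metric,filter"]` (the `sample` dict below is a hypothetical excerpt, not real output):

```python
import json

def extract_gsm8k_metrics(results: dict) -> dict:
    """Pull exact_match per filter from an lm-eval results dict.

    Assumes the lm-eval-harness schema: results[task]["metric,filter"].
    """
    task = results["results"]["gsm8k_platinum_cot_llama"]
    return {
        filt: task[f"exact_match,{filt}"]
        for filt in ("flexible-extract", "strict-match")
    }

# Hypothetical excerpt mirroring what lm_eval writes to --output_path.
sample = {
    "results": {
        "gsm8k_platinum_cot_llama": {
            "exact_match,flexible-extract": 0.9768,
            "exact_match,strict-match": 0.9777,
        }
    }
}
print(extract_gsm8k_metrics(sample))
```

In practice you would `json.load(open("results_gsm8k_platinum.json"))` instead of the inline dict.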

Original:

|         Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k_platinum_cot_llama|      3|flexible-extract|     5|exact_match|↑  |0.976|±  |0.0044|
|                        |       |strict-match    |     5|exact_match|↑  |0.976|±  |0.0044|

FP8:

|         Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama|      3|flexible-extract|     5|exact_match|↑  |0.9768|±  |0.0043|
|                        |       |strict-match    |     5|exact_match|↑  |0.9777|±  |0.0043|
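A quick sanity check that the FP8 scores sit within noise of the original, using the strict-match numbers from the two tables above (the stderr of the difference is a rough independent-sample approximation):

```python
# Strict-match exact_match from the tables: original vs. FP8.
orig, orig_se = 0.976, 0.0044
fp8, fp8_se = 0.9777, 0.0043

recovery = fp8 / orig                          # fraction of baseline accuracy retained
diff = fp8 - orig
combined_se = (orig_se**2 + fp8_se**2) ** 0.5  # rough stderr of the difference

print(f"recovery: {recovery:.2%}")
print(f"difference {diff:+.4f} vs. ~1 combined stderr {combined_se:.4f}")
```

The difference is well under one combined standard error, i.e. the FP8 checkpoint fully recovers baseline accuracy on this task.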
Model size: 33B parameters · Tensor types: BF16, F8_E4M3