This is a preliminary version (subject to change) of the FP8-quantized google/gemma-4-26B-A4B-it model, distributed by BC Card.
This work was carried out with reference to the Red Hat AI methodology and deployment approach.
Both the weights and the activations are quantized to FP8 using vllm-project/llm-compressor.
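"FP8-Dynamic" means activation scales are computed at runtime from each tensor's observed range rather than calibrated offline. The following is a minimal NumPy sketch of that idea, simulating e4m3 rounding with a 3-bit mantissa; it is an illustration of the scheme, not the llm-compressor implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def simulate_e4m3(v):
    # Rough e4m3 rounding: keep 3 mantissa bits (ignores subnormals/NaN).
    m, e = np.frexp(v)  # v = m * 2**e, with |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16.0) / 16.0, e)

def fp8_dynamic_quant_dequant(x):
    # Dynamic per-tensor scale chosen from the observed absmax at runtime,
    # so no calibration pass over sample data is needed.
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = simulate_e4m3(np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return q * scale  # dequantized values

x = np.array([0.013, -1.7, 3.14159, -0.0002, 2.5])
y = fp8_dynamic_quant_dequant(x)
# Relative error stays within the ~1/16 mantissa step of e4m3.
```

The small, bounded round-trip error is why weight/activation FP8 typically costs only a fraction of a point of accuracy, consistent with the evaluation results below.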
Run it with:

```shell
vllm serve BCcard/gemma-4-26B-A4B-it-FP8-Dynamic --max-model-len 96000
```

on vLLM main (a nightly build is recommended, following the Red Hat AI reference setup).
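Once the server is up, it exposes an OpenAI-compatible API. A minimal sketch of a chat request payload (the prompt text is an arbitrary example; the model name matches the serve command above):

```python
import json

# OpenAI-compatible chat payload for the vLLM server started above.
payload = {
    "model": "BCcard/gemma-4-26B-A4B-it-FP8-Dynamic",
    "messages": [{"role": "user", "content": "What is 12 * 7?"}],
    "max_tokens": 256,
    "temperature": 1.0,
    "top_p": 0.95,
}
body = json.dumps(payload)
# POST `body` to http://0.0.0.0:8000/v1/chat/completions
# with Content-Type: application/json (e.g. via curl or the openai client).
```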
On a single B200:

```shell
lm_eval \
  --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=BCcard/gemma-4-26B-A4B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```
Original:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.9702|± |0.0049|
| | |strict-match | 5|exact_match|↑ |0.9702|± |0.0049|
FP8 (BC Card distribution, validated with reference to the Red Hat AI approach):
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.9669|± |0.0051|
| | |strict-match | 5|exact_match|↑ |0.9669|± |0.0051|
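As a quick sanity check on the tables above, the FP8 build retains about 99.7% of the original exact-match score:

```python
# GSM8K-Platinum exact-match scores from the tables above.
bf16_acc = 0.9702  # original model
fp8_acc = 0.9669   # FP8-Dynamic build
recovery = fp8_acc / bf16_acc
print(f"Accuracy recovery: {recovery:.2%}")  # → 99.66%
```

The 0.0033 absolute drop is well within the reported ±0.005 standard error, so the two scores are statistically indistinguishable on this benchmark.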
Model tree for BCCard/gemma-4-26B-A4B-it-FP8-Dynamic
- Base model: google/gemma-4-26B-A4B-it