---
base_model:
- google/gemma-4-31B-it
tags:
- gemma4
- fp8
- vllm
- compressed-tensors
- bccard
- redhat-ai-reference
name: BCcard/gemma-4-31B-it-FP8-Dynamic
---
This is a preliminary version (and subject to change) of the FP8 quantized `google/gemma-4-31B-it` model, distributed by **BC Card**.
This work was carried out with reference to the **Red Hat AI** approach and validation methodology.
Both weights and activations are quantized to FP8 with `vllm-project/llm-compressor`; with the dynamic scheme, activation scales are computed per token at runtime, so no calibration dataset is needed.
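As a rough illustration only (not the exact recipe used for this checkpoint), an FP8-Dynamic quantization run with `llm-compressor` typically looks like the sketch below; the `targets` and `ignore` choices are assumptions for illustration:

```python
# Hypothetical sketch of an FP8-Dynamic run with llm-compressor.
# The actual recipe for this checkpoint may differ.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize Linear layers (assumed target set)
    scheme="FP8_DYNAMIC",  # static weight scales, dynamic per-token activation scales
    ignore=["lm_head"],    # keep the output head unquantized (assumed)
)

# FP8_DYNAMIC requires no calibration data, so no dataset is passed.
oneshot(
    model="google/gemma-4-31B-it",
    recipe=recipe,
    output_dir="gemma-4-31B-it-FP8-Dynamic",
)
```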
This model requires a nightly `vllm` wheel. For the reference installation and execution flow, see the Red Hat AI / vLLM-based guidance:
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#installing-vllm
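Before running the evaluation below, the model must be served behind an OpenAI-compatible endpoint. A minimal sketch follows; the install command is the usual vLLM nightly-wheel route, and the serve flags are illustrative assumptions (see the linked recipe for the authoritative flow):

```shell
# Install a nightly vLLM wheel (see the recipe link above for the
# authoritative command for your platform).
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

# Serve with an OpenAI-compatible API on port 8000.
# --max-model-len here is an illustrative choice sized to the eval's
# max_length below; tune it to your hardware.
vllm serve BCcard/gemma-4-31B-it-FP8-Dynamic \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 98304
```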
Evaluation on a single NVIDIA B200:
```shell
lm_eval \
--model local-chat-completions \
--tasks gsm8k_platinum_cot_llama \
--model_args "model=BCcard/gemma-4-31B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
--num_fewshot 5 \
--apply_chat_template \
--fewshot_as_multiturn \
--output_path results_gsm8k_platinum.json \
--seed 1234 \
--gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```
Original:
```
| Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.976|± |0.0044|
| | |strict-match | 5|exact_match|↑ |0.976|± |0.0044|
```
FP8:
```
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.9768|± |0.0043|
| | |strict-match | 5|exact_match|↑ |0.9777|± |0.0043|
```
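From the two tables above, the FP8 checkpoint recovers essentially 100% of the original model's exact-match accuracy, well within the reported stderr. A small sketch of that arithmetic:

```python
# Accuracy recovery of the FP8 checkpoint vs. the original model,
# using the gsm8k_platinum_cot_llama exact_match numbers above.
original = {"flexible-extract": 0.976, "strict-match": 0.976}
fp8 = {"flexible-extract": 0.9768, "strict-match": 0.9777}

for metric in original:
    recovery = fp8[metric] / original[metric] * 100
    print(f"{metric}: {recovery:.2f}% recovery")
# flexible-extract: 100.08% recovery
# strict-match: 100.17% recovery
```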