---
base_model:
  - google/gemma-4-31B-it

tags:
  - gemma4
  - fp8
  - vllm
  - compressed-tensors
  - bccard
  - redhat-ai-reference

name: BCcard/gemma-4-31B-it-FP8-Dynamic
---

This is a preliminary version (and subject to change) of the FP8 quantized `google/gemma-4-31B-it` model, distributed by **BC Card**.  
This work was carried out with reference to the **Red Hat AI** approach and validation methodology.

Both weights and activations are quantized to FP8 using `vllm-project/llm-compressor`; activation scales are computed dynamically at inference time (hence "Dynamic" in the model name).
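
To make the scheme concrete, here is a minimal pure-Python sketch of per-tensor dynamic FP8 (E4M3) scaling, the idea behind dynamic activation quantization. It is an illustration only: real kernels use hardware FP8 types with non-uniform rounding, and the function names here are hypothetical, not llm-compressor's API.

```python
# Sketch of per-tensor dynamic FP8 (E4M3) scaling. Simplified: real FP8
# quantization also rounds to the nearest representable E4M3 value.
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def quantize_dynamic_fp8(values):
    """Compute a per-tensor scale from the runtime max magnitude, then
    map values into the FP8 range (clamping only, rounding omitted)."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return q, scale

def dequantize_fp8(q, scale):
    """Recover approximate original values from quantized values + scale."""
    return [v * scale for v in q]

activations = [0.1, -3.2, 7.5, 0.0]
q, scale = quantize_dynamic_fp8(activations)
restored = dequantize_fp8(q, scale)
```

Because the scale is derived from each tensor at runtime rather than calibrated offline, no calibration dataset is needed for activations.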

This model requires a nightly `vllm` wheel. For the reference installation and execution flow, see the Red Hat AI / vLLM-based guidance:
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#installing-vllm
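
Before running the evaluation below, the model needs to be served. A typical invocation (the exact flags may differ depending on your hardware; this is a sketch, not the validated launch command) looks like:

```shell
# Serve the FP8 model with vLLM's OpenAI-compatible server (default port 8000,
# matching the base_url used in the lm_eval command below).
vllm serve BCcard/gemma-4-31B-it-FP8-Dynamic
```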

On a single B200:

```
lm_eval \
  --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=BCcard/gemma-4-31B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```


Original (unquantized `google/gemma-4-31B-it`):

```
|         Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k_platinum_cot_llama|      3|flexible-extract|     5|exact_match|↑  |0.976|±  |0.0044|
|                        |       |strict-match    |     5|exact_match|↑  |0.976|±  |0.0044|
```


FP8 (this model):

```
|         Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama|      3|flexible-extract|     5|exact_match|↑  |0.9768|±  |0.0043|
|                        |       |strict-match    |     5|exact_match|↑  |0.9777|±  |0.0043|
```