Update README.md

README.md (changed)
On a single B200:

```
lm_eval \
  --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```
This is a preliminary version (and subject to change) of the FP8-quantized google/gemma-4-31B-it model. Both its weights and activations are quantized to FP8 with vllm-project/llm-compressor.

This model requires a nightly vLLM wheel; see the install instructions at https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#installing-vllm
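llm-compressor's FP8-dynamic scheme computes scales from each tensor on the fly rather than calibrating them offline. Below is a toy, pure-stdlib sketch of that idea only; it is not the llm-compressor implementation. It assumes the e4m3 format (max normal magnitude 448, three explicit mantissa bits) and emulates the rounding with `frexp`/`ldexp`:

```python
import math

E4M3_MAX = 448.0  # largest normal magnitude representable in float8 e4m3


def fp8_dynamic_quant(xs):
    """Toy quantize-dequantize simulating dynamic FP8 (e4m3).

    The scale is derived from the input itself ("dynamic"): its absolute
    max is mapped onto the e4m3 range, then each value is rounded to
    1 implicit + 3 explicit significand bits via frexp/ldexp.
    """
    scale = max(abs(v) for v in xs) / E4M3_MAX      # per-tensor dynamic scale
    out = []
    for v in xs:
        y = max(-E4M3_MAX, min(E4M3_MAX, v / scale))
        m, e = math.frexp(y)                        # y = m * 2**e, 0.5 <= |m| < 1
        m = round(m * 16.0) / 16.0                  # keep 4 significand bits
        out.append(math.ldexp(m, e) * scale)        # dequantize back to original scale
    return out
```

Because the scale maps the absolute max onto 448 (itself representable), the largest element round-trips exactly, and the worst-case relative rounding error for normal values is about 2**-4.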
On a single B200:

```
lm_eval --model local-chat-completions --tasks gsm8k_platinum_cot_llama --model_args "model=RedHatAI/gemma-4-31B-it-FP8-Dynamic,max_length=96000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=2400" --num_fewshot 5 --apply_chat_template --fewshot_as_multiturn --output_path results_gsm8k_platinum.json --seed 1234 --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=64000,seed=1234"
```
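The lm_eval commands above point `base_url` at a local OpenAI-compatible endpoint, so a vLLM server must already be listening on port 8000. A minimal launch sketch; the flags here are assumptions (the `--max-model-len` value mirrors the `max_length=96000` used above), so consult the recipe page linked earlier for the authoritative command:

```shell
# Serve the FP8 checkpoint with an OpenAI-compatible API on port 8000.
# Flag names/values are assumptions; see the vLLM Gemma recipe page.
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
  --max-model-len 96000 \
  --port 8000
```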
FP8:

```
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum_cot_llama| 3|flexible-extract| 5|exact_match|↑ |0.9768|± |0.0043|
```
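The table above is what lm_eval prints; the same numbers also land in the `--output_path` file (results_gsm8k_platinum.json). A sketch of pulling a metric back out of that file, assuming lm_eval's usual `"metric,filter"` key layout under `"results"` (the inline sample below stands in for the real file):

```python
import json


def get_metric(results, task, metric, filt):
    """Look up a '<metric>,<filter>' entry for one task in an lm_eval results dict."""
    return results["results"][task][f"{metric},{filt}"]


# Inline sample mirroring the assumed schema; in practice:
#   with open("results_gsm8k_platinum.json") as f: results = json.load(f)
results = json.loads("""
{
  "results": {
    "gsm8k_platinum_cot_llama": {
      "exact_match,flexible-extract": 0.9768,
      "exact_match_stderr,flexible-extract": 0.0043
    }
  }
}
""")

score = get_metric(results, "gsm8k_platinum_cot_llama", "exact_match", "flexible-extract")
```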