Instructions to use RedHatAI/gemma-4-31B-it-FP8-Dynamic with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use RedHatAI/gemma-4-31B-it-FP8-Dynamic with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RedHatAI/gemma-4-31B-it-FP8-Dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("RedHatAI/gemma-4-31B-it-FP8-Dynamic")
model = AutoModelForMultimodalLM.from_pretrained("RedHatAI/gemma-4-31B-it-FP8-Dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use RedHatAI/gemma-4-31B-it-FP8-Dynamic with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "RedHatAI/gemma-4-31B-it-FP8-Dynamic"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "RedHatAI/gemma-4-31B-it-FP8-Dynamic",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker
docker model run hf.co/RedHatAI/gemma-4-31B-it-FP8-Dynamic
- SGLang
How to use RedHatAI/gemma-4-31B-it-FP8-Dynamic with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RedHatAI/gemma-4-31B-it-FP8-Dynamic" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "RedHatAI/gemma-4-31B-it-FP8-Dynamic",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RedHatAI/gemma-4-31B-it-FP8-Dynamic" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "RedHatAI/gemma-4-31B-it-FP8-Dynamic",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
- Docker Model Runner
How to use RedHatAI/gemma-4-31B-it-FP8-Dynamic with Docker Model Runner:
docker model run hf.co/RedHatAI/gemma-4-31B-it-FP8-Dynamic
gemma-4-31B-it-FP8-Dynamic
Model Overview
- Model Architecture: google/gemma-4-31B-it
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 2026-04-04
- Version: 1.0
- Model Developers: RedHatAI
This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
Model Optimizations
This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP8 data type, with activations quantized dynamically per token, and is ready for inference with vLLM. This optimization reduces the number of bits per quantized parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.
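As a rough illustration of the arithmetic behind the 50% figure, the sketch below estimates weight storage at 16 and 8 bits per parameter. The 31-billion parameter count is an assumption read off the model name, not a measured value, and the calculation ignores layers that stay in their original precision as well as the small overhead of quantization scales, so the real saving is likely somewhat below this idealized bound.

```python
# Idealized weight-storage estimate. NOMINAL_PARAMS is an assumption taken
# from the "31B" in the model name, not a measured parameter count.
NOMINAL_PARAMS = 31e9


def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits_per_param / 8


gb_16bit = weight_bytes(NOMINAL_PARAMS, 16) / 1e9
gb_8bit = weight_bytes(NOMINAL_PARAMS, 8) / 1e9
reduction = 1 - gb_8bit / gb_16bit

print(f"16-bit weights: ~{gb_16bit:.0f} GB")
print(f"8-bit weights:  ~{gb_8bit:.0f} GB")
print(f"reduction:      {reduction:.0%}")
```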
Weights are quantized statically using per-channel FP8 scaling, and activations are quantized dynamically at inference time using per-token scaling. LLM Compressor quantizes only the weights and activations of the linear operators within the transformer blocks. The vision tower, embedding, and output head layers are kept in their original precision.
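The two scaling schemes described above can be sketched in plain Python. This is a minimal simulation, not the LLM Compressor or vLLM implementation: it assumes the FP8 E4M3 format (maximum finite value 448), emulates FP8 rounding in software, and uses hypothetical helper names (`round_to_e4m3`, `quantize_rows`, `fp8_linear`) and toy data.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed target format)


def round_to_e4m3(x: float) -> float:
    """Simulate rounding to FP8 E4M3 (3 mantissa bits), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, exp = math.frexp(x)      # |x| = m * 2**exp with 0.5 <= m < 1
    exp = max(exp, -5)          # values below 2**-6 fall in the subnormal range
    step = 2.0 ** (exp - 4)     # spacing between representable values
    return round(x / step) * step


def quantize_rows(matrix):
    """Quantize each row with its own scale so the row maximum maps to 448."""
    scales, quantized = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def fp8_linear(acts, weight):
    """Compute acts @ weight.T from simulated FP8 values and their scales."""
    q_w, w_scales = quantize_rows(weight)  # per output channel: static, done offline
    q_x, x_scales = quantize_rows(acts)    # per token: dynamic, done at inference
    return [
        [
            sum(a * w for a, w in zip(q_x[t], q_w[o])) * x_scales[t] * w_scales[o]
            for o in range(len(weight))
        ]
        for t in range(len(acts))
    ]


# Toy data for illustration only.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.7]]  # 2 output channels x 3 inputs
acts = [[1.0, 2.0, 3.0], [-0.5, 0.05, 4.0]]     # 2 tokens x 3 features

reference = [[sum(a * w for a, w in zip(x, row)) for row in weight] for x in acts]
approx = fp8_linear(acts, weight)
for ref_row, approx_row in zip(reference, approx):
    print([round(v, 4) for v in ref_row], [round(v, 4) for v in approx_row])
```

Because weight scales depend only on the checkpoint, they can be stored alongside the weights, whereas the per-token activation scales have to be recomputed for every input.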
Deployment
Use with vLLM
This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.
- Start the vLLM server:
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
To enable thinking/reasoning and tool calling:
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --chat-template examples/tool_chat_template_gemma4.jinja \
    --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
    --async-scheduling
Tip: For text-only workloads, pass --limit-mm-per-prompt '{"image": 0, "audio": 0}' to skip vision encoder memory allocation and free up GPU memory for a longer context window.
- Send requests to the server:
from openai import OpenAI

# The OpenAI client requires an API key; "EMPTY" is a placeholder value.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-FP8-Dynamic"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# Send a chat completion request and print the generated reply.
outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
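Since the model also accepts image input, a request can combine an image with a text prompt. The following is a minimal sketch using only the Python standard library against the OpenAI-compatible chat completions endpoint. The helper names (`build_image_request`, `send_chat_request`), the `localhost` address, and the `max_tokens` value are assumptions for illustration, and the server must not have been started with the image limit set to 0.

```python
import json
import urllib.request

MODEL = "RedHatAI/gemma-4-31B-it-FP8-Dynamic"


def build_image_request(image_url: str, question: str, max_tokens: int = 128) -> dict:
    """Build a chat completions payload with one image and one text part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_image_request(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "What animal is on the candy?",
)
print(json.dumps(payload, indent=2))
# With a server running, the request could be sent with:
# print(send_chat_request(payload))
```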
Creation
This model was created by applying data-free FP8 dynamic quantization with LLM Compressor, as presented in the code snippet below.
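from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"

# Quantize without calibration data; the vision tower, output head, and
# embedding layers are listed in `ignore` and keep their original precision.
model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_DYNAMIC",
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
)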
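To make the effect of the `ignore` list more concrete, the sketch below checks a few names against the same patterns. It assumes that entries prefixed with `re:` are treated as regular expressions matched from the start of the name and that other entries must match the name exactly; the actual matching logic inside LLM Compressor may differ. The module names are hypothetical and are not taken from the real checkpoint.

```python
import re

IGNORE = ["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if the name matches any ignore entry (assumed semantics)."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Assumption: "re:" entries are regexes anchored at the start.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.embed_tokens",
    "lm_head",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]

for name in examples:
    status = "kept in original precision" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```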
Evaluation
This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, and LiveCodeBench v6 using lm-evaluation-harness and lighteval, served with vLLM (OpenAI-compatible API). All evaluations were performed with thinking enabled.
Accuracy
| Category | Benchmark | google/gemma-4-31B-it | RedHatAI/gemma-4-31B-it-FP8-Dynamic | Recovery |
|---|---|---|---|---|
| Instruction Following | IFEval (0-shot, prompt-level strict) | 90.70 | 91.07 | 100.4% |
| Instruction Following | IFEval (0-shot, inst-level strict) | 93.45 | 93.76 | 100.3% |
| Reasoning | GSM8K Platinum (0-shot, strict-match) | 95.78 | 95.83 | 100.1% |
| Reasoning | MMLU-Pro (0-shot, custom-extract) | 85.41 | 85.32 | 99.9% |
| Reasoning | MATH-500 (0-shot, pass@1) | 89.40 | 90.27 | 101.0% |
| Reasoning | AIME 2025 (0-shot, pass@1) | 65.83 | 66.25 | 100.6% |
| Reasoning | GPQA Diamond (0-shot, pass@1) | 77.44 | 78.11 | 100.9% |
| Coding | LiveCodeBench v6 (0-shot, pass@1) | 71.43 | 70.67 | 98.9% |
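The Recovery column appears to be the quantized model's score expressed as a percentage of the unquantized model's score. Assuming that definition, the short sketch below recomputes it from the scores in the table; the `recovery` helper is a name introduced here for illustration.

```python
# (benchmark, unquantized score, FP8 score), copied from the accuracy table.
SCORES = [
    ("IFEval (prompt-level strict)", 90.70, 91.07),
    ("IFEval (inst-level strict)", 93.45, 93.76),
    ("GSM8K Platinum", 95.78, 95.83),
    ("MMLU-Pro", 85.41, 85.32),
    ("MATH-500", 89.40, 90.27),
    ("AIME 2025", 65.83, 66.25),
    ("GPQA Diamond", 77.44, 78.11),
    ("LiveCodeBench v6", 71.43, 70.67),
]


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for name, baseline, quantized in SCORES:
    print(f"{name}: {recovery(baseline, quantized):.1f}%")
```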
Reproduction
Each benchmark was run 3 times with different random seeds (1234, 2345, 3456), and the scores were averaged; AIME 2025 used 8 seeds.
The results were obtained using the following commands.
vLLM server (instruction following and reasoning benchmarks):
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
    --tensor-parallel-size 2 \
    --max-model-len 69632 \
    --gpu-memory-utilization 0.90 \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --chat-template examples/tool_chat_template_gemma4.jinja \
    --limit-mm-per-prompt '{"image":0,"audio":0}' \
    --async-scheduling
GSM8K Platinum (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
    --tasks gsm8k_platinum_cot_llama \
    --model_args "model=RedHatAI/gemma-4-31B-it-FP8-Dynamic,max_length=36096,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
    --num_fewshot 0 \
    --apply_chat_template \
    --output_path results_gsm8k_platinum.json \
    --seed 1234 \
    --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
MMLU-Pro (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
    --tasks mmlu_pro_chat \
    --model_args "model=RedHatAI/gemma-4-31B-it-FP8-Dynamic,max_length=36096,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
    --num_fewshot 0 \
    --apply_chat_template \
    --output_path results_mmlu_pro.json \
    --seed 1234 \
    --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
IFEval (lm-eval, 0-shot, 3 repetitions)
lm_eval --model local-chat-completions \
    --tasks ifeval \
    --model_args "model=RedHatAI/gemma-4-31B-it-FP8-Dynamic,max_length=36096,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
    --num_fewshot 0 \
    --apply_chat_template \
    --output_path results_ifeval.json \
    --seed 1234 \
    --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
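The lm-eval commands above show a single seed (1234). One way to generate the same command for each of the three stated seeds is sketched below. It assumes that both `--seed` and the `seed` entry in `--gen_kwargs` change together between repetitions, and the per-seed output file names are an illustrative choice rather than the paths used for the reported results.

```python
import shlex

MODEL = "RedHatAI/gemma-4-31B-it-FP8-Dynamic"
SEEDS = [1234, 2345, 3456]
TASKS = ["gsm8k_platinum_cot_llama", "mmlu_pro_chat", "ifeval"]


def lm_eval_command(task: str, seed: int) -> list:
    """Build the lm_eval argument list for one task and one seed."""
    model_args = (
        f"model={MODEL},max_length=36096,"
        "base_url=http://0.0.0.0:8000/v1/chat/completions,"
        "num_concurrent=128,max_retries=3,tokenized_requests=False,"
        "tokenizer_backend=None,timeout=1200"
    )
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=32000,seed={seed}"
    )
    return [
        "lm_eval", "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", model_args,
        "--num_fewshot", "0",
        "--apply_chat_template",
        # Assumption: one output file per task and seed.
        "--output_path", f"results_{task}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]


# Print the shell commands; they could also be run with subprocess.run(cmd).
for task in TASKS:
    for seed in SEEDS:
        print(shlex.join(lm_eval_command(task, seed)))
```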
MATH-500, AIME 2025, GPQA Diamond (lighteval, 3 repetitions; 8 for AIME 2025)
litellm_config.yaml:
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-Dynamic
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: 1234
Run once per seed (changing seed in the config each time):
lighteval endpoint litellm litellm_config.yaml 'math_500|0' \
    --output-dir results/ --save-details
lighteval endpoint litellm litellm_config.yaml 'aime25|0' \
    --output-dir results/ --save-details
lighteval endpoint litellm litellm_config.yaml 'gpqa:diamond|0' \
    --output-dir results/ --save-details
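Changing the seed in the config by hand for every repetition is easy to get wrong, so one option is to generate a config file per seed. The sketch below does this with a plain string template. The nesting of `generation_parameters` under `model_parameters` is an assumption about the intended YAML layout, and the per-seed file names and helper functions are hypothetical.

```python
import shlex
import tempfile
from pathlib import Path

SEEDS = [1234, 2345, 3456]

# Assumption: generation_parameters is nested under model_parameters.
CONFIG_TEMPLATE = """\
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-Dynamic
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: {seed}
"""


def write_configs(out_dir: str, seeds=SEEDS) -> list:
    """Write one litellm config per seed and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for seed in seeds:
        path = out / f"litellm_config_seed{seed}.yaml"
        path.write_text(CONFIG_TEMPLATE.format(seed=seed))
        paths.append(path)
    return paths


def lighteval_command(config_path: Path, task: str) -> list:
    """Build the lighteval argument list for one config and one task."""
    return [
        "lighteval", "endpoint", "litellm", str(config_path), task,
        "--output-dir", "results/", "--save-details",
    ]


# Demonstration in a temporary directory; use a real directory in practice.
with tempfile.TemporaryDirectory() as tmp:
    for config in write_configs(tmp):
        print(shlex.join(lighteval_command(config, "math_500|0")))
```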
LiveCodeBench v6 (lighteval, 3 repetitions)
vLLM server:
vllm serve RedHatAI/gemma-4-31B-it-FP8-Dynamic \
    --tensor-parallel-size 2 \
    --max-model-len 36864 \
    --gpu-memory-utilization 0.90 \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --chat-template examples/tool_chat_template_gemma4.jinja \
    --limit-mm-per-prompt '{"image":0,"audio":0}' \
    --async-scheduling
litellm_config.yaml:
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-FP8-Dynamic
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 1200
  concurrent_requests: 256
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 32768
    top_p: 0.95
    top_k: 64
    seed: 1234
Run once per seed:
lighteval endpoint litellm litellm_config.yaml 'lcb:codegeneration_v6|0' \
    --output-dir results/ --save-details