Model Overview

  • Model Architecture: Granite-4.0-h-small
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355/MI300
  • ROCm: 7.0
  • Operating System(s): Linux
  • Inference Engine: vllm
  • Model Optimizer: AMD-Quark
    • Weight quantization: FP8, Static
    • Activation quantization: FP8, Static
  • Calibration Dataset: Pile

This model was built from ibm-granite/granite-4.0-h-small by applying AMD-Quark for FP8 quantization.

Model Quantization

The model was quantized from ibm-granite/granite-4.0-h-small using AMD-Quark. Both weights and activations were quantized to FP8 format.
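To give a sense of what casting to FP8 (E4M3) does to individual values, below is a minimal, self-contained sketch of round-to-nearest onto the E4M3 grid (4 exponent bits, 3 mantissa bits, largest finite value 448). This is an illustration only, not AMD-Quark's implementation; `round_to_e4m3` is a hypothetical helper.

```python
import math

F8_E4M3_MAX = 448.0  # largest finite E4M3 value

def round_to_e4m3(x: float) -> float:
    """Round-to-nearest onto the FP8 E4M3 grid (illustrative sketch only)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = min(abs(x), F8_E4M3_MAX)           # saturate out-of-range values
    e = max(math.floor(math.log2(a)), -6)  # exponent, with subnormal floor at 2^-6
    step = 2.0 ** (e - 3)                  # 3 mantissa bits -> 8 steps per binade
    return sign * round(a / step) * step
```

In static FP8 quantization, a per-tensor scale chosen offline (from the calibration data) maps the tensor into this representable range before the cast, so the saturation above is rarely hit in practice.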

Quantization scripts:

cd Quark/examples/torch/language_modeling

# Set to the base model and the desired output directory (adjust paths as needed).
export MODEL_DIR=ibm-granite/granite-4.0-h-small
export OUT_DIR=granite-4.0-h-small-fp8

# Keep the MoE routers and the output head in higher precision.
exclude_layers="*router.* *lm_head"

python llm_ptq/quantize_quark.py \
    --model_dir $MODEL_DIR \
    --output_dir $OUT_DIR \
    --quant_scheme fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --model_export hf_format \
    --multi_gpu

Evaluation

The model was evaluated on GSM8K.

Scripts:

export MODEL_DIR=granite-4.0-h-small-fp8
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0

lm_eval --model vllm \
    --model_args pretrained=$MODEL_DIR,tensor_parallel_size=1,gpu_memory_utilization=0.75 \
    --tasks gsm8k \
    --trust_remote_code \
    --batch_size 32

Accuracy

Benchmark   ibm-granite/granite-4.0-h-small   granite-4.0-h-small-fp8 (this model)   Recovery
GSM8K       85.60                             84.53                                  98.75%
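Recovery here is simply the quantized score expressed as a fraction of the baseline score:

```python
baseline = 85.60   # GSM8K, ibm-granite/granite-4.0-h-small
quantized = 84.53  # GSM8K, this FP8 model
recovery = quantized / baseline * 100
print(f"{recovery:.2f}%")  # 98.75%
```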

Deployment

Use with vllm

This model can be deployed efficiently using the vLLM inference engine.
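As a sketch (assuming a ROCm build of vLLM is installed, and reusing the tensor-parallel and memory settings from the evaluation above), the model can be served through vLLM's OpenAI-compatible server:

```shell
# Launch an OpenAI-compatible server (assumes vLLM with ROCm support is installed).
vllm serve amd/granite-4.0-h-small-fp8 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75

# Query it from another shell.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "amd/granite-4.0-h-small-fp8", "prompt": "Hello", "max_tokens": 16}'
```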

License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
