## Model Overview
- Model Architecture: Granite-4.0-h-small
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355/MI300
- ROCm: 7.0
- Operating System(s): Linux
- Inference Engine: vllm
- Model Optimizer: AMD-Quark
- Weight quantization: FP8, Static
- Activation quantization: FP8, Static
- Calibration Dataset: Pile
This model was built from the ibm-granite/granite-4.0-h-small model by applying AMD-Quark FP8 quantization.
## Model Quantization
The model was quantized from ibm-granite/granite-4.0-h-small using AMD-Quark. Both weights and activations were quantized to FP8 format.
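With static FP8 quantization, the scale is fixed offline from calibration statistics instead of being recomputed per batch at inference time. The snippet below is a minimal per-tensor sketch of that idea in NumPy; it is illustrative only and is not AMD-Quark's implementation (the `FP8_E4M3_MAX` value of 448 is the largest magnitude representable in OCP FP8 E4M3).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in OCP FP8 E4M3


def compute_static_scale(calib_tensors):
    """Derive one per-tensor scale from calibration data (the 'static' part)."""
    amax = max(np.abs(t).max() for t in calib_tensors)
    return amax / FP8_E4M3_MAX


def quantize_fp8(x, scale):
    """Map into the FP8 range and clip; real FP8 also rounds to the E4M3 grid."""
    return np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)


def dequantize_fp8(q, scale):
    return q * scale


rng = np.random.default_rng(0)
calib = [rng.standard_normal((16, 16)).astype(np.float32) for _ in range(4)]
scale = compute_static_scale(calib)

x = calib[0]
x_hat = dequantize_fp8(quantize_fp8(x, scale), scale)
# values inside the calibrated range round-trip closely (no E4M3 rounding simulated here)
```

In the real pipeline the calibration tensors come from running the Pile calibration set through the model, and the rounding to the E4M3 grid is what introduces the small accuracy gap measured below.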
Quantization script:

```shell
cd Quark/examples/torch/language_modeling
exclude_layers="*router.* *lm_head"
python llm_ptq/quantize_quark.py \
    --model_dir $MODEL_DIR \
    --output_dir $OUT_DIR \
    --quant_scheme fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --model_export hf_format \
    --multi_gpu
```
## Evaluation
The model was evaluated on GSM8K.
Scripts:

```shell
export MODEL_DIR=granite-4.0-h-small-fp8
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0
lm_eval --model vllm \
    --model_args pretrained=$MODEL_DIR,tensor_parallel_size=1,gpu_memory_utilization=0.75 \
    --tasks gsm8k \
    --trust_remote_code \
    --batch_size 32
```
## Accuracy
| Benchmark | ibm-granite/granite-4.0-h-small | ibm-granite/granite-4.0-h-small-fp8 (this model) | Recovery |
|---|---|---|---|
| GSM8K | 85.60 | 84.53 | 98.75% |
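Recovery is the quantized model's score expressed as a fraction of the baseline score. A quick check of the figure above:

```python
baseline = 85.60   # GSM8K accuracy of the BF16 base model
quantized = 84.53  # GSM8K accuracy of this FP8 model

recovery = quantized / baseline * 100
print(f"{recovery:.2f}%")  # → 98.75%
```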
## Deployment
### Use with vLLM
This model can be deployed efficiently using the vLLM inference engine.
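One common way to serve the model is vLLM's OpenAI-compatible server. The launch fragment below is illustrative: the flags mirror the evaluation settings above, and you should adjust parallelism and memory utilization to your hardware.

```shell
export VLLM_USE_V1=1
vllm serve amd/granite-4.0-h-small-fp8 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75 \
    --trust-remote-code
```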
## License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.