---
license: apache-2.0
base_model:
- openai/gpt-oss-120b
---
# Model Overview
- **Model Architecture:** gpt-oss-120b
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.2.0
- **PyTorch**: 2.9.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html)
- **moe**
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** FP8, Dynamic
- **qkvo**
- **Weight quantization:** FP8 per_channel, Static
- **Activation quantization:** FP8 per_token, Dynamic
- **kv-cache**
- **Output quantization:** FP8, Static
- **softmax**
- **Output quantization:** FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the gpt-oss-120b model by applying [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization.
# Model Quantization
The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html). The MoE weights were quantized to OCP MXFP4 and the activations to FP8.
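For intuition, MXFP4 groups weights into blocks of 32 elements that share one power-of-two (E8M0) scale, with each element stored as a signed FP4 e2m1 value. Below is a minimal NumPy sketch of the quantize–dequantize round trip; it is illustrative only, and Quark's actual kernels, rounding modes, and scale selection may differ:

```python
import numpy as np

# Representable FP4 e2m1 magnitudes (plus a sign bit) per the OCP MX spec.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(w: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate MXFP4 weight quantization on a 1-D array (length % block == 0)."""
    blocks = w.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared E8M0 (power-of-two) scale per block:
    # exponent = floor(log2(amax)) - emax_elem, where emax_elem = 2
    # for e2m1 (its largest magnitude is 6.0 = 1.5 * 2^2).
    exp = np.floor(np.log2(np.maximum(amax, 2.0**-126))) - 2
    scale = 2.0 ** exp
    scaled = blocks / scale
    # Round each scaled element to the nearest representable e2m1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1).argmin(axis=-1)
    q = np.sign(scaled) * FP4_E2M1[idx]
    return (q * scale).reshape(w.shape)
```

Each 32-element block thus stores 32 four-bit codes plus one shared 8-bit exponent, i.e. roughly 4.25 bits per weight.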
**Quantization scripts:**
```shell
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *router*"
python3 internal_scripts/quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --layer_quant_scheme "*q_proj" ptpc_fp8 \
    --layer_quant_scheme "*k_proj" ptpc_fp8 \
    --layer_quant_scheme "*v_proj" ptpc_fp8 \
    --layer_quant_scheme "*o_proj" ptpc_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --model_export hf_format \
    --multi_gpu
```
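The `ptpc_fp8` scheme applied to the q/k/v/o projections above quantizes weights per channel (static) and activations per token (dynamic): each token's scale is recomputed from its absolute max at runtime, so no activation calibration statistics are stored. A rough NumPy sketch of the per-token FP8 (e4m3) activation path, with simplified e4m3 rounding and illustrative names:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite OCP FP8 e4m3 value

def fp8_per_token_quant_dequant(x: np.ndarray):
    """x: (num_tokens, hidden_dim). Returns dequantized x and per-token scales."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX  # one scale per token (row)
    v = x / scale
    # Round to the e4m3 grid: 3 mantissa bits give spacing 2^(e-3) for
    # magnitudes in [2^e, 2^(e+1)); subnormals (|v| < 2^-6) use 2^-9 spacing.
    e = np.floor(np.log2(np.maximum(np.abs(v), 2.0**-6)))
    q = 2.0 ** (e - 3)
    v_q = np.clip(np.round(v / q) * q, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return v_q * scale, scale
```

Because the scale is per token rather than per tensor, one outlier token cannot crush the resolution of every other token in the batch.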
# Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
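For example, once an OpenAI-compatible server is up, the model can be queried over the standard chat completions endpoint. A minimal sketch, assuming the server listens on the default `localhost:8000` (the serve flags here are illustrative; the Reproduction section below lists the exact configuration used for evaluation):

```shell
# Start the server.
vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --tensor_parallel_size 2

# Query it from another terminal via the OpenAI-compatible API.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn",
          "messages": [{"role": "user", "content": "Explain MXFP4 in one sentence."}],
          "max_tokens": 128
        }'
```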
## Evaluation
The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort.
### Accuracy
| Benchmark | gpt-oss-120b | gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn (this model) | Recovery |
|---|---|---|---|
| GPQA Diamond | 71.21 | 71.16 | 99.93% |
| AIME25 | 78.61 | 77.08 | 98.06% |
### Reproduction
The GPQA Diamond and AIME25 results were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with the `medium` reasoning-effort setting and the vLLM docker image `rocm/vllm-private:mxfp4_fp8_gpt_oss_native_20251226`.
vLLM and AITER are already compiled and pre-installed in the Docker image; there is no need to download or install them again.
#### Launching server
```shell
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0
export USE_Q_SCALE=1
vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --tensor_parallel_size 2 \
    --gpu-memory-utilization 0.90 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 1024 \
    --kv_cache_dtype fp8
```
#### Evaluating the model in a new terminal
```shell
export OPENAI_API_KEY="EMPTY"
python -m gpt_oss.evals \
    --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --eval gpqa,aime25 \
    --reasoning-effort medium \
    --n-threads 128
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.