File size: 3,812 Bytes
6d4edc4
 
 
 
 
 
 
 
 
 
 
 
7b20858
 
6d4edc4
 
76677df
 
 
 
 
 
 
 
 
 
 
6d4edc4
 
76677df
6d4edc4
 
 
76677df
6d4edc4
 
 
 
7b20858
6d4edc4
 
 
 
 
 
 
 
 
 
 
 
8f07720
6d4edc4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8f07720
6d4edc4
 
 
 
 
7b20858
6d4edc4
7b20858
6d4edc4
7b20858
6d4edc4
7b20858
6d4edc4
 
 
7b20858
6d4edc4
7b20858
6d4edc4
7b20858
6d4edc4
7b20858
6d4edc4
 
 
 
 
 
3e5d9eb
 
6d4edc4
 
76677df
6d4edc4
76677df
 
 
 
 
8f07720
6d4edc4
 
 
 
 
 
 
 
 
76677df
 
3e5d9eb
6d4edc4
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
license: apache-2.0
base_model:
- openai/gpt-oss-120b
---

# Model Overview

- **Model Architecture:** gpt-oss-120b
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.2.0
- **PyTorch**: 2.9.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html)
  - **moe**
    - **Weight quantization:** OCP MXFP4, Static
    - **Activation quantization:** FP8, Dynamic
  - **qkvo**
    - **Weight quantization:** FP8 per_channel, Static
    - **Activation quantization:** FP8 per_token, Dynamic
  - **kv-cache**
    - **Output quantization:** FP8, Static
  - **softmax**
    - **Output quantization:** FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built with gpt-oss-120b model by applying [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization.

# Model Quantization

The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html). The weights are quantized MXFP4 and activations were quantized to FP8. 

**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *router*"

python3 internal_scripts/quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --layer_quant_scheme *q_proj ptpc_fp8 \
    --layer_quant_scheme *k_proj ptpc_fp8 \
    --layer_quant_scheme *v_proj ptpc_fp8 \
    --layer_quant_scheme *o_proj ptpc_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --model_export hf_format \
    --multi_gpu
```

# Deployment
### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

## Evaluation
The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort. 

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>gpt-oss-120b </strong>
   </td>
   <td><strong>gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn(this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GPQA 
   </td>
   <td>71.21
   </td>
   <td>71.16
   </td>
   <td>99.93%
   </td>
  </tr>
  <tr>
   <td>AIME25 
   </td>
   <td>78.61
   </td>
   <td>77.08
   </td>
   <td>98.06%
   </td>
  </tr>
</table>

### Reproduction

The results of GPQA Diamond and AIME25 were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with `medium` effort setting, and vLLM docker `rocm/vllm-private:mxfp4_fp8_gpt_oss_native_20251226`.
vLLM and AITER are already compiled and pre-installed in the Docker image, there is no need to download or install them again.

#### Launching server

```
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0
export USE_Q_SCALE=1

vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
  --tensor_parallel_size 2 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 1024 \
  --kv_cache_dtype='fp8'
```

#### Evaluating model in a new terminal
```
export OPENAI_API_KEY="EMPTY"

python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn --eval gpqa,aime25 --reasoning-effort medium --n-threads 128
```

# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.