jiaxwang committed on
Commit
76677df
·
verified ·
1 Parent(s): 7b20858

Update README.md

Files changed (1)
  1. README.md +22 -6
README.md CHANGED
@@ -14,16 +14,24 @@ base_model:
14
  - **PyTorch**: 2.9.0
15
  - **Operating System(s):** Linux
16
  - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
17
- - **Model Optimizer:** [AMD-Quark (v0.10)](https://quark.docs.amd.com/latest/index.html)
18
- - **Weight quantization:** OCP MXFP4, Static
19
- - **Activation quantization:** FP8, Dynamic
20
  - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
21
 
22
- This model was built with gpt-oss-120b model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization.
23
 
24
  # Model Quantization
25
 
26
- The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights are quantized MXFP4 and activations were quantized to FP8.
27
 
28
  **Quantization scripts:**
29
  ```
@@ -92,10 +100,16 @@ The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reas
92
  ### Reproduction
93
 
94
  The results of GPQA Diamond and AIME25 were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with `medium` effort setting, and vLLM docker `rocm/vllm-dev:mxfp4_fp8_gpt_oss_native_20251226`.
95
- This version of vllm is
96
 
97
  #### Launching server
98
  ```
99
  vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
100
  --tensor_parallel_size 2 \
101
  --gpu-memory-utilization 0.90 \
@@ -106,6 +120,8 @@ vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
106
 
107
  #### Evaluating model in a new terminal
108
  ```
109
  python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn --eval aime25,gpqa --reasoning-effort medium --n-threads 128
110
  ```
111
 
 
14
  - **PyTorch**: 2.9.0
15
  - **Operating System(s):** Linux
16
  - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
17
+ - **Model Optimizer:** [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html)
18
+ - **moe**
19
+   - **Weight quantization:** OCP MXFP4, Static
20
+   - **Activation quantization:** FP8, Dynamic
21
+ - **qkvo**
22
+   - **Weight quantization:** FP8 per_channel, Static
23
+   - **Activation quantization:** FP8 per_token, Dynamic
24
+ - **kv-cache**
25
+   - **Output quantization:** FP8, Static
26
+ - **softmax**
27
+   - **Output quantization:** FP8, Static
28
  - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
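The `per_channel` and `per_token` labels in the qkvo entries describe how many scale factors are used and when they are computed. The following is a minimal sketch of that idea in plain Python, not the AMD-Quark implementation. It assumes that FP8 means the E4M3 format (largest finite value 448) and that weight rows correspond to output channels; `row_scales` and `scale_rows` are hypothetical helper names, and the final rounding onto the FP8 grid is omitted.

```python
# Sketch only: assumes FP8 is E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def row_scales(matrix):
    """Return one scale per row, mapping the row's max |x| onto the FP8 range."""
    scales = []
    for row in matrix:
        amax = max(abs(x) for x in row)
        # Guard all-zero rows so the later division is safe.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(matrix, scales):
    """Divide each row by its scale; values then lie within [-448, 448]."""
    return [[x / s for x in row] for row, s in zip(matrix, scales)]


# Static, per-channel: rows are output channels (an assumed layout),
# and the scales are computed once, offline, from the weights.
weights = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
w_scales = row_scales(weights)
scaled_weights = scale_rows(weights, w_scales)

# Dynamic, per-token: rows are tokens, and the scales are recomputed
# from the live activations on every forward pass.
activations = [[10.0, -3.0, 0.5], [0.1, 0.2, 0.3]]
a_scales = row_scales(activations)
scaled_activations = scale_rows(activations, a_scales)
```

Giving every row its own scale keeps a row with small values from being crushed by a large value elsewhere in the tensor, which is the usual motivation for per-channel and per-token schemes over a single per-tensor scale.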
29
 
30
+ This model was built from the gpt-oss-120b model by applying [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization.
31
 
32
  # Model Quantization
33
 
34
+ The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html). The moe weights are quantized to MXFP4, the qkvo weights to FP8, and the activations to FP8.
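To make the MXFP4 weight format concrete, here is a rough sketch of block quantization as described in the OCP Microscaling spec: 32 elements share one power-of-two scale, and each element is stored as FP4 (E2M1). This is illustrative only and is not how AMD-Quark implements it. The function name `quantize_block` is hypothetical, rounding is simplified to nearest-value (ties resolve toward the smaller magnitude), and the limited range of the 8-bit shared exponent is ignored.

```python
import math

# OCP MX layout: 32-element blocks, one shared power-of-two scale per block,
# and FP4 E2M1 elements with the magnitudes listed below.
BLOCK_SIZE = 32
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest power of two in E2M1 (4.0 == 2**2)


def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Choose the shared exponent so the largest element lands near the top
    # of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for x in block:
        # Saturate at the largest representable magnitude (6.0).
        magnitude = min(abs(x) / scale, E2M1_GRID[-1])
        # Simplified rounding: pick the closest grid point.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest * scale, x))
    return shared_exp, dequantized


# Usage: quantize one block of 32 illustrative weights.
example_block = [0.25 * i for i in range(BLOCK_SIZE)]
example_exp, example_values = quantize_block(example_block)
```

Because the scale is a power of two shared by only 32 values, the format adapts to local magnitude at a storage cost of 8 scale bits per block on top of 4 bits per element.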
35
 
36
  **Quantization scripts:**
37
  ```
 
100
  ### Reproduction
101
 
102
  The results of GPQA Diamond and AIME25 were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with `medium` effort setting, and vLLM docker `rocm/vllm-dev:mxfp4_fp8_gpt_oss_native_20251226`.
103
+ vLLM and Aiter are already compiled and pre-installed in the Docker image, so there is no need to download or install them again.
104
 
105
  #### Launching server
106
+
107
  ```
108
+ export VLLM_USE_AITER_UNIFIED_ATTENTION=1
109
+ export VLLM_ROCM_USE_AITER_MHA=0
110
+ export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0
111
+ export USE_Q_SCALE=1
112
+
113
  vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
114
  --tensor_parallel_size 2 \
115
  --gpu-memory-utilization 0.90 \
 
120
 
121
  #### Evaluating model in a new terminal
122
  ```
123
+ export OPENAI_API_KEY="EMPTY"
124
+
125
  python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn --eval aime25,gpqa --reasoning-effort medium --n-threads 128
126
  ```
127
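Because the evaluation runs in a second terminal, it can help to confirm that the server has finished loading before starting it. The sketch below polls the OpenAI-compatible `/v1/models` endpoint that `vllm serve` exposes. It assumes the server is reachable at `http://localhost:8000/v1` (vLLM's default port); adjust `BASE_URL` if the serve command sets a different host or port. `served_model_ids` and `server_ready` are hypothetical helper names, not part of vLLM or gpt_oss.

```python
import json
import urllib.request

# Assumption: the server listens on vLLM's default address.
BASE_URL = "http://localhost:8000/v1"
MODEL = "amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn"


def served_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in json.loads(payload)["data"]]


def server_ready(model, base_url=BASE_URL, timeout=5.0):
    """Return True if the endpoint answers and lists the given model."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return model in served_model_ids(resp.read())
    except (OSError, ValueError, KeyError):
        # Connection refused, timeout, or an unexpected response body.
        return False
```

Calling `server_ready(MODEL)` in a short retry loop before launching `python -m gpt_oss.evals` avoids an evaluation run that fails immediately because the weights are still loading.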