jiaxwang commited on
Commit
6d4edc4
·
verified ·
1 Parent(s): 4702ac3

Update README.md

Browse files

draft for gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8

Files changed (1) hide show
  1. README.md +111 -3
README.md CHANGED
@@ -1,3 +1,111 @@
1
- ---
2
- license: mit
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model:
4
+ - openai/gpt-oss-120b
5
+ ---
6
+
7
+ # Model Overview
8
+
9
+ - **Model Architecture:** gpt-oss-120b
10
+ - **Input:** Text
11
+ - **Output:** Text
12
+ - **Supported Hardware Microarchitecture:** AMD MI350/MI355
13
+ - **ROCm**: 6.14.14
14
+ - **Operating System(s):** Linux
15
+ - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
16
+ - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
17
+ - **Weight quantization:** OCP MXFP4, Static
18
+ - **Activation quantization:** FP8, Dynamic
19
+ - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
20
+
21
+ This model was built with gpt-oss-120b model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
22
+
23
+ # Model Quantization
24
+
25
+ The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights are quantized MXFP4 and activations were quantized to FP8.
26
+
27
+ **Quantization scripts:**
28
+ ```
29
+ cd Quark/examples/torch/language_modeling/llm_ptq/
30
+ exclude_layers="*lm_head *self_attn* *router*"
31
+
32
+ python3 internal_scripts/quantize_quark.py \
33
+ --model_dir openai/gpt-oss-120b \
34
+ --quant_scheme mxfp4_fp8 \
35
+ --layer_quant_scheme *q_proj ptpc_fp8 \
36
+ --layer_quant_scheme *k_proj ptpc_fp8 \
37
+ --layer_quant_scheme *v_proj ptpc_fp8 \
38
+ --layer_quant_scheme *o_proj ptpc_fp8 \
39
+ --kv_cache_dtype fp8 \
40
+ --attention_dtype fp8 \
41
+ --exclude_layers $exclude_layers \
42
+ --num_calib_data 512 \
43
+ --output_dir amd/gpt-oss120b-w-mxfp4-a-fp8 \
44
+ --model_export hf_format \
45
+ --multi_gpu
46
+ ```
47
+
48
+ # Deployment
49
+ ### Use with vLLM
50
+
51
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
52
+
53
+ ## Evaluation
54
+ The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort.
55
+
56
+ ### Accuracy
57
+
58
+ <table>
59
+ <tr>
60
+ <td><strong>Benchmark</strong>
61
+ </td>
62
+ <td><strong>gpt-oss-120b </strong>
63
+ </td>
64
+ <td><strong>gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8(this model)</strong>
65
+ </td>
66
+ <td><strong>Recovery</strong>
67
+ </td>
68
+ </tr>
69
+ <tr>
70
+ <td>AIME25
71
+ </td>
72
+ <td>78.47
73
+ </td>
74
+ <td>78.33
75
+ </td>
76
+ <td>99.82%
77
+ </td>
78
+ </tr>
79
+ <tr>
80
+ <td>GPQA
81
+ </td>
82
+ <td>71.86
83
+ </td>
84
+ <td>71.86
85
+ </td>
86
+ <td>100.00%
87
+ </td>
88
+ </tr>
89
+ </table>
90
+
91
+ ### Reproduction
92
+
93
+ The results of AIME25 and GPQA Diamond were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with `medium` effort setting, and vLLM docker `rocm/vllm-dev:mxfp4_fp8_gpt_oss_native_20251226`.
94
+
95
+ #### Launching server
96
+ ```
97
+ vllm serve amd/gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8 \
98
+ --tensor_parallel_size 2 \
99
+ --gpu-memory-utilization 0.90 \
100
+ --no-enable-prefix-caching \
101
+ --max-num-batched-tokens 1024 \
102
+ --kv_cache_dtype='fp8'
103
+ ```
104
+
105
+ #### Evaluating model in a new terminal
106
+ ```
107
+ python -m gpt_oss.evals --model /shareddata/amd/gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8 --eval aime25,gpqa --reasoning-effort medium --n-threads 128
108
+ ```
109
+
110
+ # License
111
+ Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.