---
license: apache-2.0
base_model:
- openai/gpt-oss-120b
---

# Model Overview

- **Model Architecture:** gpt-oss-120b
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 6.14.14
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** FP8, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the gpt-oss-120b model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights were quantized to OCP MXFP4 and the activations to FP8.

**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *self_attn* *router*"

python3 internal_scripts/quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --layer_quant_scheme *q_proj ptpc_fp8 \
    --layer_quant_scheme *k_proj ptpc_fp8 \
    --layer_quant_scheme *v_proj ptpc_fp8 \
    --layer_quant_scheme *o_proj ptpc_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss120b-w-mxfp4-a-fp8 \
    --model_export hf_format \
    --multi_gpu
```

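A quick post-export sanity check is to read back the quantization settings recorded with the checkpoint. The snippet below is a minimal sketch, assuming the `hf_format` export writes a `quantization_config` entry into the output directory's `config.json`; the path mirrors the `--output_dir` above and is illustrative.

```
# Minimal sanity check (Python): read back the quantization settings recorded
# with the exported checkpoint. Assumes the hf_format export wrote a
# `quantization_config` entry into config.json under the --output_dir above.
import json

with open("amd/gpt-oss120b-w-mxfp4-a-fp8/config.json") as f:
    config = json.load(f)

# Show the global scheme, per-layer overrides, and excluded layers so they can
# be compared against the command line used for quantization.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```
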
# Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

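Once a server is running (see the "Launching server" command under Reproduction below), the model can be queried through vLLM's OpenAI-compatible API. The following is a minimal sketch, assuming the server listens on the default port 8000; the prompt and sampling settings are illustrative only.

```
# Minimal sketch (Python, requires the `openai` client package): query the
# vLLM server through its OpenAI-compatible endpoint. Assumes the default
# http://localhost:8000/v1; prompt and sampling settings are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="amd/gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
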
## Evaluation

The model was evaluated on the AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort.

### Accuracy

<table>
  <tr>
    <td><strong>Benchmark</strong></td>
    <td><strong>gpt-oss-120b</strong></td>
    <td><strong>gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8 (this model)</strong></td>
    <td><strong>Recovery</strong></td>
  </tr>
  <tr>
    <td>AIME25</td>
    <td>78.47</td>
    <td>78.33</td>
    <td>99.82%</td>
  </tr>
  <tr>
    <td>GPQA Diamond</td>
    <td>71.86</td>
    <td>71.86</td>
    <td>100.00%</td>
  </tr>
</table>

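Recovery is the quantized model's score expressed as a percentage of the original gpt-oss-120b score on the same benchmark; for AIME25 this is 78.33 / 78.47 ≈ 99.82%.
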
### Reproduction

The AIME25 and GPQA Diamond results were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with the `medium` reasoning-effort setting and the vLLM docker image `rocm/vllm-dev:mxfp4_fp8_gpt_oss_native_20251226`.

#### Launching server

```
vllm serve amd/gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8 \
    --tensor_parallel_size 2 \
    --gpu-memory-utilization 0.90 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 1024 \
    --kv_cache_dtype='fp8'
```

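Here, `--tensor_parallel_size 2` shards the model across two GPUs and `--kv_cache_dtype='fp8'` keeps the KV cache in FP8 to match the quantization recipe above; adjust the tensor-parallel degree and memory utilization for your own hardware.
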
#### Evaluating the model in a new terminal

```
python -m gpt_oss.evals --model /shareddata/amd/gpt-oss120b-moe_w-mxfp4-a-fp8-attn_ptpc-kv-soft_fp8 --eval aime25,gpqa --reasoning-effort medium --n-threads 128
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.