|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- openai/gpt-oss-120b |
|
|
--- |
|
|
|
|
|
# Model Overview |
|
|
|
|
|
- **Model Architecture:** gpt-oss-120b |
|
|
- **Input:** Text |
|
|
- **Output:** Text |
|
|
- **Supported Hardware Microarchitecture:** AMD MI350/MI355 |
|
|
- **ROCm**: 7.2.0 |
|
|
- **PyTorch**: 2.9.0 |
|
|
- **Operating System(s):** Linux |
|
|
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/) |
|
|
- **Model Optimizer:** [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) |
|
|
- **moe** |
|
|
- **Weight quantization:** OCP MXFP4, Static |
|
|
- **Activation quantization:** FP8, Dynamic |
|
|
- **qkvo** |
|
|
- **Weight quantization:** FP8 per_channel, Static |
|
|
- **Activation quantization:** FP8 per_token, Dynamic |
|
|
- **kv-cache** |
|
|
- **Output quantization:** FP8, Static |
|
|
- **softmax** |
|
|
- **Output quantization:** FP8, Static |
|
|
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) |
|
|
|
|
|
This model was built with gpt-oss-120b model by applying [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization. |
|
|
|
|
|
# Model Quantization |
|
|
|
|
|
The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html). The weights are quantized MXFP4 and activations were quantized to FP8. |
|
|
|
|
|
**Quantization scripts:** |
|
|
``` |
|
|
cd Quark/examples/torch/language_modeling/llm_ptq/ |
|
|
exclude_layers="*lm_head *router*" |
|
|
|
|
|
python3 internal_scripts/quantize_quark.py \ |
|
|
--model_dir openai/gpt-oss-120b \ |
|
|
--quant_scheme mxfp4_fp8 \ |
|
|
--layer_quant_scheme *q_proj ptpc_fp8 \ |
|
|
--layer_quant_scheme *k_proj ptpc_fp8 \ |
|
|
--layer_quant_scheme *v_proj ptpc_fp8 \ |
|
|
--layer_quant_scheme *o_proj ptpc_fp8 \ |
|
|
--kv_cache_dtype fp8 \ |
|
|
--attention_dtype fp8 \ |
|
|
--exclude_layers $exclude_layers \ |
|
|
--num_calib_data 512 \ |
|
|
--output_dir amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \ |
|
|
--model_export hf_format \ |
|
|
--multi_gpu |
|
|
``` |
|
|
|
|
|
# Deployment |
|
|
### Use with vLLM |
|
|
|
|
|
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. |
|
|
|
|
|
## Evaluation |
|
|
The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort. |
|
|
|
|
|
### Accuracy |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<td><strong>Benchmark</strong> |
|
|
</td> |
|
|
<td><strong>gpt-oss-120b </strong> |
|
|
</td> |
|
|
<td><strong>gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn(this model)</strong> |
|
|
</td> |
|
|
<td><strong>Recovery</strong> |
|
|
</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>GPQA |
|
|
</td> |
|
|
<td>71.21 |
|
|
</td> |
|
|
<td>71.16 |
|
|
</td> |
|
|
<td>99.93% |
|
|
</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>AIME25 |
|
|
</td> |
|
|
<td>78.61 |
|
|
</td> |
|
|
<td>77.08 |
|
|
</td> |
|
|
<td>98.06% |
|
|
</td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
### Reproduction |
|
|
|
|
|
The results of GPQA Diamond and AIME25 were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with `medium` effort setting, and vLLM docker `rocm/vllm-private:mxfp4_fp8_gpt_oss_native_20251226`. |
|
|
vLLM and AITER are already compiled and pre-installed in the Docker image, there is no need to download or install them again. |
|
|
|
|
|
#### Launching server |
|
|
|
|
|
``` |
|
|
export VLLM_USE_AITER_UNIFIED_ATTENTION=1 |
|
|
export VLLM_ROCM_USE_AITER_MHA=0 |
|
|
export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0 |
|
|
export USE_Q_SCALE=1 |
|
|
|
|
|
vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \ |
|
|
--tensor_parallel_size 2 \ |
|
|
--gpu-memory-utilization 0.90 \ |
|
|
--no-enable-prefix-caching \ |
|
|
--max-num-batched-tokens 1024 \ |
|
|
--kv_cache_dtype='fp8' |
|
|
``` |
|
|
|
|
|
#### Evaluating model in a new terminal |
|
|
``` |
|
|
export OPENAI_API_KEY="EMPTY" |
|
|
|
|
|
python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn --eval gpqa,aime25 --reasoning-effort medium --n-threads 128 |
|
|
``` |
|
|
|
|
|
# License |
|
|
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved. |