---
license: apache-2.0
base_model:
- openai/gpt-oss-120b
---

# Model Overview

- **Model Architecture:** gpt-oss-120b
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.0
- **PyTorch:** 2.9.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html)
  - **moe**
    - **Weight quantization:** OCP MXFP4, Static
    - **Activation quantization:** FP8, Dynamic
  - **qkvo**
    - **Weight quantization:** FP8 per_channel, Static
    - **Activation quantization:** FP8 per_token, Dynamic
  - **kv-cache**
    - **Output quantization:** FP8, Static
  - **softmax**
    - **Output quantization:** FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from gpt-oss-120b by applying [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization.

# Model Quantization

The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html). The MoE weights are quantized to OCP MXFP4 and the attention projection (q/k/v/o) weights to per-channel FP8; activations are quantized to FP8.

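To make the MXFP4 weight format more concrete, the following is a minimal pure-Python sketch of block-wise MXFP4 quantization as described in the OCP Microscaling (MX) specification: 32 elements share one power-of-two scale, and each element is stored as FP4 E2M1. It is an illustration only, not the AMD-Quark implementation. The function names are hypothetical, ties are rounded toward the smaller magnitude rather than to nearest-even, and the exponent range of the shared scale is not clamped.

```python
import math

# Magnitudes representable by FP4 E2M1, the MXFP4 element format.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements sharing one scale in the MX formats
ELEM_EMAX = 2    # exponent of the largest E2M1 power of two (4 = 2**2)


def quantize_block_mxfp4(block):
    """Return (shared_exponent, dequantized_values) for one block of floats.

    Hypothetical helper for illustration; simplified rounding (see lead-in).
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    _, e = math.frexp(max_abs)
    shared_exp = (e - 1) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        # Scale into the element range and saturate at the largest magnitude.
        scaled = min(abs(v) / scale, FP4_E2M1_VALUES[-1])
        # Pick the closest representable magnitude (ties go to the smaller one).
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - scaled))
        out.append(math.copysign(nearest * scale, v))
    return shared_exp, out


def quantize_mxfp4(values):
    """Quantize a flat list block by block and return the dequantized values."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block_mxfp4(values[start:start + BLOCK_SIZE])
        out.extend(deq)
    return out


if __name__ == "__main__":
    exp, deq = quantize_block_mxfp4([6.0, 1.0, -0.5, 0.3])
    print(exp, deq)  # prints: 0 [6.0, 1.0, -0.5, 0.5]
```

Because the scale is shared per block, a single large value coarsens the grid for the other 31 elements, which is one reason small block sizes are used.
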
**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *router*"
python3 internal_scripts/quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --layer_quant_scheme *q_proj ptpc_fp8 \
    --layer_quant_scheme *k_proj ptpc_fp8 \
    --layer_quant_scheme *v_proj ptpc_fp8 \
    --layer_quant_scheme *o_proj ptpc_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --model_export hf_format \
    --multi_gpu
```

# Deployment
### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

## Evaluation

The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort.

### Accuracy

| Benchmark | gpt-oss-120b | gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn (this model) | Recovery |
|---|---|---|---|
| GPQA | 71.21 | 71.16 | 99.93% |
| AIME25 | 78.61 | 77.08 | 98.06% |

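
The Recovery column appears to be the quantized model's score expressed as a percentage of the baseline score. The card does not state this definition, so treat it as an assumption. A minimal sketch of that calculation, using the scores above and a hypothetical helper name:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


# (baseline, quantized) scores copied from the accuracy table.
scores = {"GPQA": (71.21, 71.16), "AIME25": (78.61, 77.08)}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.2f}%")
```

Values recomputed from the rounded table entries may differ from the published column in the last digit if the published figures were derived from unrounded scores.
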
### Reproduction

The results of GPQA Diamond and AIME25 were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with the `medium` effort setting and the vLLM docker image `rocm/vllm-private:mxfp4_fp8_gpt_oss_native_20251226`.
vLLM and AITER are already compiled and pre-installed in the Docker image, so there is no need to download or install them again.

#### Launching server
```
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0
export USE_Q_SCALE=1

vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --tensor_parallel_size 2 \
    --gpu-memory-utilization 0.90 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 1024 \
    --kv_cache_dtype='fp8'
```

#### Evaluating model in a new terminal
```
export OPENAI_API_KEY="EMPTY"
python -m gpt_oss.evals \
    --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --eval gpqa,aime25 \
    --reasoning-effort medium \
    --n-threads 128
```

# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.